Water | Free Full-Text | Next-Generation DNA Barcoding for Fish Identification Using High-Throughput Sequencing in Tai Lake, China

1. Introduction

It is well recognized that the loss of biodiversity caused by environmental deterioration has adverse ecological consequences [1]; for example, nutrient over-enrichment in lakes leads to the simplification of flora and fauna [2]. Biodiversity conservation is a common tool for sustaining the health of an ecosystem [3]. Species delimitation, especially identifying functional species, is the first step to understanding the relationship between biodiversity and ecological services [4,5,6].
Conventional taxonomy uses morphological methods to describe species, and misidentification often occurs because of phenotypic plasticity and differing life stages [7,8]. In addition, traditional fish surveys generally involve capturing organisms, are invasive for the biological community under study, and conflict with the original intention of biodiversity conservation. The DNA barcode, usually a short and standardized sequence, has emerged as a cost-effective proxy for species identification [9,10], which significantly improves the efficiency of biomonitoring [11,12]. However, incomplete reference databases of DNA barcodes, especially for native species, are widely recognized as a major obstacle to the use of molecular-based methods.
Building a barcode database first requires a PCR step, which amplifies not only the target region but also unintended fragments. Sanger sequencing is then used to obtain the target barcode sequence. Although Sanger sequencing is very popular and low-cost, it can only provide a single sequence for each specimen. That single sequence can be the PCR product of co-amplified contaminant DNA and may not represent the ‘true’ barcode sequence, which increases the risk of failure [13].
High-throughput sequencing (HTS) allows for sequencing millions of DNA fragments in parallel and provides an opportunity to reveal the sequence composition of the PCR product [14]. In addition, HTS generates multiple sequences for a single specimen and thus provides an opportunity to identify contamination. HTS barcoding not only enhances the accuracy of specimen identification but also accelerates the process of barcode capture [14,15]. The Ion Torrent sequencing platform, in particular, can provide tens of millions of sequences within 10 h for the barcode testing of hundreds of specimens [16,17]. It greatly improves the efficiency of database construction and reduces the cost. In this study, we established a DNA barcoding library of fish from Tai Lake by using HTS and analyzed the cytochrome c oxidase I (COI) barcode gap among fish. The library is intended to provide a database for fishery resource surveillance and to promote more careful measures in the conservation of fishery biodiversity.

3. Results

A total of 149,892 raw reads were generated by HTS (Figure 3). The raw reads ranged in length from 19 to 420 bp. After quality filtering and trimming of MIDs and primers, 108,689 sequences remained, and an average of 603 sequences was assigned to each specimen. Following the removal of indel, contamination, and substitution errors, a total of 34,733 error-free reads remained. More than 64% of the error reads were generated by indels, and fewer than 500 reads were contamination reads, including fish cross-contamination and parasites. In each specimen, the number of PGM sequencing errors was proportional to the number of sequencing reads (Figure 4). Owing to primer degeneracy, some specimens had few raw reads. Generally, at least 100 error-free reads could be obtained by PGM sequencing per specimen (Figure 4). However, if fish tissues were contaminated by parasites, the specimen was judged a failure to obtain the “true” barcode. A total of 9% and 26% of specimens failed to yield “true” barcode sequences with HTS and Sanger sequencing, respectively (Figure 4). Finally, 163 “true” barcode sequences were recovered by PGM sequencing. All barcodes were of the full 313 bp length (Figure 3). Only one unique sequence remained for each specimen, meaning that an average of 231 error-free reads was assigned to each “true” barcode.
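The read accounting in the paragraph above can be checked with a few lines of arithmetic. The following is a minimal sketch that uses only the counts reported in the text; the variable names are illustrative, and the implied specimen count is a derived estimate rather than a figure stated in the paper.

```python
# Read counts as reported in the Results paragraph above.
raw_reads = 149_892
filtered_reads = 108_689        # after quality filtering and MID/primer trimming
error_free_reads = 34_733       # after indel, contamination, and substitution removal
mean_reads_per_specimen = 603   # mean number of filtered reads assigned to a specimen

# Fraction of raw reads that survive quality filtering and trimming.
retained_after_filtering = filtered_reads / raw_reads

# Reads discarded as indel, contamination, or substitution errors.
error_reads = filtered_reads - error_free_reads

# Fraction of filtered reads that are error free.
error_free_fraction = error_free_reads / filtered_reads

# Derived estimate (not stated in the text): number of specimens sequenced.
implied_specimens = filtered_reads / mean_reads_per_specimen

print(f"retained after filtering: {retained_after_filtering:.1%}")
print(f"error reads:              {error_reads:,}")
print(f"error-free fraction:      {error_free_fraction:.1%}")
print(f"implied specimens:        {implied_specimens:.0f}")
```

Under these numbers, about 72.5% of raw reads pass filtering, 73,956 reads are discarded as errors, and roughly 32% of the filtered reads are error free. The implied total of about 180 specimens means that the 163 recovered barcodes correspond to a success rate of roughly 90%, which is broadly in line with the reported HTS failure rate of 9%.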
All 163 barcodes recovered by HTS belonged to 33 of the 34 a priori morphologically identified species; no barcode could be recovered for Tridentiger bifasciatus. As expected, K2P-based genetic variation increased hierarchically from within species (mean = 0.3%, SE = 0.03), to within genera (mean = 3.82%, SE = 0.23), and to within families (mean = 17.40%, SE = 0.09) (Table 1). When nMDS was used to reduce the dimensionality of the genetic distances, specimens of the same species clustered together in the two-dimensional plot. Cypriniformes was separated from the other orders. Among specimens in the order Cypriniformes, the Acheilognathus family was separated from other families (Figure 5). Overall, a comparison between the maximum intraspecific distance and the minimum interspecific distance demonstrated that a barcode gap existed in all of the analyzed specimens (Figure 6).
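To make the distance measure and the barcode gap concrete, the sketch below computes the Kimura 2-parameter (K2P) distance, d = -1/2 ln(1 - 2P - Q) - 1/4 ln(1 - 2Q), where P and Q are the proportions of transitions and transversions, and then compares the maximum intraspecific distance with the minimum interspecific distance for each species. This is an illustration under stated assumptions, not the pipeline used in the study: the function names, the toy 20 bp sequences, and the species labels are hypothetical, and the sequences are assumed to be already aligned.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq1, seq2):
    """Kimura 2-parameter distance between two aligned, equal-length sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = compared = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguous bases
        compared += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1      # purine<->purine or pyrimidine<->pyrimidine
        else:
            transversions += 1    # purine<->pyrimidine
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared
    q = transversions / compared
    # math.log raises ValueError for saturated pairs where the argument is <= 0.
    return -0.5 * math.log(1 - 2 * p - q) - 0.25 * math.log(1 - 2 * q)


def barcode_gap(barcodes):
    """Map each species to (max intraspecific, min interspecific) K2P distance.

    `barcodes` is a list of (species, aligned_sequence) pairs.
    """
    gaps = {}
    for species in {sp for sp, _ in barcodes}:
        intra, inter = [0.0], []  # a single-specimen species has intra = 0
        for i, (sp_i, seq_i) in enumerate(barcodes):
            if sp_i != species:
                continue
            for j, (sp_j, seq_j) in enumerate(barcodes):
                if i == j:
                    continue
                d = k2p_distance(seq_i, seq_j)
                (intra if sp_j == species else inter).append(d)
        gaps[species] = (max(intra), min(inter) if inter else None)
    return gaps


# Hypothetical toy data: two specimens of "species_A", one of "species_B".
toy = [
    ("species_A", "ACGTACGTACGTACGTACGT"),
    ("species_A", "ACGTACGTACGTACGTACGC"),  # one transition from the first
    ("species_B", "TCGTTCGTTCGTTCGTACGT"),  # four transversions from the first
]
gaps = barcode_gap(toy)
for species, (max_intra, min_inter) in sorted(gaps.items()):
    print(species, f"max intra = {max_intra:.4f}", f"min inter = {min_inter:.4f}")
```

A species shows a barcode gap when its maximum intraspecific distance is smaller than its minimum interspecific distance, which is the comparison summarized in Figure 6.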
The phylogenetic tree based on the K2P distance contained 33 species clusters. The OTUs produced by the ABGD method are indicated by red circles outside the NJ tree (Figure 7). The ABGD analysis produced nine initial partitions. The number of groups ranged from 32 to 49 as the p distance decreased from 0.059948 to 0.001000. The partition with 34 OTUs (p distance = 0.012915) was chosen as the threshold for delimiting species boundaries because it was concordant with the outcome of the NJ analysis. Of these 34 OTUs, two species, Misgurnus anguillicaudatus and Monopterus albus, were each split into two putative species, whereas Megalobrama amblycephala and Megalobrama skolkovii were clustered into one candidate species. The other species boundaries represented by OTUs were concordant with the morphologically identified species.

4. Discussion

Some major questions in ecology, such as what constitutes the dietary range of a fish species and how ecological communities assemble, are hampered by the laborious nature of traditional morphological identification [34,35]. DNA barcoding, integrating ecological, morphological, and genetic data, is anticipated to bring about a renaissance of taxonomy [36]. The barcode library of fish species established here not only provides an effective tool for identifying fish communities in Tai Lake [37] but also allows the dietary range of some functional fish species to be measured accurately.
Sanger sequencing is the dominant approach for obtaining barcode sequences and has been applied to establish a wide range of barcode libraries, from phytoplankton to vertebrates [11,37,38,39,40]. However, amplification failure and co-amplification of non-target sequences often occur and decrease the efficiency and accuracy of capturing the “true” barcode [41]. In this study, 26% of the analyzed specimens could not be sequenced by Sanger sequencing. When the HTS approach was used instead, the failure rate was reduced to 9%. Because “touch-down” PCR was performed, the HTS approach could obtain sequences from some specimens with low amplicon concentrations, for which no measurable fluorescence was detected in the electropherogram [42]. The HTS approach thus increased the sensitivity of DNA barcode capture. A previous study, in which an average of 143 sequences per specimen was generated by the HTS approach, supports our findings [15].
The Ion Torrent PGM sequencer was chosen because it is faster than other sequencers, such as the Illumina HiSeq and 454 Junior [17]. However, every sequencer introduces errors into the reads. Because the Ion Torrent platform is based on pH flow-call technology, indel errors occur at very high frequencies during massively parallel sequencing [43]. A SEAME tool was used to remove the error reads during bioinformatic processing [44,45]. We chose an irreproachable COI fragment, sequenced bidirectionally by the Sanger approach, as a gold-standard template for BLAST against all reads generated by the PGM [28]. Indel reads accounted for 64% of the total reads. In a previous study, the error rate was as high as 90%. The difference in the error rate may be due to differences in the sequencing kit used or in the chip density [43].
The biggest advantage of HTS is the acquisition of hundreds of reads in parallel [14], which allows the sequence composition of a single specimen to be examined thoroughly, much as sequencing multiple clones does with the Sanger approach [46]. Previous studies have found that the “true” barcode was often confounded by endosymbiotic bacteria (e.g., Wolbachia) [47], cross-contamination [15], and heteroplasmy [41]. In this study, one instance of parasite infection was detected in a species of Cyprinidae. To our knowledge, this is the first time that parasites have been identified as an error source compromising the DNA barcoding of fish species. Cross-contamination was detected in all of the analyzed specimens. This unintended fish DNA may have come from contact among specimens held in a single fish net, as is common in traditional fishery inventories [48]. The Sanger-based barcoding method cannot discriminate cross-contamination unless enough clones are sequenced, which is laborious; HTS resolves this easily. During bioinformatic processing, no unique sequence of medium or high frequency was detected; that is, no heteroplasmy was detected in any specimen. In addition, the HTS process eliminates the need for post-cloning sequencing, simplifies the procedure for obtaining barcodes, and reduces the cost of library construction by at least 30%.
This is the first comprehensive molecular assessment of the fish species in Tai Lake, covering the vast majority of species known to date. The criterion for examining the discrimination power of DNA barcoding is the comparison of intraspecific and interspecific genetic distances [39], which were in the range of 0% to 9.35% and 0.32% to 23.9%, respectively, in this study. Here, the barcode gap analysis was 100% successful in all of the analyzed specimens. In previous studies, the species discrimination rate ranged from 88% to 95% [49,50,51,52]. The failure to differentiate some species in those studies may be due to cryptic species, which lead to incongruence between barcode and morphological identification [53]; another reason may be haplotype sharing caused by hybridization among species [37]. In any case, the 100% discrimination rate in this study demonstrated perfect congruence between the barcode sequences and the morphological taxonomy.
The mean intraspecific K2P distance of 0.3% (SE = 0.03) calculated here is similar to that reported for fish populations in the Nujiang River in Southwest China [37] and is in accordance with the value of below 1% reported when COI was used as a barcoding marker [54,55]. Of the 33 species analyzed here, extremely high intraspecific distances, with a maximum of 9.35%, were found in two species, Misgurnus anguillicaudatus and Monopterus albus, which displayed deep divergence in the NJ tree.
The barcode library was established for the purpose of identifying unknown fish specimens captured from Tai Lake in the future. Several prediction models have been designed for delimiting species based on the similarity of barcode sequences. The “10-fold rule”, under which barcode sequences are assigned to different species when they differ by more than 10 times the average intraspecific distance, was introduced as a standard threshold [56]. In this study, the average K2P distance between congeneric species was 13-fold that of the overall intraspecific distance. This ratio was lower than those reported for other fish barcoding studies, where the value ranged from 15-fold to 70-fold [38,39,51,57].
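The 13-fold figure can be reproduced from the mean distances given in the Results. The sketch below assumes that the within-genus mean of 3.82% is the "average K2P distance between congeneric species" referred to here; the variable names are illustrative.

```python
# Mean K2P distances (in percent) as reported in the Results (Table 1).
mean_intraspecific = 0.30   # within species
mean_congeneric = 3.82      # within genera, assumed to be the congeneric mean

# Species-level threshold under the 10-fold rule.
tenfold_threshold = 10 * mean_intraspecific

# Ratio of congeneric to intraspecific divergence.
fold = mean_congeneric / mean_intraspecific

print(f"10-fold threshold: {tenfold_threshold:.2f}%")
print(f"congeneric / intraspecific: {fold:.1f}-fold")
print("congeneric mean exceeds threshold:", mean_congeneric > tenfold_threshold)
```

The ratio is about 12.7, which rounds to the 13-fold value quoted in the text, and the congeneric mean of 3.82% lies above the 3.0% threshold implied by the 10-fold rule.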
ABGD, run through its web interface, is another important prediction model for generating a first set of species hypotheses [33]. The ABGD method grouped barcode sequences into 34 OTUs. As noted above, the two species Misgurnus anguillicaudatus and Monopterus albus displayed deep divergence in the NJ tree. In contrast, the extremely close distance between the other two species, Megalobrama amblycephala and Megalobrama skolkovii, led to the generation of only one OTU. This was due to the asymmetry of divergent COI sequences [56]. Moreover, the two Megalobrama species differ by only one base across the 313 bp COI sequence. A character-based identification method implemented in the BLOG model could be used to delimit them [58]. Overall, when an unknown specimen is being identified and ABGD is used to group sequences into candidate species, it should be complemented by another taxonomic approach to obtain a 100% success rate in fish species identification [12,59].
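A short calculation shows why a single-base difference cannot separate the two Megalobrama species under a distance criterion, and what a character-based comparison looks at instead. This is a simplified sketch: it treats the chosen ABGD p distance of 0.012915 as a plain pairwise cutoff, which is an assumption and not how ABGD partitions sequences internally, and `diagnostic_sites` with its toy 12 bp sequences is a hypothetical stand-in for illustration, not the BLOG method.

```python
import math

BARCODE_LENGTH = 313     # bp, full COI barcode length in this study
ABGD_CUTOFF = 0.012915   # p distance of the chosen 34-OTU partition


def p_distance(n_differences, length=BARCODE_LENGTH):
    """Uncorrected proportion of differing sites."""
    return n_differences / length


def diagnostic_sites(seq_a, seq_b):
    """Return (position, base_a, base_b) for every site where two aligned
    sequences differ; positions are 0-based."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return [(i, a, b) for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]


# One substitution across 313 bp.
megalobrama_p = p_distance(1)
merged_by_distance = megalobrama_p < ABGD_CUTOFF

# Smallest number of substitutions whose p distance exceeds the cutoff.
min_differences_to_split = math.floor(ABGD_CUTOFF * BARCODE_LENGTH) + 1

print(f"p distance for 1 difference: {megalobrama_p:.4%}")
print("below cutoff:", merged_by_distance)
print("differences needed to exceed cutoff:", min_differences_to_split)

# Hypothetical toy sequences that differ at a single site.
print(diagnostic_sites("ACGTACGTACGT", "ACGTACATACGT"))
```

One difference in 313 bp gives a p distance of about 0.32%, well below the cutoff, and under this simplification at least five substitutions would be needed to exceed it. A character-based approach instead asks whether the differing site itself is fixed between the two species, which is why it can separate taxa that a distance threshold merges.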
