DNA barcode data accurately identify higher taxa

The use of unique DNA sequences as a method for taxonomic identification is no longer fundamentally controversial, even though debate continues on the best markers, methods, and technology to use. Although existing databanks such as GenBank and BOLD, as well as reference taxonomies, are imperfect, in best-case scenarios “barcodes” (whether single or multiple, organelle or nuclear, loci) clearly are an increasingly fast and inexpensive method of identification, especially as compared to manual identification of unknowns by increasingly rare expert taxonomists. Because most species on Earth are undescribed, a complete reference database at the species level is impractical in the near term. The question therefore arises whether unidentified species can, using DNA barcodes, be accurately assigned to more inclusive groups such as genera and families, taxonomic ranks of putatively monophyletic groups for which the global inventory is more complete and stable. We used a carefully chosen test library of CO1 sequences from 49 families, 313 genera, and 816 species of spiders to assess the accuracy of genus- and family-level identifications. We queried each sequence against the entire library with BLAST and retained the top ten hits per query, giving 8160 hits. The percent sequence identity was reported from these hits (PIdent, range 75–100%). Accurate identification (PIdent above which errors totaled less than 5%) occurred for genera at PIdent values > 95 and families at PIdent values ≥ 91, suggesting these as heuristic thresholds for generic and familial identifications in spiders. Accuracy of identification increases with numbers of species/genus and genera/family in the library; above five genera per family and fifteen species per genus all identifications were correct. We propose that using percent sequence identity between conventional barcode sequences may be a feasible and reasonably accurate method to identify animals to family/genus. However, the quality of the underlying database impacts accuracy of results; many outliers in our dataset could be attributed to taxonomic and/or sequencing errors in BOLD and GenBank. It seems that an accurate and complete reference library of families and genera of life could provide accurate higher level taxonomic identifications cheaply and accessibly, within years rather than decades.

DNA barcode data accurately identify higher taxa

10.7287/peerj.preprints.1633 ◽

2016 ◽

Author(s):

Jonathan A Coddington ◽

Ingi Agnarsson ◽

Ren-Chung Cheng ◽

Klemen Čandek ◽

Amy Driskell ◽

...

Keyword(s):

Dna Sequences ◽

Dna Barcode ◽

Accurate Method ◽

Reference Database ◽

Accurate Identification ◽

Sequencing Errors ◽

Sequence Identity ◽

Percent Sequence Identity ◽

Near Term ◽

Accuracy Of Results

DNA barcode data accurately assign higher spider taxa

PeerJ ◽

10.7717/peerj.2201 ◽

2016 ◽

Vol 4 ◽

pp. e2201 ◽

Cited By ~ 12

Author(s):

Jonathan A. Coddington ◽

Ingi Agnarsson ◽

Ren-Chung Cheng ◽

Klemen Čandek ◽

Amy Driskell ◽

...

Keyword(s):

Dna Sequences ◽

Dna Barcode ◽

Accurate Method ◽

Reference Database ◽

Sequencing Errors ◽

Sequence Identity ◽

Percent Sequence Identity ◽

Near Term ◽

Accuracy Of Results ◽

Global Inventory

The use of unique DNA sequences as a method for taxonomic identification is no longer fundamentally controversial, even though debate continues on the best markers, methods, and technology to use. Although existing databanks such as GenBank and BOLD, as well as reference taxonomies, are imperfect, in best-case scenarios “barcodes” (whether single or multiple, organelle or nuclear, loci) clearly are an increasingly fast and inexpensive method of identification, especially as compared to manual identification of unknowns by increasingly rare expert taxonomists. Because most species on Earth are undescribed, a complete reference database at the species level is impractical in the near term. The question therefore arises whether unidentified species can, using DNA barcodes, be accurately assigned to more inclusive groups such as genera and families, taxonomic ranks of putatively monophyletic groups for which the global inventory is more complete and stable. We used a carefully chosen test library of CO1 sequences from 49 families, 313 genera, and 816 species of spiders to assess the accuracy of genus- and family-level assignment. We queried each sequence against the entire library with BLAST and retained the top ten hits per query. The percent sequence identity was reported from these hits (PIdent, range 75–100%). Accurate assignment of higher taxa (PIdent above which errors totaled less than 5%) occurred for genera at PIdent values > 95 and families at PIdent values ≥ 91, suggesting these as heuristic thresholds for accurate generic and familial identifications in spiders. Accuracy of identification increases with numbers of species/genus and genera/family in the library; above five genera per family and fifteen species per genus all higher taxon assignments were correct. We propose that using percent sequence identity between conventional barcode sequences may be a feasible and reasonably accurate method to identify animals to family/genus. However, the quality of the underlying database impacts accuracy of results; many outliers in our dataset could be attributed to taxonomic and/or sequencing errors in BOLD and GenBank. It seems that an accurate and complete reference library of families and genera of life could provide accurate higher level taxonomic identifications cheaply and accessibly, within years rather than decades.
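
The heuristic thresholds reported in this abstract (genus at PIdent > 95, family at PIdent ≥ 91) can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the `Hit` record, the `taxonomy` lookup, and the function name are hypothetical, and the example assumes the best BLAST hit and its percent identity have already been extracted.

```python
from typing import Dict, NamedTuple, Optional, Tuple

# Thresholds taken from the abstract; everything else here is illustrative.
GENUS_MIN_PIDENT = 95.0   # genus accepted only when pident > 95
FAMILY_MIN_PIDENT = 91.0  # family accepted when pident >= 91


class Hit(NamedTuple):
    """Hypothetical record for one BLAST hit."""
    subject_id: str
    pident: float  # percent sequence identity reported by BLAST


def assign_higher_taxon(
    best_hit: Hit,
    taxonomy: Dict[str, Tuple[str, str]],
) -> Optional[Tuple[str, str]]:
    """Return (rank, name) for the best hit, or None below both thresholds.

    `taxonomy` maps a reference sequence id to its (genus, family).
    """
    genus, family = taxonomy[best_hit.subject_id]
    if best_hit.pident > GENUS_MIN_PIDENT:
        return ("genus", genus)
    if best_hit.pident >= FAMILY_MIN_PIDENT:
        return ("family", family)
    return None


# Toy reference entry; the sequence id is made up.
reference = {"ref001": ("Araneus", "Araneidae")}
print(assign_higher_taxon(Hit("ref001", 96.2), reference))
print(assign_higher_taxon(Hit("ref001", 92.0), reference))
print(assign_higher_taxon(Hit("ref001", 88.5), reference))
```

A query whose best hit falls below 91% identity is left unassigned rather than forced into a family, which matches the abstract's framing of the thresholds as bounds on acceptable error.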

Overcoming limitations to environmental DNA studies: A coastal temperate reference sequence database for multiple chloroplast gene regions generated in a single assay.

10.22541/au.163252330.05592688/v1 ◽

2021 ◽

Author(s):

Nicole Foster ◽

Kor-jent Dijk ◽

Ed Biffin ◽

Jennifer Young ◽

Vicki Thomson ◽

...

Keyword(s):

Dna Sequences ◽

Dna Barcode ◽

Environmental Dna ◽

Reference Sequence ◽

Reference Database ◽

Chloroplast Gene ◽

Coastal Plants ◽

Reference Databases ◽

Targeted Capture ◽

Comprehensive Reference

A proliferation in environmental DNA (eDNA) research has increased the reliance on reference sequence databases to assign unknown DNA sequences to known taxa. Without comprehensive reference databases, DNA extracted from environmental samples cannot be correctly assigned to taxa, limiting the use of this genetic information to identify organisms in unknown sample mixtures. For animals, standard metabarcoding practices involve amplification of the mitochondrial cytochrome c oxidase subunit 1 (CO1) region, which is universally amplifiable across the majority of animal taxa. This region, however, does not work well as a DNA barcode for plants and fungi, and there is no similar universal single barcode locus that has the same species resolution. Therefore, generating reference sequences has been more difficult, and several loci have been suggested for use in parallel to achieve species-level identification. For this reason, we developed a multi-gene targeted capture approach to generate reference DNA sequences for plant taxa across 20 target chloroplast gene regions in a single assay. We successfully compiled a reference database for 93 temperate coastal plants including seagrasses, mangroves, and saltmarshes/samphires. We demonstrate the importance of a comprehensive reference database to prevent species going undetected in eDNA studies. We also investigate how using multiple chloroplast gene regions impacts the ability to discriminate between taxa.

Database establishment for the secondary fungal DNA barcode translational elongation factor 1α (TEF1α)

Genome ◽

10.1139/gen-2018-0083 ◽

2019 ◽

Vol 62 (3) ◽

pp. 160-169 ◽

Cited By ~ 8

Author(s):

Wieland Meyer ◽

Laszlo Irinyi ◽

Minh Thuy Vi Hoang ◽

Vincent Robert ◽

Dea Garcia-Hermoso ◽

...

Keyword(s):

Dna Sequences ◽

Fungal Infections ◽

Dna Barcode ◽

Elongation Factor ◽

Its Region ◽

Reference Sequence ◽

Diagnostic Tools ◽

Reference Database ◽

Elongation Factor 1Α ◽

Fungal Dna

With new or emerging fungal infections, human and animal fungal pathogens are a growing threat worldwide. Current diagnostic tools are slow, non-specific at the species and subspecies levels, and require specific morphological expertise to accurately identify pathogens from pure cultures. DNA barcodes are easily amplified, universal, short species-specific DNA sequences, which enable rapid identification by comparison with a well-curated reference sequence collection. The primary fungal DNA barcode, the ITS region, was introduced in 2012 and is now routinely used in diagnostic laboratories. However, the ITS region accurately identifies only around 75% of all medically relevant fungal species, which has prompted the development of a secondary barcode to increase the resolution power and suitability of DNA barcoding for fungal disease diagnostics. The translational elongation factor 1α (TEF1α) was selected in 2015 as a secondary fungal DNA barcode, but it has not been implemented in practice because no reference database existed. Here, we have established a quality-controlled reference database for the secondary barcode that, together with the ISHAM-ITS database, forms the ISHAM barcode database, available online at http://its.mycologylab.org/ . We encourage the mycology community to contribute actively.

Identification of Neoceratitis asiatica (Becker) (Diptera: Tephritidae) based on morphological characteristics and DNA barcode

Zootaxa ◽

10.11646/zootaxa.4363.4.7 ◽

2017 ◽

Vol 4363 (4) ◽

pp. 553

Author(s):

SHAOKUN GUO ◽

JIA HE ◽

ZIHUA ZHAO ◽

LIJUN LIU ◽

LIYUAN GAO ◽

...

Keyword(s):

Dna Sequences ◽

Phylogenetic Trees ◽

Gap Analysis ◽

Dna Barcode ◽

Morphological Characteristics ◽

Coi Gene ◽

Economic Losses ◽

Morphological Identification ◽

Lycium Barbarum ◽

Accurate Identification

Neoceratitis asiatica (Becker), which especially infests wolfberry (Lycium barbarum L.), can cause serious economic losses every year in China, especially to organic wolfberry production. In some important wolfberry plantings, it is difficult and time-consuming to rear the larvae or pupae to adults for morphological identification. Molecular identification based on DNA barcodes offers a solution to this problem. In this study, 15 samples were collected from Ningxia, China. Among them, five adults were identified according to their morphological characteristics. The utility of the mitochondrial DNA (mtDNA) cytochrome c oxidase I (COI) gene sequence as a DNA barcode for distinguishing N. asiatica was evaluated by analysing Kimura 2-parameter distances and phylogenetic trees. There were significant differences between intra-specific and inter-specific genetic distances according to the barcoding gap analysis. The uncertain larval and pupal samples fell within the same cluster as N. asiatica adults and formed a sister cluster to N. cyanescens. A combination of morphological and molecular methods enabled accurate identification of N. asiatica. This is the first study to use DNA barcodes to identify N. asiatica, and the obtained DNA sequences will be added to the DNA barcode database.
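
The Kimura 2-parameter (K2P) distance mentioned in this abstract has a standard closed form, d = -½ ln((1 - 2P - Q)·√(1 - 2Q)), where P and Q are the proportions of transitional and transversional differences between two aligned sequences. The Python sketch below applies that formula to pre-aligned toy sequences. The function names are hypothetical, and the barcoding-gap helper uses one common definition (smallest inter-specific distance minus largest intra-specific distance), which is an assumption rather than the study's stated method.

```python
import math
from typing import Iterable

# Purine <-> purine and pyrimidine <-> pyrimidine substitutions.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def k2p_distance(seq1: str, seq2: str) -> float:
    """Kimura 2-parameter distance between two pre-aligned DNA sequences.

    Columns containing gaps or ambiguity codes are skipped.
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    sites = transitions = transversions = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue
        sites += 1
        if a == b:
            continue
        if (a, b) in TRANSITIONS:
            transitions += 1
        else:
            transversions += 1
    if sites == 0:
        raise ValueError("no comparable sites")
    p = transitions / sites
    q = transversions / sites
    # math.log raises ValueError for saturated pairs, where K2P is undefined.
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


def barcoding_gap(intra: Iterable[float], inter: Iterable[float]) -> float:
    """Smallest inter-specific minus largest intra-specific distance."""
    return min(inter) - max(intra)


# Toy example: one transition across ten aligned sites.
print(round(k2p_distance("AAAAAAAAAA", "GAAAAAAAAA"), 4))
```

A positive value from `barcoding_gap` indicates that every between-species distance exceeds every within-species distance in the sample.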

Documenting decapod biodiversity in the Caribbean from DNA barcodes generated during field training in taxonomy

Biodiversity Data Journal ◽

10.3897/bdj.8.e47333 ◽

2020 ◽

Vol 8 ◽

Cited By ~ 2

Author(s):

Dagoberto Venera-Pontón ◽

Amy Driskell ◽

Sammy De Grave ◽

Darryl Felder ◽

Justin Scioli ◽

...

Keyword(s):

Cryptic Species ◽

Dna Sequences ◽

Marine Invertebrates ◽

Graduate Training ◽

Dna Barcode ◽

Reference Database ◽

Diagnostic Features ◽

Operational Taxonomic Units ◽

As Species ◽

Coi Sequences

DNA barcoding is a useful tool to identify the components of mixed or bulk samples, as well as to determine individuals that lack morphologically diagnostic features. However, the reference database of DNA barcode sequences is particularly sparsely populated for marine invertebrates and for tropical taxa. We used samples collected as part of two field courses, focused on graduate training in taxonomy and systematics, to generate DNA sequences of the barcode fragments of cytochrome c oxidase subunit I (COI) and mitochondrial ribosomal 16S genes for 447 individuals, representing at least 129 morphospecies of decapod crustaceans. COI sequences for 36% (51/140) of the species and 16S sequences for 26% (37/140) of the species were new to GenBank. Automatic Barcode Gap Discovery identified 140 operational taxonomic units (OTUs) which largely coincided with the morphospecies delimitations. Barcode identifications (i.e. matches to identified sequences) were especially useful for OTUs within Synalpheus, a group that is notoriously difficult to identify and rife with cryptic species, a number of which we could not identify to species, based on morphology. Non-concordance between morphospecies and barcode OTUs also occurred in a few cases of suspected cryptic species. As mitochondrial pseudogenes are particularly common in decapods, we investigate the potential for this dataset to include pseudogenes and discuss the utility of these sequences as species identifiers (i.e. barcodes). These results demonstrate that material collected and identified during training activities can provide useful incidental barcode reference samples for under-studied taxa.

The mutL Gene as a Genome-Wide Taxonomic Marker for High Resolution Discrimination of Lactiplantibacillus plantarum and Its Closely Related Taxa

Microorganisms ◽

10.3390/microorganisms9081570 ◽

2021 ◽

Vol 9 (8) ◽

pp. 1570

Author(s):

Chien-Hsun Huang ◽

Chih-Chieh Chen ◽

Yu-Chun Lin ◽

Chia-Hsuan Chen ◽

Ai-Yun Lee ◽

...

Keyword(s):

16S Rrna ◽

16S Rrna Gene ◽

Target Genes ◽

Marker Genes ◽

Rrna Gene ◽

Accurate Identification ◽

Discrimination Power ◽

Sequence Identity ◽

Genome Wide ◽

A Genome

The current taxonomy of the Lactiplantibacillus plantarum group comprises 17 closely related species that cannot be distinguished from each other by commonly used 16S rRNA gene sequencing. In this study, a whole-genome-based analysis was carried out to find highly discriminating target genes whose interspecific sequence identity is significantly lower than that of 16S rRNA or conventional housekeeping genes. In silico analyses of 774 core genes by the cano-wgMLST_BacCompare analytics platform indicated that the csbB, morA, murI, mutL, ntpJ, rutB, trmK, ydaF, and yhhX genes were the most promising candidates. Subsequently, the mutL gene was selected, and its discrimination power was further evaluated using Sanger sequencing. Among the type strains, mutL exhibited clearly lower interspecific sequence identity (61.6–85.6%; average: 66.6%) than the 16S rRNA gene (96.7–100%; average: 98.4%) and the conventional phylogenetic marker genes (e.g., dnaJ, dnaK, pheS, recA, and rpoA), and could therefore be used to separate the tested strains into species clusters. Consequently, species-specific primers were developed for fast and accurate identification of L. pentosus, L. argentoratensis, L. plantarum, and L. paraplantarum. During this study, one strain (BCRC 06B0048, L. pentosus) exhibited not only a relatively low mutL sequence identity (97.0%) but also a low digital DNA–DNA hybridization value (78.1%) with the type strain DSM 20314T, signifying that it has potential for reclassification as a novel subspecies. Our data demonstrate that mutL can be a genome-wide target for identifying and classifying the L. plantarum group species and for differentiating novel taxa from known species.
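
The identity ranges and averages quoted in this abstract are summaries over pairwise comparisons. As a minimal sketch of that kind of summary, the Python below computes percent identity for every pair in a set of pre-aligned sequences and reports the minimum, maximum, and mean. The sequences and names are toy values, the function names are hypothetical, and alignment columns are compared as-is, which is a simplification and not the study's procedure.

```python
from itertools import combinations
from statistics import mean
from typing import Dict, Tuple


def percent_identity(a: str, b: str) -> float:
    """Percent of identical columns between two pre-aligned sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and equally long")
    matches = sum(x == y for x, y in zip(a.upper(), b.upper()))
    return 100.0 * matches / len(a)


def identity_summary(aligned: Dict[str, str]) -> Tuple[float, float, float]:
    """Return (min, max, mean) percent identity over all sequence pairs."""
    values = [
        percent_identity(aligned[s], aligned[t])
        for s, t in combinations(sorted(aligned), 2)
    ]
    return min(values), max(values), mean(values)


# Toy alignment of three hypothetical strains.
toy = {
    "strain_a": "ACGTACGT",
    "strain_b": "ACGTACGA",
    "strain_c": "ACGAACGA",
}
low, high, avg = identity_summary(toy)
print(f"{low:.1f}-{high:.1f}% (average: {avg:.1f}%)")
```

A marker with a lower and wider identity range across species leaves more room to separate them, which is the property the abstract attributes to mutL relative to the 16S rRNA gene.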

DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome

Bioinformatics ◽

10.1093/bioinformatics/btab083 ◽

2021 ◽

Author(s):

Yanrong Ji ◽

Zhihan Zhou ◽

Han Liu ◽

Ramana V Davuluri

Keyword(s):

Dna Sequences ◽

Regulatory Elements ◽

Ease Of Use ◽

Fine Tuning ◽

Supplementary Information ◽

Sequence Motifs ◽

Semantic Relationship ◽

Accurate Identification ◽

Conserved Sequence ◽

Genome Wide

Motivation: Deciphering the language of non-coding DNA is one of the fundamental problems in genome research. Gene regulatory code is highly complex due to the existence of polysemy and distant semantic relationships, which previous informatics methods often fail to capture, especially in data-scarce scenarios. Results: To address this challenge, we developed a novel pre-trained bidirectional encoder representation, named DNABERT, to capture a global and transferable understanding of genomic DNA sequences based on upstream and downstream nucleotide contexts. We compared DNABERT to the most widely used programs for genome-wide regulatory element prediction and demonstrate its ease of use, accuracy and efficiency. We show that the single pre-trained transformers model can simultaneously achieve state-of-the-art performance on prediction of promoters, splice sites and transcription factor binding sites, after easy fine-tuning using small task-specific labeled data. Further, DNABERT enables direct visualization of nucleotide-level importance and semantic relationships within input sequences for better interpretability and accurate identification of conserved sequence motifs and functional genetic variant candidates. Finally, we demonstrate that DNABERT pre-trained on the human genome can even be readily applied to other organisms with exceptional performance. We anticipate that the pre-trained DNABERT model can be fine-tuned for many other sequence analysis tasks. Availability and implementation: The source code and the pretrained and finetuned models for DNABERT are available at GitHub (https://github.com/jerryji1993/DNABERT). Supplementary information: Supplementary data are available at Bioinformatics online.

Toward a global reference database of COI barcodes for marine zooplankton

Marine Biology ◽

10.1007/s00227-021-03887-y ◽

2021 ◽

Vol 168 (6) ◽

Author(s):

Ann Bucklin ◽

Katja T. C. A. Peijnenburg ◽

Ksenia N. Kosobokova ◽

Todd D. O’Brien ◽

Leocadio Blanco-Bercial ◽

...

Keyword(s):

Species Diversity ◽

Dna Sequences ◽

Reference Sequence ◽

Global Ocean ◽

Reference Database ◽

Data Repositories ◽

Marine Zooplankton ◽

The North ◽

Coi Sequences ◽

Taxonomic Groups

Characterization of species diversity of zooplankton is key to understanding, assessing, and predicting the function and future of pelagic ecosystems throughout the global ocean. The marine zooplankton assemblage, including only metazoans, is highly diverse and taxonomically complex, with an estimated ~28,000 species of 41 major taxonomic groups. This review provides a comprehensive summary of DNA sequences for the barcode region of mitochondrial cytochrome oxidase I (COI) for identified specimens. The foundation of this summary is the MetaZooGene Barcode Atlas and Database (MZGdb), a new open-access data and metadata portal that is linked to NCBI GenBank and BOLD data repositories. The MZGdb provides enhanced quality control and tools for assembling COI reference sequence databases that are specific to selected taxonomic groups and/or ocean regions, with associated metadata (e.g., collection georeferencing, verification of species identification, molecular protocols), and tools for statistical analysis, mapping, and visualization. To date, over 150,000 COI sequences for ~5600 described species of marine metazoan plankton (including holo- and meroplankton) are available via the MZGdb portal. This review uses the MZGdb as a resource for summaries of COI barcode data and metadata for important taxonomic groups of marine zooplankton and selected regions, including the North Atlantic, Arctic, North Pacific, and Southern Oceans. The MZGdb is designed to provide a foundation for analysis of species diversity of marine zooplankton based on DNA barcoding and metabarcoding for assessment of marine ecosystems and rapid detection of the impacts of climate change.

DNA barcode identification of fish products from Guiyang markets in southwest China

Journal of Food Protection ◽

10.4315/jfp-21-258 ◽

2022 ◽

Author(s):

Qian Tang ◽

Qi Luo ◽

Qian Duan ◽

Lei Deng ◽

Renyi Zhang

Keyword(s):

Dna Barcoding ◽

Molecular Identification ◽

Fish Consumption ◽

Southwest China ◽

Dna Barcode ◽

Guizhou Province ◽

Accurate Identification ◽

Continuous Growth ◽

Fish Products ◽

Fresh Frozen

Global fish consumption continues to rise along with population growth, which has led to overfishing of fishery resources. High-value fish that are overfished are especially often replaced by other fish. Therefore, the accurate identification of fish products in the market is a problem worthy of attention. In this study, full-DNA barcoding (FDB) and mini-DNA barcoding (MDB) were used to detect fraud in fish products in Guiyang, Guizhou province, China. The molecular identification results showed that 39 of the 191 samples were not consistent with their labels. The mislabelling rates of fresh, frozen, cooked and canned fish products were 11.70%, 20.00%, 34.09% and 50.00%, respectively. The average Kimura 2-parameter distances of MDB within species and genera were 0.27% and 5.41%, respectively, while the average distances of FDB were 0.17% within species and 6.17% within genera. In this study, commercial fraud was noticeable: most of the high-priced fish were replaced with low-priced fish of similar appearance. Our study indicated that DNA barcoding is a valid tool for the identification of fish products and that it can inform conservation and monitoring efforts, while confirming MDB as a reliable tool for fish products.
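
The abstract reports 39 mismatched samples out of 191 but does not state the overall rate. A short Python sketch of that arithmetic follows; the function name is hypothetical, and only the two counts come from the abstract. The per-category rates cannot be recomputed this way because the abstract does not give the per-category sample counts.

```python
def mislabelling_rate(mismatched: int, total: int) -> float:
    """Percentage of samples whose barcode identity disagrees with the label."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= mismatched <= total:
        raise ValueError("mismatched must be between 0 and total")
    return 100.0 * mismatched / total


# Counts taken from the abstract: 39 of 191 samples were inconsistent.
overall = mislabelling_rate(39, 191)
print(f"Overall mislabelling rate: {overall:.2f}%")  # 20.42%
```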
