Quantifying uncertainty of taxonomic placement in DNA barcoding and metabarcoding

2016 ◽  
Author(s):  
Panu Somervuo ◽  
Douglas Yu ◽  
Charles Xu ◽  
Yinqiu Ji ◽  
Jenni Hultman ◽  
...  

A crucial step in the use of DNA markers for biodiversity surveys is the assignment of Linnaean taxonomies (species, genus, etc.) to sequence reads. This allows the use of all the information known based on the taxonomic names. Taxonomic placement of DNA barcoding sequences is inherently probabilistic because DNA sequences contain errors, because there is natural variation among sequences within a species, and because reference databases are incomplete and can have false annotations. However, most existing bioinformatics methods for taxonomic placement either exclude uncertainty, or quantify it using metrics other than probability. In this paper we evaluate the performance of a recently proposed probabilistic taxonomic placement method PROTAX by applying it to both annotated reference sequence data as well as unknown environmental data. Our four case studies include contrasting taxonomic groups (fungi, bacteria, mammals, and insects), variation in the length and quality of the barcoding sequences (from individually Sanger-sequenced sequences to short Illumina reads), variation in the structures and sizes of the taxonomies (from 800 to 130 000 species), and variation in the completeness of the reference databases (representing 15% to 100% of the species). Our results demonstrate that PROTAX yields essentially unbiased assessment of probabilities of taxonomic placement, and thus that its quantification of species identification uncertainty is reliable. As expected, the accuracy of taxonomic placement increases with increasing coverage of taxonomic and reference sequence databases, and with increasing ratio of genetic variation among taxonomic levels over within taxonomic levels. Our results show that reliable species-level identification from environmental samples is still challenging, and thus neglecting identification uncertainty can lead to spurious inference. A key aim for future research is the completion and pruning of taxonomic and reference sequence databases, and making these two types of data compatible.
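
One way to read the claim of "essentially unbiased" probabilities is as a calibration statement: among placements assigned a probability near p, roughly a fraction p should turn out correct. The sketch below is not PROTAX code and does not reproduce the paper's evaluation; it is a minimal illustration, assuming a list of predicted probabilities and a matching list of true/false outcomes from a labelled test set. The function name `reliability_table` and the equal-width binning are choices made here for illustration.

```python
from collections import defaultdict


def reliability_table(probs, correct, n_bins=5):
    """Bin predictions by probability and compare the mean predicted
    probability in each bin with the observed fraction correct.

    Returns a list of (bin_index, count, mean_probability, fraction_correct).
    """
    bins = defaultdict(list)
    for p, ok in zip(probs, correct):
        # Equal-width bins on [0, 1]; p == 1.0 falls into the top bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, ok))

    table = []
    for idx in sorted(bins):
        items = bins[idx]
        mean_p = sum(p for p, _ in items) / len(items)
        frac_ok = sum(1 for _, ok in items if ok) / len(items)
        table.append((idx, len(items), mean_p, frac_ok))
    return table


if __name__ == "__main__":
    # Toy, made-up values purely to show the call shape.
    toy_probs = [0.1, 0.1, 0.9, 0.9]
    toy_correct = [False, False, True, True]
    for row in reliability_table(toy_probs, toy_correct):
        print(row)
```

For a well-calibrated method, the mean probability and the fraction correct in each row should be close; large gaps in either direction indicate over- or under-confidence.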

2021 ◽  
Vol 168 (6) ◽  
Author(s):  
Ann Bucklin ◽  
Katja T. C. A. Peijnenburg ◽  
Ksenia N. Kosobokova ◽  
Todd D. O’Brien ◽  
Leocadio Blanco-Bercial ◽  
...  

Characterization of species diversity of zooplankton is key to understanding, assessing, and predicting the function and future of pelagic ecosystems throughout the global ocean. The marine zooplankton assemblage, including only metazoans, is highly diverse and taxonomically complex, with an estimated ~28,000 species of 41 major taxonomic groups. This review provides a comprehensive summary of DNA sequences for the barcode region of mitochondrial cytochrome oxidase I (COI) for identified specimens. The foundation of this summary is the MetaZooGene Barcode Atlas and Database (MZGdb), a new open-access data and metadata portal that is linked to NCBI GenBank and BOLD data repositories. The MZGdb provides enhanced quality control and tools for assembling COI reference sequence databases that are specific to selected taxonomic groups and/or ocean regions, with associated metadata (e.g., collection georeferencing, verification of species identification, molecular protocols), and tools for statistical analysis, mapping, and visualization. To date, over 150,000 COI sequences for ~5600 described species of marine metazoan plankton (including holo- and meroplankton) are available via the MZGdb portal. This review uses the MZGdb as a resource for summaries of COI barcode data and metadata for important taxonomic groups of marine zooplankton and selected regions, including the North Atlantic, Arctic, North Pacific, and Southern Oceans. The MZGdb is designed to provide a foundation for analysis of species diversity of marine zooplankton based on DNA barcoding and metabarcoding for assessment of marine ecosystems and rapid detection of the impacts of climate change.


Author(s):  
Nicole Foster ◽  
Kor-jent Dijk ◽  
Ed Biffin ◽  
Jennifer Young ◽  
Vicki Thomson ◽  
...  

A proliferation in environmental DNA (eDNA) research has increased the reliance on reference sequence databases to assign unknown DNA sequences to known taxa. Without comprehensive reference databases, DNA extracted from environmental samples cannot be correctly assigned to taxa, limiting the use of this genetic information to identify organisms in unknown sample mixtures. For animals, standard metabarcoding practices involve amplification of the mitochondrial Cytochrome-c oxidase subunit 1 (CO1) region, which is a universally amplifiable region across the majority of animal taxa. This region, however, does not work well as a DNA barcode for plants and fungi, and there is no similar universal single barcode locus that has the same species resolution. Therefore, generating reference sequences has been more difficult, and several loci have been suggested for use in parallel to reach species-level identification. For this reason, we developed a multi-gene targeted capture approach to generate reference DNA sequences for plant taxa across 20 target chloroplast gene regions in a single assay. We successfully compiled a reference database for 93 temperate coastal plants including seagrasses, mangroves, and saltmarshes/samphires. We demonstrate the importance of a comprehensive reference database to prevent species going undetected in eDNA studies. We also investigate how using multiple chloroplast gene regions impacts the ability to discriminate between taxa.


F1000Research ◽  
2020 ◽  
Vol 9 ◽  
pp. 339 ◽  
Author(s):  
Tshifhiwa G. Matumba ◽  
Jody Oliver ◽  
Nigel P. Barker ◽  
Christopher D. McQuaid ◽  
Peter R. Teske

Background: Mitochondrial DNA (mtDNA) has long been used to date historical demographic events. The idea that it is useful for molecular dating rests on the premise that its evolution is neutral. Even though this idea has long been challenged, the evidence against clock-like evolution of mtDNA is often ignored. Here, we present a particularly clear and simple example to illustrate the implications of violations of the assumption of selective neutrality. Methods: DNA sequences were generated for the mtDNA COI gene and the nuclear 28S rRNA of two closely related rocky shore snails, and species-level variation was compared. Nuclear rRNA is not usually used to study intraspecific variation in species that are not spatially structured, presumably because this marker is assumed to evolve so slowly that it is more suitable for phylogenetics.  Results: Even though high inter-specific divergence reflected the faster evolutionary rate of COI, intraspecific genetic variation was similar for both markers. As a result, estimates of population expansion times based on mismatch distributions differed between the two markers by millions of years. Conclusions: Assuming that 28S evolution is more clock-like, these findings can be explained by variation-reducing purifying selection in mtDNA at the species level, and an elevated divergence rate caused by diversifying selection between the two species. Although these two selective forces together make mtDNA suitable as a marker for species identifications by means of DNA barcoding because they create a ‘barcoding gap’, estimates of demographic change based on this marker can be expected to be highly unreliable. Our study contributes to the growing evidence that the utility of mtDNA sequence data beyond DNA barcoding is limited.
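
A mismatch distribution is the histogram of pairwise nucleotide differences among the sequences in a sample. The abstract does not describe how the authors computed theirs, so the following is only a minimal sketch of that definition, assuming the sequences are already aligned and of equal length. The function name `mismatch_distribution` and the toy sequences are invented for illustration.

```python
from collections import Counter
from itertools import combinations


def mismatch_distribution(seqs):
    """Count, for every pair of aligned sequences, the number of differing
    sites, and return {number_of_differences: number_of_pairs}."""
    counts = Counter()
    for a, b in combinations(seqs, 2):
        if len(a) != len(b):
            raise ValueError("sequences must be aligned to the same length")
        diffs = sum(1 for x, y in zip(a, b) if x != y)
        counts[diffs] += 1
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    # Made-up toy alignment, not data from the study.
    toy = ["AAAA", "AAAT", "AATT"]
    print(mismatch_distribution(toy))
```

Converting the shape of such a histogram into an expansion time additionally requires a substitution rate, which is where the neutrality assumption questioned in the abstract enters.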


2017 ◽  
Author(s):  
Jan-Niklas Macher ◽  
Till-Hendrik Macher ◽  
Florian Leese

Metabarcoding and metagenomic approaches are becoming routine techniques in biodiversity assessment and ecological studies. The assignment of taxonomic information to sequences is challenging, as many reference libraries are lacking information on certain taxonomic groups and can contain erroneous sequences. Combining different reference databases is therefore a promising approach for maximizing taxonomic coverage and reliability of results. This tutorial shows how to use the “BOLD_NCBI_Merger” script to combine sequence data obtained from the National Center for Biotechnology Information (NCBI) GenBank and the Barcode of Life Database (BOLD) and prepare it for taxonomic assignment with the software MEGAN.
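
The abstract does not describe the internals of the "BOLD_NCBI_Merger" script, so the following is not that script. It is a minimal sketch of the general idea of combining two reference sets, assuming both have been exported as FASTA text with the taxon label as the header, and dropping records whose label and sequence are already present. The helper names `parse_fasta` and `merge_records` are hypothetical.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples.
    Sequences are joined across lines and upper-cased."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks).upper()))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks).upper()))
    return records


def merge_records(*sources):
    """Concatenate record lists, keeping the first occurrence of each
    (header, sequence) pair and preserving input order."""
    seen = set()
    merged = []
    for records in sources:
        for record in records:
            if record not in seen:
                seen.add(record)
                merged.append(record)
    return merged


if __name__ == "__main__":
    # Made-up toy records standing in for two database exports.
    source_a = ">Homo sapiens\nACGT\n>Mus musculus\nGG\nCC\n"
    source_b = ">Homo sapiens\nacgt\n>Rattus rattus\nTTTT\n"
    for header, seq in merge_records(parse_fasta(source_a), parse_fasta(source_b)):
        print(f">{header}\n{seq}")
```

Real exports from the two databases use different header conventions, so a practical merger would first normalise headers to a common taxon label before deduplicating.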


2018 ◽  
Vol 7 (2.20) ◽  
pp. 10
Author(s):  
G Geetha ◽  
G Surekha ◽  
P Aditya Sharma ◽  
E Uma Shankari

The primary aim of this paper is to provide a secure algorithm for confidentially hiding DNA sample sequence data using special software in cloud computing environments. The proposed algorithm for hiding DNA sequences is based on binary coding and complementary pairing rules. A DNA reference sequence is taken as the sample secret data, denoted M; after several steps are applied, the final result obtained in the cloud environment is M’’’. Extracting the original data M from the hidden DNA reference sequence is left to the user, and is done only if the user wants to process the data. There are likewise security issues arising from the manipulation of information. The user’s data may therefore be divided among service providers (SPs) in such a way that at least a specified threshold number of SPs must be reached to recover the whole data block. In this paper, we propose a low-cost secured multi-cloud storage (SCMCS) model for cloud computing that distributes data economically among the SPs available in the market, providing customers with data availability as well as secure storage.
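
The abstract names binary coding and complementary pairing as the building blocks but does not give the exact steps from M to M’’’. The sketch below is therefore only an illustration of those two building blocks, not the paper's algorithm. It assumes a 2-bit mapping (00→A, 01→C, 10→G, 11→T), which is one common choice and is an assumption here, together with the standard Watson-Crick complement (A↔T, C↔G). All function names are hypothetical.

```python
# Assumed 2-bit coding rule; the paper's actual rule is not stated in the abstract.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}
# Standard Watson-Crick complementary pairing.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def encode(data: bytes) -> str:
    """Binary coding: turn each byte into four bases, two bits per base."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def complement(dna: str) -> str:
    """Apply the complementary pairing rule base by base."""
    return "".join(COMPLEMENT[base] for base in dna)


def decode(dna: str) -> bytes:
    """Invert encode(): map bases back to bits and regroup into bytes."""
    bits = "".join(BASE_TO_BITS[base] for base in dna)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


if __name__ == "__main__":
    secret = b"A"
    hidden = complement(encode(secret))   # one coding step plus one pairing step
    recovered = decode(complement(hidden))
    print(encode(secret), hidden, recovered)
```

Because the complement is its own inverse, applying it twice restores the encoded sequence. On its own this provides obscurity rather than cryptographic security, which is presumably why the paper combines it with further steps and with threshold-based distribution across providers.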


2021 ◽  
Vol 33 (2) ◽  
pp. 155
Author(s):  
K. Clark ◽  
J. Cole ◽  
D. Bickhart ◽  
J. Hutchison ◽  
D. Null ◽  
...  

Holstein haplotype 2 (HH2) is embryonic lethal and carried by 1.21% of the US Holstein population. Using next-generation sequencing, we identified a high-impact frameshift mutation in intraflagellar protein 80 (IFT80) as the putative causal mutation. In bovine embryos, IFT80 expression begins at the 8-cell stage and decreases by the blastocyst stage. We hypothesised that the loss of function of IFT80 early in development causes the lethal phenotype. The aim of this study was to mimic the mutation observed in vivo using a CRISPR-Cas9 approach to determine its effect on embryo development. Two guide RNAs (gRNAs) were designed to disrupt exon 11 (Ex11), one before and one after the known IFT80 mutation site, creating a 317-nucleotide (nt) cut to facilitate genotyping. Then, gRNAs annealed to a tracr-Cas9 mRNA complex were delivered to 1-cell embryos by microinjection. Each replicate contained control embryos injected with only Cas9 mRNA and treated embryos injected with gRNAs targeting IFT80. Embryos from each group were collected at the 8-cell stage for genotyping and gene expression analysis (n=47), or on Day 8 to validate genotypes of embryos left to develop (n=50). DNA sequences containing gRNA target sequences were amplified and visualised on an agarose gel. IFT80 expression was determined in biallelic embryos (n=13) using quantitative PCR and normalized to GAPDH. Primers were designed for the transcript regions before and after the gRNA target sequences, exons 9 and 12, respectively. Expression data were analysed using SAS software (v. 9.4; SAS Institute Inc.) using PROC GLM and LSMEANS to determine expression differences. Biallelic samples (n=9) were Sanger-sequenced (SS) and aligned with the reference sequence to determine exact cut sites. Protein amino acid (AA) sequences were predicted using SS data. Protein models were constructed using the I-Tasser platform, and then aligned and visualised using PyMol 2.4. 
Biallelic edits showed a significant decrease in exon 12 expression (P<0.05), and no difference in exon 9 compared with controls (P>0.05), indicating that the transcript was severely affected downstream of the edited sites. The reference protein model contained 777 AA, whereas the biallelic sample with the most accurate cut sites yielded a 385-AA protein, indicating that the mutation severely altered protein conformation and possible function. Embryos injected with CRISPR-Cas9 targeting Ex11 arrested at the 8-cell stage and failed to form blastocysts. Day 8 embryos were genotyped (n=24) and 58% were biallelic, 21% were monoallelic, and 21% appeared wild-type. Given the high rate of edits, the observed embryonic arrest is likely due to disruption of IFT80, and wild-type embryos may contain small edits not visible by gel. In conclusion, generation of CRISPR-Cas9 IFT80 knockouts demonstrated that the frameshift mutation in Ex11 results in a seemingly nonfunctional protein that is responsible for the embryonic lethality seen in HH2 carriers. Future research is needed to determine how IFT80 regulates embryonic development. This research was supported by USDA-NIFA National Needs Fellowship, USDA-NIFA AFRI Grant No. 2019-67015-28998.


Genome ◽  
2016 ◽  
Vol 59 (11) ◽  
pp. 913-932 ◽  
Author(s):  
Jianping Xu

Fungi are ubiquitous in both natural and human-made environments. They play important roles in the health of plants, animals, and humans, and in broad ecosystem functions. Thus, having an efficient species-level identification system could significantly enhance our ability to treat fungal diseases and to monitor the spatial and temporal patterns of fungal distributions and migrations. DNA barcoding is a potent approach for rapid identification of fungal specimens, generating novel species hypotheses, and guiding biodiversity and ecological studies. In this mini-review, I briefly summarize (i) the history of DNA sequence-based fungal identification; (ii) the emergence of the ITS region as the consensus primary fungal barcode; (iii) the use of the ITS barcodes to address a variety of issues on fungal diversity from local to global scales, including generating a large number of species hypotheses; and (iv) the problems with the ITS barcode region and the approaches to overcome these problems. Similar to DNA barcoding research on plants and animals, significant progress has been achieved over the last few years in terms of both the questions being addressed and the foundations being laid for future research endeavors. However, significant challenges remain. I suggest three broad areas of research to enhance the usefulness of fungal DNA barcoding to meet the current and future challenges: (i) develop a common set of primers and technologies that allow the amplification and sequencing of all fungi at both the primary and secondary barcode loci; (ii) compile a centralized reference database that includes all recognized fungal species as well as species hypotheses, and allows regular updates from the research community; and (iii) establish a consensus set of new species recognition criteria based on barcode DNA sequences that can be applied across the fungal kingdom.


Author(s):  
Martin Steinegger ◽  
Steven L Salzberg

Metagenomic sequencing allows researchers to investigate organisms sampled from their native environments by sequencing their DNA directly, and then quantifying the abundance and taxonomic composition of the organisms thus captured. However, these types of analyses are sensitive to contamination in public databases caused by incorrectly labeled reference sequences. Here we describe Conterminator, an efficient method to detect and remove incorrectly labelled sequences by an exhaustive all-against-all sequence comparison. Our analysis reports contamination in 114,035 sequences and 2767 species in the NCBI Reference Sequence Database (RefSeq), 2,161,746 sequences and 6795 species in the GenBank database, and 14,132 protein sequences in the NR non-redundant protein database. Conterminator uncovers contamination in sequences spanning the whole range from draft genomes to “complete” model organism genomes. Our method, which scales linearly with input size, was able to process 3.3 terabytes of genomic sequence data in 12 days on a single 32-core compute node. We believe that Conterminator can become an important tool to ensure the quality of reference databases with particular importance for downstream metagenomic analyses. Source code (GPLv3): https://github.com/martin-steinegger/conterminator


2018 ◽  
Author(s):  
Michael Gruenstaeudl ◽  
Yannick Hartmaring

Background: The submission of DNA sequences to public sequence databases is an essential, but insufficiently automated step in the process of generating and disseminating novel DNA sequence data. Despite the centrality of database submissions to biological research, the range of available software tools that facilitate the preparation of sequence data for database submissions is low, especially for sequences generated via plant DNA barcoding. Current submission procedures can be complex and prohibitively time expensive for any but a small number of input sequences. A user-friendly software tool is needed that streamlines the file preparation for database submissions of DNA sequences that are commonly generated in plant DNA barcoding. Methods: A Python package was developed that converts DNA sequences from the common EMBL and GenBank flat file formats to submission-ready, tab-delimited spreadsheets (so-called “checklists”) for a subsequent upload to the public sequence database of the European Nucleotide Archive (ENA). The software tool, titled “EMBL2checklists”, automatically converts DNA sequences, their annotation features, and associated metadata into the idiosyncratic format of marker-specific ENA checklists and, thus, generates output that can be uploaded via the interactive Webin submission system of ENA. Results: EMBL2checklists provides a simple, platform-independent tool that automates the conversion of common plant DNA barcoding sequences into easily editable spreadsheets that require no further processing but their upload to ENA via the interactive Webin submission system. The software is equipped with an intuitive graphical as well as an efficient command-line interface for its operation. The utility of the software is illustrated by its application in the submission of DNA sequences of two recent plant phylogenetic investigations and one fungal metagenomic study. Discussion: EMBL2checklists bridges the gap between common software suites for DNA sequence assembly and annotation and the interactive data submission process of ENA. It represents an easy-to-use solution for plant biologists without bioinformatics expertise to generate submission-ready checklists from common plant DNA sequence data. It allows the post-processing of checklists as well as work-sharing during the submission process and solves a critical bottleneck in the effort to increase participation in public data sharing.


Proteomes ◽  
2019 ◽  
Vol 7 (2) ◽  
pp. 19
Author(s):  
Yoji Igarashi ◽  
Daisuke Mori ◽  
Susumu Mitsuyama ◽  
Kazutoshi Yoshitake ◽  
Hiroaki Ono ◽  
...  

Metagenomic data have mainly been addressed by showing the composition of organisms based on a small part of a well-examined genomic sequence, such as ribosomal RNA genes and mitochondrial DNAs. On the contrary, whole metagenomic data obtained by the shotgun sequence method have not often been fully analyzed through a homology search because the genomic data in databases for living organisms on earth are insufficient. In order to complement the results obtained through homology-search-based methods with shotgun metagenome data, we focused on the composition of protein domains deduced from the sequences of genomes and metagenomes, and we utilized them in characterizing genomes and metagenomes, respectively. First, we compared the relationships based on similarities in the protein domain composition with the relationships based on sequence similarities. We searched for protein domains of 325 bacterial species using the Pfam database. Next, the correlation coefficients of protein domain compositions between every pair of bacteria were examined. Every pairwise genetic distance was also calculated from 16S rRNA or DNA gyrase subunit B. We compared the results of these methods and found a moderate correlation between them. Essentially, the same results were obtained when we used partial random 100 bp DNA sequences of the bacterial genomes, which simulated raw sequence data obtained from short-read next-generation sequences. Then, we applied the method to actual environmental data obtained by shotgun sequencing. The method showed that the transition of the microbial phase occurred because of the seasonal change in water temperature. These results showed the usability of the method in characterizing metagenomic data based on protein domain compositions.
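
The pairwise comparison of protein domain compositions can be pictured as correlating count vectors. The abstract does not say which correlation coefficient was used, so the sketch below assumes Pearson's r and assumes each genome is summarised as a mapping from domain identifier to count. The domain labels and counts are made-up placeholders, and the function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def pairwise_correlations(genomes):
    """genomes: {name: {domain_id: count}}.
    Returns {(name_a, name_b): r} for every unordered pair of genomes,
    using the union of all observed domains as the vector axes."""
    domains = sorted({d for counts in genomes.values() for d in counts})
    names = sorted(genomes)
    vectors = {n: [genomes[n].get(d, 0) for d in domains] for n in names}
    result = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            result[(a, b)] = pearson(vectors[a], vectors[b])
    return result


if __name__ == "__main__":
    # Placeholder domain labels and counts, not real Pfam data.
    toy = {
        "genome_1": {"dom_A": 2, "dom_B": 4, "dom_C": 6},
        "genome_2": {"dom_A": 1, "dom_B": 2, "dom_C": 3},
        "genome_3": {"dom_A": 3, "dom_B": 2, "dom_C": 1},
    }
    for pair, r in pairwise_correlations(toy).items():
        print(pair, round(r, 3))
```

The resulting matrix of coefficients can then be compared with a matrix of marker-gene distances, which is the comparison the abstract reports as moderately correlated.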

