Overcoming limitations to environmental DNA studies: A coastal temperate reference sequence database for multiple chloroplast gene regions generated in a single assay.

Author(s):  
Nicole Foster ◽  
Kor-jent Dijk ◽  
Ed Biffin ◽  
Jennifer Young ◽  
Vicki Thomson ◽  
...  

A proliferation in environmental DNA (eDNA) research has increased the reliance on reference sequence databases to assign unknown DNA sequences to known taxa. Without comprehensive reference databases, DNA extracted from environmental samples cannot be correctly assigned to taxa, limiting the use of this genetic information to identify organisms in unknown sample mixtures. For animals, standard metabarcoding practices involve amplification of the mitochondrial Cytochrome-c oxidase subunit 1 (CO1) region, which is universally amplifiable across the majority of animal taxa. This region, however, does not work well as a DNA barcode for plants and fungi, and there is no similar universal single barcode locus with the same species resolution. Generating reference sequences has therefore been more difficult, and several loci have been suggested for use in parallel to reach species-level identification. For this reason, we developed a multi-gene targeted capture approach to generate reference DNA sequences for plant taxa across 20 target chloroplast gene regions in a single assay. We successfully compiled a reference database for 93 temperate coastal plants including seagrasses, mangroves, and saltmarshes/samphires. We demonstrate the importance of a comprehensive reference database to prevent species going undetected in eDNA studies. We also investigate how using multiple chloroplast gene regions impacts the ability to discriminate between taxa.
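
The closing point of this abstract, that combining gene regions can change which taxa are distinguishable, can be illustrated with a toy example. The Python sketch below is not the authors' pipeline: the taxon labels, locus labels, sequences, and the function name `unresolved_pairs` are all hypothetical, and "distinguishable" is simplified to "not identical at every locus considered".

```python
from itertools import combinations

# Hypothetical toy reference: taxon -> locus -> sequence (not real data).
REFERENCE = {
    "taxon_1": {"locus_a": "ACGT", "locus_b": "TTGA"},
    "taxon_2": {"locus_a": "ACGT", "locus_b": "TTGC"},
    "taxon_3": {"locus_a": "ACGA", "locus_b": "TTGC"},
}


def unresolved_pairs(reference, loci):
    """Return taxon pairs whose sequences are identical at every given locus."""
    pairs = []
    for a, b in combinations(sorted(reference), 2):
        if all(reference[a][locus] == reference[b][locus] for locus in loci):
            pairs.append((a, b))
    return pairs


if __name__ == "__main__":
    for loci in (["locus_a"], ["locus_b"], ["locus_a", "locus_b"]):
        print(loci, unresolved_pairs(REFERENCE, loci))
```

In this toy data each locus on its own leaves one pair of taxa unresolved, while the two loci together separate all three taxa. Real reference sequences would need alignment and a distance threshold rather than exact string equality.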

Genome ◽  
2019 ◽  
Vol 62 (3) ◽  
pp. 160-169 ◽  
Author(s):  
Wieland Meyer ◽  
Laszlo Irinyi ◽  
Minh Thuy Vi Hoang ◽  
Vincent Robert ◽  
Dea Garcia-Hermoso ◽  
...  

With new or emerging fungal infections, human and animal fungal pathogens are a growing threat worldwide. Current diagnostic tools are slow, non-specific at the species and subspecies levels, and require specific morphological expertise to accurately identify pathogens from pure cultures. DNA barcodes are easily amplified, universal, short species-specific DNA sequences, which enable rapid identification by comparison with a well-curated reference sequence collection. The primary fungal DNA barcode, the ITS region, was introduced in 2012 and is now routinely used in diagnostic laboratories. However, the ITS region only accurately identifies around 75% of all medically relevant fungal species, which has prompted the development of a secondary barcode to increase the resolution power and suitability of DNA barcoding for fungal disease diagnostics. The translational elongation factor 1α (TEF1α) was selected in 2015 as a secondary fungal DNA barcode, but it has not been implemented in practice because no reference database existed. Here, we have established a quality-controlled reference database for the secondary barcode that, together with the ISHAM-ITS database, forms the ISHAM barcode database, available online at http://its.mycologylab.org/ . We encourage the mycology community to contribute actively.


2021 ◽  
Vol 17 (11) ◽  
pp. e1009581
Author(s):  
Michael S. Robeson ◽  
Devon R. O’Rourke ◽  
Benjamin D. Kaehler ◽  
Michal Ziemski ◽  
Matthew R. Dillon ◽  
...  

Nucleotide sequence and taxonomy reference databases are critical resources for widespread applications including marker-gene and metagenome sequencing for microbiome analysis, diet metabarcoding, and environmental DNA (eDNA) surveys. Reproducibly generating, managing, using, and evaluating nucleotide sequence and taxonomy reference databases creates a significant bottleneck for researchers aiming to generate custom sequence databases. Furthermore, database composition drastically influences results, and lack of standardization limits cross-study comparisons. To address these challenges, we developed RESCRIPt, a Python 3 software package and QIIME 2 plugin for reproducible generation and management of reference sequence taxonomy databases, including dedicated functions that streamline creating databases from popular sources, and functions for evaluating, comparing, and interactively exploring qualitative and quantitative characteristics across reference databases. To highlight the breadth and capabilities of RESCRIPt, we provide several examples for working with popular databases for microbiome profiling (SILVA, Greengenes, NCBI-RefSeq, GTDB), eDNA and diet metabarcoding surveys (BOLD, GenBank), as well as for genome comparison. We show that bigger is not always better, and reference databases with standardized taxonomies and those that focus on type strains have quantitative advantages, though may not be appropriate for all use cases. Most databases appear to benefit from some curation (quality filtering), though sequence clustering appears detrimental to database quality. Finally, we demonstrate the breadth and extensibility of RESCRIPt for reproducible workflows with a comparison of global hepatitis genomes. RESCRIPt provides tools to democratize the process of reference database acquisition and management, enabling researchers to reproducibly and transparently create reference materials for diverse research applications. 
RESCRIPt is released under a permissive BSD-3 license at https://github.com/bokulich-lab/RESCRIPt.


Author(s):  
Michael S. Robeson ◽  
Devon R. O’Rourke ◽  
Benjamin D. Kaehler ◽  
Michal Ziemski ◽  
Matthew R. Dillon ◽  
...  

Background: Nucleotide sequence and taxonomy reference databases are critical resources for widespread applications including marker-gene and metagenome sequencing for microbiome analysis, diet metabarcoding, and environmental DNA (eDNA) surveys. Reproducibly generating, managing, using, and evaluating nucleotide sequence and taxonomy reference databases creates a significant bottleneck for researchers aiming to generate custom sequence databases. Furthermore, database composition drastically influences results, and lack of standardization limits cross-study comparisons. To address these challenges, we developed RESCRIPt, a software package for reproducible generation and management of reference sequence taxonomy databases, including dedicated functions that streamline creating databases from popular sources, and functions for evaluating, comparing, and interactively exploring qualitative and quantitative characteristics across reference databases. Results: To highlight the breadth and capabilities of RESCRIPt, we provide several examples for working with popular databases for microbiome profiling (SILVA, Greengenes, NCBI-RefSeq, GTDB), eDNA, and diet metabarcoding surveys (BOLD, GenBank), as well as for genome comparison. We show that bigger is not always better, and reference databases with standardized taxonomies and those that focus on type strains have quantitative advantages, though may not be appropriate for all use cases. Most databases appear to benefit from some curation (quality filtering), though sequence clustering appears detrimental to database quality. Finally, we demonstrate the breadth and extensibility of RESCRIPt for reproducible workflows with a comparison of global hepatitis genomes. Conclusions: RESCRIPt provides tools to democratize the process of reference database acquisition and management, enabling researchers to reproducibly and transparently create reference materials for diverse research applications.
RESCRIPt is released under a permissive BSD-3 license at https://github.com/bokulich-lab/RESCRIPt.


2021 ◽  
Vol 168 (6) ◽  
Author(s):  
Ann Bucklin ◽  
Katja T. C. A. Peijnenburg ◽  
Ksenia N. Kosobokova ◽  
Todd D. O’Brien ◽  
Leocadio Blanco-Bercial ◽  
...  

Characterization of species diversity of zooplankton is key to understanding, assessing, and predicting the function and future of pelagic ecosystems throughout the global ocean. The marine zooplankton assemblage, including only metazoans, is highly diverse and taxonomically complex, with an estimated ~28,000 species of 41 major taxonomic groups. This review provides a comprehensive summary of DNA sequences for the barcode region of mitochondrial cytochrome oxidase I (COI) for identified specimens. The foundation of this summary is the MetaZooGene Barcode Atlas and Database (MZGdb), a new open-access data and metadata portal that is linked to NCBI GenBank and BOLD data repositories. The MZGdb provides enhanced quality control and tools for assembling COI reference sequence databases that are specific to selected taxonomic groups and/or ocean regions, with associated metadata (e.g., collection georeferencing, verification of species identification, molecular protocols), and tools for statistical analysis, mapping, and visualization. To date, over 150,000 COI sequences for ~5600 described species of marine metazoan plankton (including holo- and meroplankton) are available via the MZGdb portal. This review uses the MZGdb as a resource for summaries of COI barcode data and metadata for important taxonomic groups of marine zooplankton and selected regions, including the North Atlantic, Arctic, North Pacific, and Southern Oceans. The MZGdb is designed to provide a foundation for analysis of species diversity of marine zooplankton based on DNA barcoding and metabarcoding for assessment of marine ecosystems and rapid detection of the impacts of climate change.


2021 ◽  
Author(s):  
Thomas K. F. Wong ◽  
Teng Li ◽  
Louis Ranjard ◽  
Steven Wu ◽  
Jeet Sukumaran ◽  
...  

A current strategy for obtaining haplotype information from several individuals involves short-read sequencing of pooled amplicons, where fragments from each individual are identified by a unique DNA barcode. In this paper, we report a new method to recover the phylogeny of haplotypes from short-read sequences obtained using pooled amplicons from a mixture of individuals, without barcoding. The method, AFPhyloMix, accepts an alignment of the mixture of reads against a reference sequence, obtains the single-nucleotide polymorphism (SNP) patterns along the alignment, and constructs the phylogenetic tree according to the SNP patterns. AFPhyloMix adopts a Bayesian model of inference to estimate the phylogeny of the haplotypes and their relative frequencies, given that the number of haplotypes is known. In our simulations, AFPhyloMix achieved at least 80% accuracy at recovering the phylogenies and frequencies of the constituent haplotypes, for mixtures with up to 15 haplotypes. AFPhyloMix also worked well on a real data set of kangaroo mitochondrial DNA sequences.


2016 ◽  
Author(s):  
Jonathan A Coddington ◽  
Ingi Agnarsson ◽  
Ren-Chung Cheng ◽  
Klemen Čandek ◽  
Amy Driskell ◽  
...  

The use of unique DNA sequences as a method for taxonomic identification is no longer fundamentally controversial, even though debate continues on the best markers, methods, and technology to use. Although both existing databanks such as GenBank and BOLD, as well as reference taxonomies, are imperfect, in best-case scenarios “barcodes” (whether single or multiple, organelle or nuclear, loci) clearly are an increasingly fast and inexpensive method of identification, especially as compared to manual identification of unknowns by increasingly rare expert taxonomists. Because most species on Earth are undescribed, a complete reference database at the species level is impractical in the near term. The question therefore arises whether unidentified species can, using DNA barcodes, be accurately assigned to more inclusive groups such as genera and families, taxonomic ranks of putatively monophyletic groups for which the global inventory is more complete and stable. We used a carefully chosen test library of CO1 sequences from 49 families, 313 genera, and 816 species of spiders to assess the accuracy of genus- and family-level identifications. We queried each sequence against the entire library with BLAST and retained the top ten hits per query, yielding 8160 hits. The percent sequence identity was reported from these hits (PIdent, range 75-100%). Accurate identification (PIdent above which errors totaled less than 5%) occurred for genera at PIdent values > 95 and families at PIdent values ≥ 91, suggesting these as heuristic thresholds for generic and familial identifications in spiders. Accuracy of identification increases with numbers of species/genus and genera/family in the library; above five genera per family and fifteen species per genus all identifications were correct. We propose that using percent sequence identity between conventional barcode sequences may be a feasible and reasonably accurate method to identify animals to family/genus. 
However, the quality of the underlying database impacts accuracy of results; many outliers in our dataset could be attributed to taxonomic and/or sequencing errors in BOLD and GenBank. It seems that an accurate and complete reference library of families and genera of life could provide accurate higher level taxonomic identifications cheaply and accessibly, within years rather than decades.
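
The heuristic thresholds reported in this abstract can be written as a simple decision rule. The Python sketch below is illustrative only: the function name `assign_rank` and its interface are assumptions rather than the authors' code, and only the cut-offs (PIdent > 95 for genus, PIdent ≥ 91 for family, derived for spiders) come from the abstract.

```python
from typing import Optional

GENUS_MIN_EXCLUSIVE = 95.0   # genus accepted when PIdent > 95
FAMILY_MIN_INCLUSIVE = 91.0  # family accepted when PIdent >= 91


def assign_rank(pident: float) -> Optional[str]:
    """Return the most specific rank supported by a BLAST percent identity."""
    if not 0.0 <= pident <= 100.0:
        raise ValueError("pident must be between 0 and 100")
    if pident > GENUS_MIN_EXCLUSIVE:
        return "genus"
    if pident >= FAMILY_MIN_INCLUSIVE:
        return "family"
    return None  # below both thresholds: no confident assignment


if __name__ == "__main__":
    for value in (99.2, 95.0, 91.0, 88.5):
        print(value, assign_rank(value))
```

Note the asymmetry in the boundaries: a hit at exactly 95% identity supports only a family-level call, while a hit at exactly 91% still qualifies for family. Thresholds for other animal groups would need to be established separately.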


2021 ◽  
Author(s):  
Nicole Foster ◽  
Kor-jent Van Dijk ◽  
Edward Biffin ◽  
Jennifer Young ◽  
Vicki Ann Thomson ◽  
...  

Metabarcoding of plant DNA recovered from environmental samples, termed environmental DNA (eDNA), has been used to detect invasive species, track biodiversity changes and reconstruct past ecosystems. The P6 loop of the trnL intron is the most widely utilized gene region for metabarcoding plants due to the short fragment length and subsequent ease of recovery from degraded DNA, which is characteristic of environmental samples. However, the taxonomic resolution for this gene region is limited, often precluding species-level identification. Additionally, targeting gene regions using universal primers can bias results as some taxa will amplify more effectively than others. To improve the ability of DNA metabarcoding to resolve flowering plant species (angiosperms) within environmental samples, and to reduce bias in amplification, we developed a multi-gene targeted capture method that simultaneously targets 20 chloroplast gene regions in a single assay across all flowering plant species. Using this approach, we effectively recovered multiple chloroplast gene regions for three species within artificial DNA mixtures down to 0.001 ng/µL of DNA. We tested the detection level of this approach, successfully recovering target genes for 10 flowering plant species. Finally, we applied this approach to sediment samples containing unknown compositions of environmental DNA and confidently detected plant species that were later verified with observation data. Targeting multiple chloroplast gene regions in environmental samples enabled species-level information to be recovered from complex DNA mixtures. Thus, the method developed here provides improved data on community composition, which can be used to better understand flowering plant assemblages in environmental samples.


Genome ◽  
2020 ◽  
pp. 1-34
Author(s):  
Andreas Kolter ◽  
Birgit Gemeinholzer

The problem of low species-level identification rates in plants by DNA barcoding is exacerbated by the fact that reference databases are far from being comprehensive. We investigate the impact of increased sampling depth on identification success by analyzing the efficacy of established plant barcode marker sequences (rbcL, matK, trnL-trnF, psbA-trnH, ITS). Adding sequences of the same species to the reference database led to an increase in correct species assignment of +10.9% for rbcL and +19.0% for ITS. Simultaneously, erroneous identification dropped from ∼40% to ∼12.5%. Despite its evolutionary constraints, ITS showed the highest identification rate and identification gain by increased sampling effort, which makes it a very suitable marker in the planning phase of a barcode study. The limited sequence availability of trnL-trnF is problematic for an otherwise very promising plastid plant barcoding marker. Future developments in machine learning algorithms have the potential to give new impetus to plant barcoding, but are dependent on extensive reference databases. We expect that our results will be incorporated into future plans for the development of DNA barcoding reference databases and will lead to these being developed with greater depth and taxonomic coverage.


PLoS ONE ◽  
2021 ◽  
Vol 16 (6) ◽  
pp. e0253772
Author(s):  
Rosa E. Prahl ◽  
Shahjahan Khan ◽  
Ravinesh C. Deo

Many fungi require specific growth conditions before they can be identified. Direct environmental DNA sequencing is advantageous, although for some taxa, specific primers need to be used for successful amplification of molecular markers. The internal transcribed spacer (ITS) region is the preferred DNA barcode for fungi. However, inter- and intra-specific distances in ITS sequences vary greatly among some fungal groups; consequently, ITS alone is not a reliable tool for species delineation. Ampelomyces, mycoparasites of the fungal phytopathogen order Erysiphales, can have ITS genetic differences of up to 15%; this may lead to their misidentification as other closely related unknown fungi. Indeed, Ampelomyces were initially misidentified as other pycnidial mycoparasites, but subsequent research showed that they differ in pycnidia morphology and culture characteristics. We investigated whether the ITS2 nucleotide content and secondary structure differed between Ampelomyces ITS2 sequences and those unrelated to this genus. To this end, we retrieved all ITS sequences referred to as Ampelomyces from the GenBank database. This analysis revealed that fungal ITS environmental DNA sequences that do not belong to this genus are still being deposited in the database under the name Ampelomyces. We also detected variations in the conserved hybridization model of the ITS2 proximal 5.8S and 28S stem from two Ampelomyces strains. Moreover, we suggest for the first time that pseudogenes form in the ITS region of this mycoparasite. A phylogenetic analysis based on ITS2 sequences-structures grouped the environmental sequences of putative Ampelomyces into a different clade from the Ampelomyces-containing clades. Indeed, ITS2 analysis improved the resolution of genetic distances between Ampelomyces and the putative Ampelomyces sequences. 
Each clade represented a distinct consensus ITS2 S2, which suggested that different pre-ribosomal RNA (pre-rRNA) processes occur across different lineages. This study recommends the use of ITS2 S2s as an important tool to analyse environmental sequencing data and unveil the underlying evolutionary processes.


2021 ◽  
Vol 4 ◽  
Author(s):  
François Keck ◽  
Florian Altermatt

Reference databases of sequences that have been taxonomically assigned are a key element for DNA-based identification of organisms. Accurate and complete reference databases are necessary to associate a correct taxonomic name with the sequences obtained in studies using metabarcoding. Today many research projects using DNA metabarcoding include the development of a custom reference database, often derived from large repositories like GenBank. At the same time, many projects are focussing on the development of ready-to-use databases validated by experts and targeting specific markers and taxonomic groups. While mainstream tools such as spreadsheet software may be suitable for managing small databases, they quickly become insufficient when the amount of data increases and validation operations become more complex. There is a clear need for user-friendly and powerful tools to manipulate biological sequences and manage reference databases. The R language, which is free software and has already been adopted by many researchers to perform their analyses, is highly suitable for developing such tools. In this talk, we will outline the approach we recommend for handling small- to medium-sized reference databases, which still account for the majority of projects. We will argue that a simple tabular approach, where each sequence constitutes an observation, may be the most adequate. While such a single table may be less flexible and less optimized than relational databases or more complex data structures, it is easy to maintain and allows the direct use of modern dataframe-centric tools. We will specifically present and discuss two R packages that can be used jointly to make reference database development more accessible and more reproducible. First, we will briefly introduce bioseq (Keck 2020), which is dedicated to biological sequence manipulation and analysis. 
The package implements classes and functions to make analyses of complex datasets including DNA, RNA or protein sequences as simple as possible. The strength of bioseq is to provide standard and more advanced functions to perform low level operations through a simple and consistent programming interface. Then we will present refdb, which has been developed as an environment for semi-automatic and assisted construction of reference databases. The refdb package is a reference database manager offering a set of powerful functions to import, organize, clean, filter, audit and export the data. We will outline how these two packages together can speed up reference database generation and handling, and contribute to standardization and repeatability in metabarcoding studies.
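
The "one sequence per row" layout described in this abstract can be sketched in a few lines. The example below uses Python and its standard library purely for illustration; it does not reproduce the bioseq or refdb API, and the column names, records, function names, and the 10% ambiguity cut-off are all hypothetical choices.

```python
import csv
import io

IUPAC_DNA = set("ACGTRYSWKMBDHVN")  # standard IUPAC nucleotide codes
FIELDS = ["id", "genus", "species", "sequence"]

# Hypothetical records: one row (observation) per reference sequence.
ROWS = [
    {"id": "seq1", "genus": "Genus_a", "species": "Genus_a sp1", "sequence": "acgttgca"},
    {"id": "seq2", "genus": "Genus_a", "species": "", "sequence": "ACGTTGCA"},
    {"id": "seq3", "genus": "Genus_b", "species": "Genus_b sp1", "sequence": "ACGTTGCAXX"},
    {"id": "seq4", "genus": "Genus_b", "species": "Genus_b sp2", "sequence": "ACGTNNNNCA"},
]


def ambiguous_fraction(seq):
    """Fraction of bases that are not an unambiguous A, C, G or T."""
    return sum(base not in "ACGT" for base in seq) / len(seq)


def clean(rows, max_ambiguous=0.1):
    """Keep rows with a species name and a valid, mostly unambiguous sequence."""
    kept = []
    for row in rows:
        seq = row["sequence"].upper()
        if not row["species"]:
            continue  # no species-level identification
        if not seq or not set(seq) <= IUPAC_DNA:
            continue  # empty sequence or non-IUPAC characters
        if ambiguous_fraction(seq) > max_ambiguous:
            continue  # too many ambiguous bases
        kept.append({**row, "sequence": seq})
    return kept


def to_csv(rows):
    """Export the table as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(to_csv(clean(ROWS)))
```

Because every filter operates on one flat table, each cleaning step is a plain row-wise test. That is the maintainability argument the abstract makes for the tabular approach over relational structures.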

