Assessing accuracy and completeness of GenBank for eDNA metabarcoding: towards a reliable marine fish reference database

2021 ◽  
Vol 4 ◽  
Author(s):  
Cristina Claver ◽  
Oriol Canals ◽  
Naiara Rodriguez-Ezpeleta

Environmental DNA (eDNA) metabarcoding, the process of sequencing DNA collected from the environment to produce biodiversity inventories, is increasingly being applied to assess fish diversity and distribution in marine environments. Yet the successful application of this technique relies heavily on accurate and complete reference databases for taxonomic assignment. The most used markers for fish eDNA metabarcoding studies are the cytochrome c oxidase subunit 1 (COI), 16S ribosomal RNA (16S), 12S ribosomal RNA (12S) and cytochrome b (cyt b) genes, whose sequences are usually retrieved from GenBank, the largest DNA sequence database and a worldwide public resource for genetic studies. Thus, the completeness and accuracy of GenBank are critical for deriving reliable estimates from fish eDNA metabarcoding data. Here, we have i) compiled the checklist of European marine fishes, ii) performed a gap analysis of the four genes and, within COI and 12S, also of the most used barcodes for fish, and iii) developed a workflow to detect potentially incorrect records in GenBank. We found that of the 1965 species in the checklist (1761 Actinopterygii, 189 Elasmobranchii, 9 Holocephali, 4 Petromyzonti and 2 Myxini), about 70% have sequences for COI, whereas fewer have sequences for 12S, 16S and cyt b (45-55%). Among the species for which COI and 12S sequences are available, about 60% and 40%, respectively, have sequences covering the most used barcodes. The analysis of pairwise distances between sequences revealed pairs belonging to the same species with significantly low similarity and pairs belonging to different high-level taxonomic groups (class, order) with significantly high similarity. In light of this further confirmation that GenBank contains a substantial number of incorrect records, we propose a method for identifying and removing spurious sequences to create reliable and accurate reference databases for eDNA metabarcoding.
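
The gap analysis and the pairwise check described above can be illustrated with a minimal Python sketch. It is not the authors' workflow: the record layout, the function names and both similarity thresholds are assumptions chosen for illustration, and the sequences are assumed to be already aligned to equal length.

```python
from itertools import combinations


def coverage(checklist, species_with_sequences):
    """Fraction of checklist species that have at least one sequence."""
    checklist = set(checklist)
    return len(checklist & set(species_with_sequences)) / len(checklist)


def identity(a, b):
    """Proportion of identical positions in two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def flag_suspicious(records, same_species_min=0.97, cross_class_max=0.90):
    """Flag record pairs whose similarity contradicts their labels.

    records: list of (record_id, species, taxonomic_class, aligned_sequence).
    Both thresholds are hypothetical and would need tuning per marker.
    """
    flagged = []
    for (id1, sp1, cl1, s1), (id2, sp2, cl2, s2) in combinations(records, 2):
        similarity = identity(s1, s2)
        if sp1 == sp2 and similarity < same_species_min:
            flagged.append((id1, id2, "same species, low similarity"))
        elif cl1 != cl2 and similarity > cross_class_max:
            flagged.append((id1, id2, "different class, high similarity"))
    return flagged
```

Flagged pairs are only candidates for review; deciding which record of a pair is the incorrect one requires additional evidence.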

2018 ◽  
Author(s):  
Jan Axtner ◽  
Alex Crampton-Platt ◽  
Lisa A. Hörig ◽  
Azlan Mohamed ◽  
Charles C.Y. Xu ◽  
...  

Background: The use of environmental DNA, ‘eDNA,’ for species detection via metabarcoding is growing rapidly. We present a co-designed lab workflow and bioinformatic pipeline to mitigate the two most important risks of eDNA: sample contamination and taxonomic mis-assignment. These risks arise from the need for PCR amplification to detect the trace amounts of DNA, combined with the necessity of using short target regions due to DNA degradation. Findings: Our high-throughput workflow minimises these risks via a four-step strategy: (1) technical replication with two PCR replicates and two extraction replicates; (2) use of multiple markers (12S, 16S, CytB); (3) a ‘twin-tagging,’ two-step PCR protocol; (4) use of the probabilistic taxonomic assignment method PROTAX, which can account for incomplete reference databases. As annotation errors in the reference sequences can result in taxonomic mis-assignment, we supply a protocol for curating sequence datasets. For some taxonomic groups and some markers, curation resulted in over 50% of sequences being deleted from public reference databases, due to (1) limited overlap between our target amplicon and reference sequences; (2) mislabelling of reference sequences; (3) redundancy. Finally, we provide a bioinformatic pipeline to process amplicons and conduct PROTAX assignment, and we tested it on an ‘invertebrate derived DNA’ (iDNA) dataset from 1532 leeches from Sabah, Malaysia. Twin-tagging allowed us to detect and exclude sequences with non-matching tags. The smallest DNA fragment (16S) amplified most frequently for all samples, but was less powerful for discriminating at species rank. Using stringent and lax acceptance criteria, we found 162 (stringent) and 190 (lax) vertebrate detections in 95 (stringent) and 109 (lax) leech samples. Conclusions: Our metabarcoding workflow should help research groups increase the robustness of their results and therefore facilitate wider usage of e/iDNA, which is turning into a valuable source of ecological and conservation information on tetrapods.
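
Two of the ideas in this abstract, excluding reads with non-matching tags and accepting detections under a stricter or looser rule, can be sketched in Python as follows. The abstract does not define the acceptance criteria, so the replicate-count rule below is purely an assumption, as are the field names `tag_fwd` and `tag_rev` and both function names.

```python
def filter_twin_tags(reads):
    """Keep only reads whose forward and reverse tags are identical.

    reads: list of dicts with hypothetical keys "tag_fwd" and "tag_rev".
    """
    return [read for read in reads if read["tag_fwd"] == read["tag_rev"]]


def accepted_taxa(detections, min_replicates):
    """Return taxa seen in at least `min_replicates` replicates.

    detections: dict mapping taxon -> set of replicate identifiers.
    The threshold rule is an illustrative assumption, not the published criteria.
    """
    return {
        taxon
        for taxon, replicates in detections.items()
        if len(replicates) >= min_replicates
    }
```

Under this assumed rule, calling `accepted_taxa` with a higher `min_replicates` plays the role of a stringent criterion and a lower value that of a lax one.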


Author(s):  
Nicole Foster ◽  
Kor-jent Dijk ◽  
Ed Biffin ◽  
Jennifer Young ◽  
Vicki Thomson ◽  
...  

A proliferation in environmental DNA (eDNA) research has increased the reliance on reference sequence databases to assign unknown DNA sequences to known taxa. Without comprehensive reference databases, DNA extracted from environmental samples cannot be correctly assigned to taxa, limiting the use of this genetic information to identify organisms in unknown sample mixtures. For animals, standard metabarcoding practices involve amplification of the mitochondrial cytochrome c oxidase subunit 1 (CO1) region, which is universally amplifiable across the majority of animal taxa. This region, however, does not work well as a DNA barcode for plants and fungi, and there is no similar universal single barcode locus with the same species resolution. Therefore, generating reference sequences has been more difficult, and several loci have been suggested for use in parallel to reach species-level identification. For this reason, we developed a multi-gene targeted capture approach to generate reference DNA sequences for plant taxa across 20 target chloroplast gene regions in a single assay. We successfully compiled a reference database for 93 temperate coastal plants including seagrasses, mangroves, and saltmarshes/samphires. We demonstrate the importance of a comprehensive reference database to prevent species going undetected in eDNA studies. We also investigate how using multiple chloroplast gene regions impacts the ability to discriminate between taxa.


2017 ◽  
Author(s):  
Jan-Niklas Macher ◽  
Till-Hendrik Macher ◽  
Florian Leese

Metabarcoding and metagenomic approaches are becoming routine techniques in biodiversity assessment and ecological studies. The assignment of taxonomic information to sequences is challenging, as many reference libraries are lacking information on certain taxonomic groups and can contain erroneous sequences. Combining different reference databases is therefore a promising approach for maximizing taxonomic coverage and reliability of results. This tutorial shows how to use the “BOLD_NCBI_Merger” script to combine sequence data obtained from the National Center for Biotechnology Information (NCBI) GenBank and the Barcode of Life Database (BOLD) and prepare it for taxonomic assignment with the software MEGAN.
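
As a rough illustration of what combining two reference sources involves, the following Python sketch merges records and collapses exact duplicates while keeping track of where each sequence came from. It does not reproduce the “BOLD_NCBI_Merger” script; the record layout and the function name are hypothetical.

```python
def merge_references(ncbi_records, bold_records):
    """Merge two reference sets, deduplicating on (taxon, sequence).

    Each input is an iterable of (accession, taxon, sequence) tuples.
    Returns a list of dicts; "sources" lists every accession that
    contributed an identical taxon/sequence combination.
    """
    merged = {}
    for source, records in (("NCBI", ncbi_records), ("BOLD", bold_records)):
        for accession, taxon, sequence in records:
            key = (taxon, sequence.upper())
            entry = merged.setdefault(
                key, {"taxon": taxon, "sequence": sequence.upper(), "sources": []}
            )
            entry["sources"].append(f"{source}:{accession}")
    return list(merged.values())
```

Keeping the source accessions makes it possible to trace a merged entry back to the original database record if an error is suspected later.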


2021 ◽  
Vol 4 ◽  
Author(s):  
Haris Zafeiropoulos ◽  
Christina Pavloudi ◽  
Evangelos Pafilis

Environmental DNA (eDNA) and metabarcoding have launched a new era in bio- and eco-assessment in recent years (Ruppert et al. 2019). The simultaneous identification, at the lowest taxonomic level possible, of a mixture of taxa from a great range of samples is now feasible; thus, the number of eDNA metabarcoding studies has increased radically (Deiner and 2017). While the experimental part of eDNA metabarcoding can be rather challenging depending on the special characteristics of the different studies, computational issues are considered to be its major bottlenecks. Among the latter, the bioinformatics analysis of metabarcoding data and especially the taxonomy assignment of the sequences are fundamental challenges. Many steps are required to obtain taxonomically assigned matrices from raw data. For most of these, a plethora of tools are available. However, each tool's execution parameters need to be tailored to reflect each experiment's idiosyncrasy; thus, tuning the bioinformatics analysis has proved fundamental (Kamenova 2020). The computation capacity of high-performance computing systems (HPC) is frequently required for such analyses. On top of that, the imperfect completeness and correctness of the reference taxonomy databases is another important issue (Loos et al. 2020). Based on third-party tools, we have developed the Pipeline for Environmental Metabarcoding Analysis (PEMA), an HPC-centered, containerized assembly of key metabarcoding analysis tools. PEMA combines state-of-the-art technologies and algorithms with an easy to get-set-use framework, allowing researchers to tune each study thoroughly thanks to roll-back checkpoints and on-demand partial pipeline execution features (Zafeiropoulos 2020). Once PEMA was released, users soon highlighted two main limitations: PEMA supported only 4 marker genes and was bound to specific reference databases. 
In this new version of PEMA, the analysis of any marker gene is now possible thanks to a new feature that allows classifiers to be trained on a user-provided reference database and then used for taxonomic assignment. Fig. 1 shows the PEMA modules related to taxonomy assignment; all those outside the dashed box have been developed for this new PEMA release. As shown, the RDPClassifier has been trained with Midori reference 2 and has been added as an option, classifying not only metazoans but sequences from all taxonomic groups of Eukaryotes in the case of the COI marker gene. A PEMA documentation site is now also available. PEMA.v2 containers are available via DockerHub and SingularityHub as well as through the Elixir Greece AAI Service. PEMA has also been selected to be part of the LifeWatch ERIC Internal Joint Initiative for the analysis of ARMS data and will soon be available through the Tesseract VRE.


2021 ◽  
Vol 5 ◽  
Author(s):  
Alexis Canino ◽  
Agnès Bouchez ◽  
Christophe Laplace-Treyture ◽  
Isabelle Domaizon ◽  
Frédéric Rimet

Methods for biomonitoring of freshwater phytoplankton are evolving rapidly with eDNA-based methods, which offer great complementarity with microscopy. Metabarcoding approaches have become more common in recent years, with a continuous increase in the amount of data generated. Depending on the researchers and the way they assign barcodes to species (bioinformatic pipelines and molecular reference databases), the taxonomic assignment obtained for HTS DNA reads might vary. This is also true for traditional taxonomic studies by microscopy, with regular adjustments of the classification and taxonomy. For those reasons (leading to non-homogeneous taxonomies), gap analyses and comparisons between studies become even more challenging, and the curation processes to find potential consensus names are time-consuming. Here, we present a web-based application (Phytool), developed with ShinyApp (RStudio), that aims to make the harmonisation of taxonomy easier and more efficient, using a complete and up-to-date taxonomy reference database for freshwater microalgae. Phytool allows users to homogenise and update freshwater phytoplankton taxonomic names from sequence files and data tables uploaded directly to the application. It also gathers barcodes from curated references in a user-friendly way that makes it possible to search for specific organisms. All the data provided are downloadable, with the possibility of applying filters to select only the required taxa and fields (e.g. specific taxonomic ranks). The main goal is to make the connection between microscopy, molecular biology and taxonomy accessible to a broad range of users through different ready-to-use functions. This study estimates that only 25% of species of freshwater phytoplankton in Phytobs are associated with a barcode. We call for an increased effort to enrich reference databases by coupling taxonomy and molecular methods. Phytool should make this crucial work more efficient. 
The application is available at https://caninuzzo.shinyapps.io/phytool_v1/


2021 ◽  
Vol 17 (11) ◽  
pp. e1009581
Author(s):  
Michael S. Robeson ◽  
Devon R. O’Rourke ◽  
Benjamin D. Kaehler ◽  
Michal Ziemski ◽  
Matthew R. Dillon ◽  
...  

Nucleotide sequence and taxonomy reference databases are critical resources for widespread applications including marker-gene and metagenome sequencing for microbiome analysis, diet metabarcoding, and environmental DNA (eDNA) surveys. Reproducibly generating, managing, using, and evaluating nucleotide sequence and taxonomy reference databases creates a significant bottleneck for researchers aiming to generate custom sequence databases. Furthermore, database composition drastically influences results, and lack of standardization limits cross-study comparisons. To address these challenges, we developed RESCRIPt, a Python 3 software package and QIIME 2 plugin for reproducible generation and management of reference sequence taxonomy databases, including dedicated functions that streamline creating databases from popular sources, and functions for evaluating, comparing, and interactively exploring qualitative and quantitative characteristics across reference databases. To highlight the breadth and capabilities of RESCRIPt, we provide several examples for working with popular databases for microbiome profiling (SILVA, Greengenes, NCBI-RefSeq, GTDB), eDNA and diet metabarcoding surveys (BOLD, GenBank), as well as for genome comparison. We show that bigger is not always better, and reference databases with standardized taxonomies and those that focus on type strains have quantitative advantages, though may not be appropriate for all use cases. Most databases appear to benefit from some curation (quality filtering), though sequence clustering appears detrimental to database quality. Finally, we demonstrate the breadth and extensibility of RESCRIPt for reproducible workflows with a comparison of global hepatitis genomes. RESCRIPt provides tools to democratize the process of reference database acquisition and management, enabling researchers to reproducibly and transparently create reference materials for diverse research applications. 
RESCRIPt is released under a permissive BSD-3 license at https://github.com/bokulich-lab/RESCRIPt.


2021 ◽  
Vol 4 ◽  
Author(s):  
François Keck ◽  
Florian Altermatt

Reference databases of sequences that have been taxonomically assigned are a key element for DNA-based identification of organisms. Accurate and complete reference databases are necessary to associate a correct taxonomic name with the sequences obtained in studies using metabarcoding. Today many research projects using DNA metabarcoding include the development of a custom reference database, often derived from large repositories like GenBank. At the same time, many projects are focussing on the development of ready-to-use databases validated by experts and targeting specific markers and taxonomic groups. While mainstream tools such as spreadsheet software may be suitable for managing small databases, they quickly become insufficient when the amount of data increases and validation operations become more complex. There is a clear need for user-friendly and powerful tools to manipulate biological sequences and manage reference databases. The R language, which is free software and has already been adopted by many researchers to perform their analyses, is highly suitable for developing such tools. In this talk, we will outline the approach we recommend for handling small- to middle-sized reference databases, which still make up the majority of projects. We will argue that a simple tabular approach, where each sequence constitutes an observation, may be the most adequate. While such a single table may be less flexible and less optimized than relational databases or more complex data structures, it is easy to maintain and allows the direct use of modern dataframe-centric tools. We will specifically present and discuss two R packages that can be used jointly to make reference database development more accessible and more reproducible. First, we will briefly introduce bioseq (Keck 2020), which is dedicated to biological sequence manipulation and analysis. 
The package implements classes and functions to make analyses of complex datasets including DNA, RNA or protein sequences as simple as possible. The strength of bioseq is to provide standard and more advanced functions to perform low level operations through a simple and consistent programming interface. Then we will present refdb, which has been developed as an environment for semi-automatic and assisted construction of reference databases. The refdb package is a reference database manager offering a set of powerful functions to import, organize, clean, filter, audit and export the data. We will outline how these two packages together can speed up reference database generation and handling, and contribute to standardization and repeatability in metabarcoding studies.
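
The tabular idea, one row per sequence with cleaning expressed as row filters, can be illustrated outside R as well. The following Python sketch is only an analogue of that design and does not use or mirror the bioseq or refdb APIs; the column names, the function name and the default thresholds are assumptions.

```python
def clean_table(rows, min_len=100, max_ambiguous=0.01):
    """Filter a one-row-per-sequence table held as a list of dicts.

    A row is dropped if it has no species name, if its sequence is shorter
    than `min_len`, or if the share of non-ACGT characters exceeds
    `max_ambiguous`. Kept sequences are upper-cased.
    """
    kept = []
    for row in rows:
        sequence = row["sequence"].upper()
        if not row.get("species"):
            continue
        if len(sequence) < max(min_len, 1):  # also guards against empty sequences
            continue
        ambiguous = sum(base not in "ACGT" for base in sequence)
        if ambiguous / len(sequence) > max_ambiguous:
            continue
        kept.append({**row, "sequence": sequence})
    return kept
```

Because every operation takes a table and returns a table, the steps can be chained and rerun, which is the reproducibility argument made for the single-table design.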


2021 ◽  
Vol 4 ◽  
Author(s):  
Liz Davidson

DNA-based identification methods have been shown to have high detection capability and reduced costs compared to traditional methods and can also enable the detection of species that might be missed using traditional methods (e.g. rare species, cryptic species, larval stages). The success of DNA-based identification is dependent on the ‘DNA barcodes’ of target species being present in a barcode reference database. In order to use DNA-based identification methods to assess and monitor UK freshwater arthropods for biodiversity and ecological quality assessments, it is vital that comprehensive reference databases are available. Incomplete reference databases result in many sequences derived from metabarcoding not being assigned to species. Two current projects aim to create collections of high-quality sequences from expertly identified specimens of UK species. The Darwin Tree of Life project aims to sequence the genomes of all the eukaryotic species in Britain and Ireland, and FreshBase aims to create a genomic reference collection for UK freshwater invertebrates. The Barcode of Life Data System (BOLD) is one of the main reference databases for animal barcodes. Prioritising the sequencing of UK freshwater arthropod species that are not yet represented in BOLD would enable more complete identification of UK freshwater biodiversity using metabarcoding and would enable the development of primers to target specific arthropod groups or species. We analysed the coverage of UK freshwater arthropod species in BOLD. Our analyses show that coverage varies between taxonomic groups and that large proportions of sequences in some orders are only represented by privately stored sequences in BOLD. Analyses of intra- and inter-specific variation in sequences stored in BOLD show that misidentifications or errors can reduce the barcode gap in some species, which could cause difficulties in accurately identifying sequences derived from metabarcoding. 
Representation in BOLD by specimens from the UK is extremely low and analyses show that high geographic variation in sequences in some species could be important for accurate DNA-based identification of UK species. Our results have implications for prioritising the sequencing of UK freshwater arthropods and for the quality control of stored sequences in order to reduce the occurrence of misidentifications and errors that could impact the accuracy of DNA-based identification.
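
The effect of a misidentified sequence on the barcode gap can be made concrete with a small Python sketch. It uses one common definition of the gap, the minimum distance to any other species minus the maximum distance within the species; the abstract does not state which definition was used, so this choice and the function name are assumptions.

```python
def barcode_gap(species, labels, dist):
    """Barcode gap for one species: min interspecific minus max intraspecific distance.

    labels: species name for each sequence, in matrix order.
    dist:   symmetric square matrix (list of lists) of pairwise distances.
    A species with a single sequence has a maximum intraspecific distance of 0.
    """
    own = [i for i, label in enumerate(labels) if label == species]
    others = [i for i, label in enumerate(labels) if label != species]
    if not own or not others:
        raise ValueError("need at least one sequence inside and outside the species")
    max_intra = max((dist[i][j] for i in own for j in own if i < j), default=0.0)
    min_inter = min(dist[i][j] for i in own for j in others)
    return min_inter - max_intra
```

A negative value means that some sequences carrying the same name are further apart than sequences carrying different names, which is the pattern a mislabelled record would produce.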


2021 ◽  
Vol 4 ◽  
Author(s):  
Sinziana Rivera ◽  
Valentin Vasselon ◽  
Frederic Rimet ◽  
Agnès Bouchez

Diatoms, macroinvertebrates and fish communities are widely used for the assessment of the ecological status of rivers and lakes. Metabarcoding studies of these communities are usually performed from “bulk” samples in the case of diatoms and macroinvertebrates, and from water samples in the case of fish. Recent studies suggest that aquatic biofilms can physically act as catchers of environmental DNA (eDNA) (e.g. Mariani et al. 2019). Thus, we propose an alternative metabarcoding approach to study macroinvertebrates and fishes directly from this matrix. The capacity of aquatic biofilms to catch macroinvertebrate eDNA was tested using a previous study in Mayotte Island, where both biofilm samples and macroinvertebrate morphological inventories were available at the same river sites (Rivera et al. 2021). First, macroinvertebrate specimens were identified based on their morphological characteristics. Second, DNA was extracted from biofilms, and macroinvertebrate communities were targeted using a standard COI barcode. The resulting morphological and molecular inventories were compared. Our results showed that both methods provided comparable structures and diversities for macroinvertebrate communities when using unassigned OTUs, suggesting that macroinvertebrate DNA is present in biofilms and representative of the communities. However, after taxonomic assignment of OTUs, diversity and richness were no longer correlated. Indeed, several constraints were observed, such as the need for: a) more specific primers to avoid co-amplification of untargeted taxa inhabiting biofilms, b) primers targeting shorter barcodes to sequence more easily the degraded eDNA that may be captured in biofilms, and c) a reference database well adapted to our tropical study sites. 
Finally, although the results of this first study were encouraging, we wanted to test the biofilm approach on organisms that do not inhabit this environmental matrix in order to distinguish between intra- and extracellular DNA. Based on these observations, a second study looking for a fish eDNA signal in aquatic biofilms was performed. Environmental biofilm and water samples were collected in parallel at littoral sites at Lake Geneva. DNA was extracted from these samples, and fish communities were targeted using a standard 12S barcode. The molecular inventories derived from the biofilm and the water samples were compared. Both methods provided comparable taxonomic lists, offering a novel approach for ecological studies related to fish phenology using eDNA in biofilms. Our results open the door to the study of diatoms, macroinvertebrates and fish communities through metabarcoding from a single matrix, reducing sampling efforts and costs.


2021 ◽  
Author(s):  
Bakhtiyor Sheraliev ◽  
Zuogang Peng

Uzbekistan is one of two doubly landlocked countries in the world, and all of its rivers drain into endorheic basins. Although fish diversity is relatively poor in Uzbekistan compared to other regions, the fish fauna of the region has not yet been fully studied. The aim of this study was to establish a reliable barcoding reference database for fish in Uzbekistan. A total of 666 specimens, belonging to 59 species within 39 genera, 16 families, and 9 orders, were subjected to polymerase chain reaction amplification of the barcode region and sequenced. The length of the 666 barcodes was 682 bp. The average K2P distances within species, genera, and families were 0.22%, 6.33%, and 16.46%, respectively. The average interspecific distance was approximately 28.8 times higher than the mean intraspecific distance. The Barcode Index Number (BIN) discordance report showed that the 666 specimens represented 55 BINs, of which five were singletons, 45 were taxonomically concordant, and five were taxonomically discordant. The barcode gap analysis demonstrated that 89.3% of the fish species examined could be discriminated by DNA barcoding. These results provide new insights into fish diversity in the inland waters of Uzbekistan and can provide a basis for further studies on the fish fauna.
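
For readers unfamiliar with the K2P (Kimura two-parameter) distance reported above, a minimal Python sketch of the standard formula follows: d = -1/2 ln(1 - 2P - Q) - 1/4 ln(1 - 2Q), where P and Q are the proportions of transitions and transversions among compared sites. The function name and the handling of gaps and ambiguous bases (such sites are skipped) are assumptions, and the study's own software may treat them differently.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(a, b):
    """Kimura two-parameter distance between two aligned sequences.

    Sites where either sequence has a character other than A, C, G or T
    are ignored. math.log raises ValueError if the sequences are too
    divergent for the formula to be defined.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = compared = 0
    for x, y in zip(a.upper(), b.upper()):
        if x not in "ACGT" or y not in "ACGT":
            continue
        compared += 1
        if x == y:
            continue
        if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            transitions += 1  # purine<->purine or pyrimidine<->pyrimidine
        else:
            transversions += 1  # purine<->pyrimidine
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared
    q = transversions / compared
    return -0.5 * math.log(1 - 2 * p - q) - 0.25 * math.log(1 - 2 * q)
```

As a consistency check on the reported figures, the ratio of the within-genus to the within-species average, 6.33 / 0.22, is about 28.8, which matches the stated fold difference.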

