RESCRIPt: Reproducible sequence taxonomy reference database management

Nucleotide sequence and taxonomy reference databases are critical resources for widespread applications including marker-gene and metagenome sequencing for microbiome analysis, diet metabarcoding, and environmental DNA (eDNA) surveys. Reproducibly generating, managing, using, and evaluating nucleotide sequence and taxonomy reference databases creates a significant bottleneck for researchers aiming to generate custom sequence databases. Furthermore, database composition drastically influences results, and lack of standardization limits cross-study comparisons. To address these challenges, we developed RESCRIPt, a Python 3 software package and QIIME 2 plugin for reproducible generation and management of reference sequence taxonomy databases, including dedicated functions that streamline creating databases from popular sources, and functions for evaluating, comparing, and interactively exploring qualitative and quantitative characteristics across reference databases. To highlight the breadth and capabilities of RESCRIPt, we provide several examples for working with popular databases for microbiome profiling (SILVA, Greengenes, NCBI-RefSeq, GTDB), eDNA and diet metabarcoding surveys (BOLD, GenBank), as well as for genome comparison. We show that bigger is not always better, and reference databases with standardized taxonomies and those that focus on type strains have quantitative advantages, though may not be appropriate for all use cases. Most databases appear to benefit from some curation (quality filtering), though sequence clustering appears detrimental to database quality. Finally, we demonstrate the breadth and extensibility of RESCRIPt for reproducible workflows with a comparison of global hepatitis genomes. RESCRIPt provides tools to democratize the process of reference database acquisition and management, enabling researchers to reproducibly and transparently create reference materials for diverse research applications. 
RESCRIPt is released under a permissive BSD-3 license at https://github.com/bokulich-lab/RESCRIPt.

RESCRIPt: Reproducible sequence taxonomy reference database management for the masses

10.1101/2020.10.05.326504 ◽

2020 ◽

Cited By ~ 1

Author(s):

Michael S. Robeson ◽

Devon R. O’Rourke ◽

Benjamin D. Kaehler ◽

Michal Ziemski ◽

Matthew R. Dillon ◽

...

Keyword(s):

Nucleotide Sequence ◽

Marker Gene ◽

Environmental Dna ◽

Reference Sequence ◽

Genome Comparison ◽

Reference Database ◽

Reference Databases ◽

Quality Filtering ◽

Metagenome Sequencing ◽

The Masses

Background: Nucleotide sequence and taxonomy reference databases are critical resources for widespread applications including marker-gene and metagenome sequencing for microbiome analysis, diet metabarcoding, and environmental DNA (eDNA) surveys. Reproducibly generating, managing, using, and evaluating nucleotide sequence and taxonomy reference databases creates a significant bottleneck for researchers aiming to generate custom sequence databases. Furthermore, database composition drastically influences results, and lack of standardization limits cross-study comparisons. To address these challenges, we developed RESCRIPt, a software package for reproducible generation and management of reference sequence taxonomy databases, including dedicated functions that streamline creating databases from popular sources, and functions for evaluating, comparing, and interactively exploring qualitative and quantitative characteristics across reference databases. Results: To highlight the breadth and capabilities of RESCRIPt, we provide several examples for working with popular databases for microbiome profiling (SILVA, Greengenes, NCBI-RefSeq, GTDB), eDNA and diet metabarcoding surveys (BOLD, GenBank), as well as for genome comparison. We show that bigger is not always better, and reference databases with standardized taxonomies and those that focus on type strains have quantitative advantages, though may not be appropriate for all use cases. Most databases appear to benefit from some curation (quality filtering), though sequence clustering appears detrimental to database quality. Finally, we demonstrate the breadth and extensibility of RESCRIPt for reproducible workflows with a comparison of global hepatitis genomes. Conclusions: RESCRIPt provides tools to democratize the process of reference database acquisition and management, enabling researchers to reproducibly and transparently create reference materials for diverse research applications.
RESCRIPt is released under a permissive BSD-3 license at https://github.com/bokulich-lab/RESCRIPt.
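
The curation step mentioned in the abstract (quality filtering) could look roughly like the following sketch. The criteria used here, a cap on degenerate IUPAC bases and a cap on homopolymer run length, along with the threshold values and the function name, are assumptions for illustration. The abstract does not specify them, and this is not RESCRIPt's own code.

```python
import re

# IUPAC codes that stand for more than one possible nucleotide.
DEGENERATE = set("RYSWKMBDHVN")


def passes_filter(seq, max_degenerates=4, max_homopolymer=7):
    """Return True if a sequence survives the (assumed) quality criteria."""
    seq = seq.upper()
    n_degenerate = sum(base in DEGENERATE for base in seq)
    # Each regex match is one maximal run of a single repeated character.
    runs = (len(m.group(0)) for m in re.finditer(r"(.)\1*", seq))
    longest_run = max(runs, default=0)
    return n_degenerate <= max_degenerates and longest_run <= max_homopolymer


# Hypothetical toy sequences, not taken from any real database.
seqs = {
    "clean": "ACGTACGTACGT",
    "homopolymer": "ACGT" + "A" * 8,
    "degenerate": "ACGTNNNNNACGT",
}
kept = [name for name, seq in seqs.items() if passes_filter(seq)]
```

Only the sequence labelled "clean" is retained here; the other two each violate one of the assumed criteria.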

Overcoming limitations to environmental DNA studies: A coastal temperate reference sequence database for multiple chloroplast gene regions generated in a single assay.

10.22541/au.163252330.05592688/v1 ◽

2021 ◽

Author(s):

Nicole Foster ◽

Kor-jent Dijk ◽

Ed Biffin ◽

Jennifer Young ◽

Vicki Thomson ◽

...

Keyword(s):

Dna Sequences ◽

Dna Barcode ◽

Environmental Dna ◽

Reference Sequence ◽

Reference Database ◽

Chloroplast Gene ◽

Coastal Plants ◽

Reference Databases ◽

Targeted Capture ◽

Comprehensive Reference

A proliferation in environmental DNA (eDNA) research has increased the reliance on reference sequence databases to assign unknown DNA sequences to known taxa. Without comprehensive reference databases, DNA extracted from environmental samples cannot be correctly assigned to taxa, limiting the use of this genetic information to identify organisms in unknown sample mixtures. For animals, standard metabarcoding practices involve amplification of the mitochondrial cytochrome c oxidase subunit 1 (CO1) region, which is universally amplifiable across the majority of animal taxa. This region, however, does not work well as a DNA barcode for plants and fungi, and there is no similar universal single barcode locus with the same species resolution. Generating reference sequences has therefore been more difficult, and several loci have been proposed for use in parallel to reach species-level identification. For this reason, we developed a multi-gene targeted capture approach to generate reference DNA sequences for plant taxa across 20 target chloroplast gene regions in a single assay. We successfully compiled a reference database for 93 temperate coastal plants including seagrasses, mangroves, and saltmarshes/samphires. We demonstrate the importance of a comprehensive reference database to prevent species going undetected in eDNA studies. We also investigate how using multiple chloroplast gene regions impacts the ability to discriminate between taxa.

PEMA v2: addressing metabarcoding bioinformatics analysis challenges

ARPHA Conference Abstracts ◽

10.3897/aca.4.e64902 ◽

2021 ◽

Vol 4 ◽

Author(s):

Haris Zafeiropoulos ◽

Christina Pavloudi ◽

Evangelos Pafilis

Keyword(s):

High Performance ◽

Bioinformatics Analysis ◽

Marker Gene ◽

Environmental Dna ◽

Third Party ◽

Reference Database ◽

Marker Genes ◽

Specific Reference ◽

Taxonomic Assignment ◽

Internal Joint

Environmental DNA (eDNA) and metabarcoding have launched a new era in bio- and eco-assessment over recent years (Ruppert et al. 2019). The simultaneous identification, at the lowest taxonomic level possible, of a mixture of taxa from a great range of samples is now feasible; thus, the number of eDNA metabarcoding studies has increased radically (Deiner and 2017). While the experimental part of eDNA metabarcoding can be rather challenging depending on the special characteristics of the different studies, computational issues are considered to be its major bottlenecks. Among the latter, the bioinformatics analysis of metabarcoding data and especially the taxonomy assignment of the sequences are fundamental challenges. Many steps are required to obtain taxonomically assigned matrices from raw data. For most of these, a plethora of tools are available. However, each tool's execution parameters need to be tailored to reflect each experiment's idiosyncrasy; thus, tuning the bioinformatics analysis has proved fundamental (Kamenova 2020). The computational capacity of high-performance computing (HPC) systems is frequently required for such analyses. On top of that, the imperfect completeness and correctness of the reference taxonomy databases are a further important issue (Loos et al. 2020). Based on third-party tools, we have developed the Pipeline for Environmental Metabarcoding Analysis (PEMA), an HPC-centered, containerized assembly of key metabarcoding analysis tools. PEMA combines state-of-the-art technologies and algorithms with an easy get-set-use framework, allowing researchers to tune each study thoroughly thanks to roll-back checkpoints and on-demand partial pipeline execution features (Zafeiropoulos 2020). Soon after PEMA was released, users highlighted two main limitations: PEMA supported 4 marker genes and was bound to specific reference databases.
In this new version of PEMA, the analysis of any marker gene is available thanks to a new feature that allows classifiers to be trained on a user-provided reference database, which is then used for taxonomic assignment. Fig. 1 shows the PEMA modules related to taxonomy assignment; all those outside the dashed box have been developed for this new PEMA release. As shown, the RDPClassifier has been trained with Midori reference 2 and has been added as an option, classifying not only metazoans but sequences from all taxonomic groups of Eukaryotes for the case of the COI marker gene. A PEMA documentation site is now also available. PEMA.v2 containers are available via DockerHub and SingularityHub as well as through the Elixir Greece AAI Service. PEMA has also been selected to be part of the LifeWatch ERIC Internal Joint Initiative for the analysis of ARMS data and will soon be available through the Tesseract VRE.

Exact sequence variants should replace operational taxonomic units in marker gene data analysis

10.1101/113597 ◽

2017 ◽

Cited By ~ 7

Author(s):

Benjamin J Callahan ◽

Paul J McMurdie ◽

Susan P Holmes

Keyword(s):

De Novo ◽

Marker Gene ◽

Taxonomic Resolution ◽

Reference Database ◽

Sequence Variants ◽

Sequencing Data ◽

Operational Taxonomic Units ◽

The Status ◽

Reference Databases ◽

Gene Data

Recent advances have made it possible to analyze high-throughput marker-gene sequencing data without resorting to the customary construction of molecular operational taxonomic units (OTUs): clusters of sequencing reads that differ by less than a fixed dissimilarity threshold. New methods control errors sufficiently that sequence variants (SVs) can be resolved exactly, down to the level of single-nucleotide differences over the sequenced gene region. The benefits of finer taxonomic resolution are immediately apparent, and arguments for SV methods have focused on their improved resolution. Less obvious, but we believe more important, are the broad benefits deriving from the status of SVs as consistent labels with intrinsic biological meaning identified independently from a reference database. Here we discuss how those features grant SVs the combined advantages of closed-reference OTUs (including computational costs that scale linearly with study size, simple merging between independently processed datasets, and forward prediction) and of de novo OTUs (including accurate diversity measurement and applicability to communities lacking deep coverage in reference databases). We argue that the improvements in reusability, reproducibility and comprehensiveness are sufficiently great that SVs should replace OTUs as the standard unit of marker gene analysis and reporting.
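
The contrast between threshold-clustered OTUs and exact sequence variants can be illustrated with a toy sketch. It assumes equal-length reads, Hamming dissimilarity, and a simple greedy centroid clustering; the function names and the 3% threshold are illustrative choices. The `exact_variants` function only counts unique reads and skips the error-control step that real SV methods perform, so this is not the authors' method.

```python
from collections import Counter


def dissimilarity(a, b):
    """Fraction of mismatched positions between two equal-length reads."""
    if len(a) != len(b):
        raise ValueError("toy example assumes equal-length reads")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def greedy_otus(reads, threshold=0.03):
    """Greedy de novo clustering: abundant unique reads seed the centroids."""
    counts = Counter(reads)
    centroids = []
    otus = {}
    # Visit unique reads from most to least abundant (ties broken by sequence).
    for seq, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        for centroid in centroids:
            if dissimilarity(seq, centroid) < threshold:
                otus[centroid] += n  # absorbed into an existing OTU
                break
        else:
            centroids.append(seq)  # no centroid close enough: new OTU
            otus[seq] = n
    return otus


def exact_variants(reads):
    """Treat every distinct read as its own unit (no denoising here)."""
    return dict(Counter(reads))


# Hypothetical 40-nt reads: one parent, a 1-mismatch and a 2-mismatch variant.
base = "ACGT" * 10
one_off = "T" + base[1:]    # 1/40 = 2.5% dissimilar, merged at 3%
two_off = base[:-2] + "AA"  # 2/40 = 5% dissimilar, kept separate
reads = [base] * 5 + [one_off] * 3 + [two_off] * 2

otus = greedy_otus(reads)
variants = exact_variants(reads)
```

With these toy reads the clustering yields two OTUs while three exact variants are retained. The OTU labels depend on which reads happen to seed the centroids, whereas the variant labels are the sequences themselves.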

Are Genetic Reference Libraries Sufficient for Environmental DNA Metabarcoding of Mekong River Basin Fish?

Water ◽

10.3390/w13131767 ◽

2021 ◽

Vol 13 (13) ◽

pp. 1767

Author(s):

Christopher L. Jerde ◽

Andrew R. Mahon ◽

Teresa Campbell ◽

Mary E. McElroy ◽

Kakada Pin ◽

...

Keyword(s):

River Basin ◽

Fish Species ◽

Sequence Similarity ◽

Environmental Dna ◽

Mekong River ◽

Reference Sequence ◽

Reference Database ◽

Surveillance Program ◽

Sequence Coverage ◽

Mekong River Basin

Environmental DNA (eDNA) metabarcoding approaches to surveillance have great potential for advancing biodiversity monitoring and fisheries management. For eDNA metabarcoding, having a genetic reference sequence identified to fish species is vital to reduce detection errors. Detection errors will increase when there is no reference sequence for a species or when the reference sequence is the same between different species at the same sequenced region of DNA. These errors will be acute in high biodiversity systems like the Mekong River Basin, where many fish species have no reference sequences and many congeners have the same or very similar sequences. Recently developed tools allow for inspection of reference database coverage and the sequence similarity between species. These evaluation tools provide a useful pre-deployment approach to evaluate the breadth of fish species richness potentially detectable using eDNA metabarcoding. Here we combined established species lists for the Mekong River Basin, resulting in a list of 1345 fish species, evaluated the genetic library coverage across 23 peer-reviewed primer pairs, and measured the species specificity for one primer pair across four genera to demonstrate that coverage of genetic reference libraries is but one consideration before deploying an eDNA metabarcoding surveillance program. This analysis identifies many of the eDNA metabarcoding knowledge gaps with the aim of improving the reliability of eDNA metabarcoding applications in the Mekong River Basin. Genetic reference libraries perform best for common and commercially valuable Mekong fishes, while sequence coverage does not exist for many regional endemics, IUCN data deficient, and threatened fishes.
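
The two sources of detection error described above (species with no reference sequence, and distinct species sharing an identical sequence at the amplified region) can be checked with a short sketch. The species names, sequences, and function names below are hypothetical placeholders, not data or tools from the study.

```python
from collections import defaultdict


def coverage(species_list, reference):
    """Split a species list into those with and without reference sequences.

    `reference` maps species name -> list of sequences for one marker region.
    """
    covered = sorted(s for s in species_list if reference.get(s))
    missing = sorted(s for s in species_list if not reference.get(s))
    return covered, missing


def shared_sequences(reference):
    """Find sequences that occur under more than one species name."""
    owners = defaultdict(set)
    for species, seqs in reference.items():
        for seq in seqs:
            owners[seq].add(species)
    return {seq: sorted(spp) for seq, spp in owners.items() if len(spp) > 1}


# Hypothetical toy data: two congeners share one sequence, one species is absent.
species_list = ["Genus_a sp1", "Genus_a sp2", "Genus_b sp1", "Genus_c sp1"]
reference = {
    "Genus_a sp1": ["ACGTACGT"],
    "Genus_a sp2": ["ACGTACGT", "ACGTACGA"],
    "Genus_b sp1": ["TTGTACGT"],
}

covered, missing = coverage(species_list, reference)
ambiguous = shared_sequences(reference)
```

Here `missing` lists the species that could never be detected with this library, and `ambiguous` lists sequences that cannot be resolved to a single species.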

Contamination in Reference Sequence Databases: Time for Divide-and-Rule Tactics

Frontiers in Microbiology ◽

10.3389/fmicb.2021.755101 ◽

2021 ◽

Vol 12 ◽

Author(s):

Valérian Lupo ◽

Mick Van Vlierberghe ◽

Hervé Vanderschuren ◽

Frédéric Kerff ◽

Denis Baurain ◽

...

Keyword(s):

Reference Sequence ◽

Reference Database ◽

Contamination Level ◽

Gene Markers ◽

Genome Wide ◽

A Genome ◽

Genomic Studies ◽

Reference Databases ◽

Single Method ◽

Divide And Rule

Contaminating sequences in public genome databases are a pervasive issue with potentially far-reaching consequences. This problem has attracted much attention in the recent literature and many different tools are now available to detect contaminants. Although these methods are based on diverse algorithms that can sometimes produce widely different estimates of the contamination level, the majority of genomic studies rely on a single method of detection, which represents a risk of systematic error. In this work, we used two orthogonal methods to assess the level of contamination among National Center for Biotechnology Information Reference Sequence Database (RefSeq) bacterial genomes. First, we applied the most popular solution, CheckM, which is based on gene markers. We then complemented this approach with a genome-wide method, termed Physeter, which now implements a k-folds algorithm to avoid inaccurate detection due to potential contamination of the reference database. We demonstrate that CheckM cannot currently be applied to all available genomes and bacterial groups. While it performed well on the majority of RefSeq genomes, it produced dubious results for 12,326 organisms. Among those, Physeter identified 239 contaminated genomes that had been missed by CheckM. In conclusion, we emphasize the importance of using multiple methods of detection while providing an upgrade of our own detection tool, Physeter, which minimizes incorrect contamination estimates in the context of unavoidably contaminated reference databases.

Assessing accuracy and completeness of GenBank for eDNA metabarcoding: towards a reliable marine fish reference database

ARPHA Conference Abstracts ◽

10.3897/aca.4.e64671 ◽

2021 ◽

Vol 4 ◽

Author(s):

Cristina Claver ◽

Oriol Canals ◽

Naiara Rodriguez-Ezpeleta

Keyword(s):

Ribosomal Rna ◽

Gap Analysis ◽

Environmental Dna ◽

Reference Database ◽

Fish Diversity ◽

Cyt B ◽

Taxonomic Assignment ◽

Reference Databases ◽

Taxonomic Groups ◽

High Level

Environmental DNA (eDNA) metabarcoding, the process of sequencing DNA collected from the environment to produce biodiversity inventories, is increasingly being applied to assess fish diversity and distribution in marine environments. Yet the successful application of this technique relies heavily on accurate and complete reference databases for taxonomic assignment. The most used markers for fish eDNA metabarcoding studies are the cytochrome c oxidase subunit 1 (COI), 16S ribosomal RNA (16S), 12S ribosomal RNA (12S) and cytochrome b (cyt b) genes, whose sequences are usually retrieved from GenBank, the largest DNA sequence database and a worldwide public resource for genetic studies. Thus, the completeness and accuracy of GenBank are critical to deriving reliable estimates from fish eDNA metabarcoding data. Here, we have i) compiled the checklist of European marine fishes, ii) performed a gap analysis of the four genes and, within COI and 12S, also of the most used barcodes for fish, and iii) developed a workflow to detect potentially incorrect records in GenBank. We found that of the 1965 species in the checklist (1761 Actinopterygii, 189 Elasmobranchii, 9 Holocephali, 4 Petromyzonti and 2 Myxini), about 70% have sequences for COI, whereas fewer have sequences for 12S, 16S and cyt b (45-55%). Among the species for which COI and 12S sequences are available, about 60% and 40%, respectively, have sequences covering the most used barcodes. The analysis of pairwise distances between sequences revealed pairs belonging to the same species with significantly low similarity and pairs belonging to different high-level taxonomic groups (class, order) with significantly high similarity. In light of this further confirmation of a substantial number of incorrect records in GenBank, we propose a method for identifying and removing spurious sequences to create reliable and accurate reference databases for eDNA metabarcoding.
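
The pairwise check described above can be sketched as follows. This is a minimal illustration, assuming pre-aligned sequences of equal length, simple percent identity, and arbitrary cut-offs of 0.80 and 0.98; the record layout, accession labels, thresholds, and function names are all hypothetical and are not the workflow proposed by the authors. Only the class rank is compared here.

```python
from itertools import combinations


def identity(a, b):
    """Fraction of identical positions between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def flag_suspect_pairs(records, low=0.80, high=0.98):
    """Flag record pairs whose similarity contradicts their taxonomy.

    `records` is a list of (accession, species, class_name, sequence) tuples.
    """
    flags = []
    for r1, r2 in combinations(records, 2):
        sim = identity(r1[3], r2[3])
        if r1[1] == r2[1] and sim < low:
            flags.append((r1[0], r2[0], "same species, low similarity"))
        elif r1[2] != r2[2] and sim > high:
            flags.append((r1[0], r2[0], "different class, high similarity"))
    return flags


# Hypothetical records; labels and sequences are invented for illustration.
records = [
    ("rec1", "Species A", "Actinopterygii", "ACGTACGTACGTACGTACGT"),
    ("rec2", "Species A", "Actinopterygii", "TGCATGCATGCATGCATGCA"),
    ("rec3", "Species B", "Elasmobranchii", "ACGTACGTACGTACGTACGT"),
    ("rec4", "Species B", "Elasmobranchii", "ACGTACGTACGTACGTACGA"),
]
flags = flag_suspect_pairs(records)
```

A flagged pair only indicates that at least one of the two records deserves manual inspection; it does not say which one is mislabelled.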

ensembleTax: an R package for determinations of ensemble taxonomic assignments of phylogenetically-informative marker gene sequences

PeerJ ◽

10.7717/peerj.11865 ◽

2021 ◽

Vol 9 ◽

pp. e11865

Author(s):

Dylan Catlett ◽

Kevin Son ◽

Connie Liang

Keyword(s):

Marker Gene ◽

Ensemble Methods ◽

R Package ◽

Reference Database ◽

Gene Sequences ◽

Taxonomic Assignment ◽

Informative Marker ◽

Data Set ◽

Reference Databases ◽

Taxonomic Assignments

Background: High-throughput sequencing of phylogenetically informative marker genes is a widely used method to assess the diversity and composition of microbial communities. Taxonomic assignment of sampled marker gene sequences (referred to as amplicon sequence variants, or ASVs) imparts ecological significance to these genetic data. To assign taxonomy to an ASV, a taxonomic assignment algorithm compares the ASV to a collection of reference sequences (a reference database) with known taxonomic affiliations. However, many taxonomic assignment algorithms and reference databases are available, and the optimal algorithm and database for a particular scientific question is often unclear. Here, we present the ensembleTax R package, which provides an efficient framework for integrating taxonomic assignments predicted with any number of taxonomic assignment algorithms and reference databases to determine ensemble taxonomic assignments for ASVs. Methods: The ensembleTax R package relies on two core algorithms: taxmapper and assign.ensembleTax. The taxmapper algorithm maps taxonomic assignments derived from one reference database onto the taxonomic nomenclature (a set of taxonomic naming and ranking conventions) of another reference database. The assign.ensembleTax algorithm computes ensemble taxonomic assignments for each ASV in a data set based on any number of taxonomic assignments determined with independent methods. Various parameters allow analysts to prioritize obtaining either more ASVs with more predicted clade names or more robust clade name predictions supported by multiple independent methods in ensemble taxonomic assignments. Results: The ensembleTax R package is used to compute two sets of ensemble taxonomic assignments for a collection of protistan ASVs sampled from the coastal ocean. Comparisons of taxonomic assignments predicted by individual methods with those predicted by ensemble methods show that conservative implementations of the ensembleTax package minimize disagreements between taxonomic assignments predicted by individual and ensemble methods, but result in ASVs with fewer ranks assigned taxonomy. Less conservative implementations of the ensembleTax package result in an increased fraction of ASVs classified at all taxonomic ranks, but increase the number of ASVs for which ensemble assignments disagree with those predicted by individual methods. Discussion: We discuss how implementation of the ensembleTax R package may be optimized to address specific scientific objectives based on the results of the application of the ensembleTax package to marine protist communities. While further work is required to evaluate the accuracy of ensemble taxonomic assignments relative to taxonomic assignments predicted by individual methods, we also discuss scenarios where ensemble methods are expected to improve the accuracy of taxonomy prediction for ASVs.
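
The trade-off between conservative and less conservative ensemble assignments can be illustrated with a generic voting sketch. It is written in Python rather than R and does not reproduce the ensembleTax API or its parameters: the rank list, the `min_support` rule, and the tie handling are assumptions made for illustration, and the inputs are assumed to be already mapped onto a common nomenclature.

```python
from collections import Counter

RANKS = ["kingdom", "phylum", "class"]


def ensemble_assignment(assignments, min_support=2):
    """Combine per-method taxonomies for one ASV, rank by rank.

    `assignments` holds one list per method, aligned to RANKS, with None for
    unassigned ranks. A name is accepted only if at least `min_support`
    methods agree on it and no other name is equally supported.
    """
    result = []
    for i in range(len(RANKS)):
        votes = Counter(a[i] for a in assignments if a[i] is not None)
        top = votes.most_common(2)
        if not top or top[0][1] < min_support:
            break  # too little support: leave this and lower ranks unassigned
        if len(top) == 2 and top[0][1] == top[1][1]:
            break  # tie between competing names
        result.append(top[0][0])
    return result + [None] * (len(RANKS) - len(result))


# Toy assignments for a single ASV from three hypothetical methods.
methods = [
    ["Eukaryota", "Ochrophyta", "Bacillariophyta"],
    ["Eukaryota", "Ochrophyta", None],
    ["Eukaryota", None, None],
]

strict = ensemble_assignment(methods, min_support=3)
moderate = ensemble_assignment(methods, min_support=2)
permissive = ensemble_assignment(methods, min_support=1)
```

Raising `min_support` leaves more ranks unassigned but ensures that every accepted name is backed by several methods. Lowering it assigns more ranks at the cost of relying on a single method for some of them.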

Adaptability of Ultrasonic Lamb Wave Touchscreen to the Variations in Touch Force and Touch Area

Sensors ◽

10.3390/s21051736 ◽

2021 ◽

Vol 21 (5) ◽

pp. 1736

Author(s):

Zengchong Yang ◽

Xiucheng Liu ◽

Bin Wu ◽

Ren Liu

Keyword(s):

Lamb Wave ◽

Weight Coefficient ◽

The Self ◽

Reference Database ◽

Learning Method ◽

Improved Method ◽

Large Area ◽

Localization Model ◽

Reference Databases ◽

First Time

Previous studies on the Lamb wave touchscreen (LWT) assumed that an unknown touch had parameters consistent with those of the acoustic fingerprints in the reference database. This study investigated, for the first time, the adaptability of the LWT to variations in touch force and touch area. Databases of acoustic fingerprints were collected automatically with an experimental LWT prototype employing three transmitter–receiver pairs. A self-adaptive, updated weight coefficient for the transmitter–receiver pairs in use was employed to improve the accuracy of the localization model, which was established with a learning method. The performance of the improved method in locating single- and two-touch actions with reference databases of different parameters was carefully evaluated. The robustness of the LWT to variation in touch force varied with the touch area. Moreover, it was feasible to locate large-area touch actions with reference databases of small touch areas, as long as the unknown touch and the reference databases met the condition of equivalent averaged stress.

Toward a global reference database of COI barcodes for marine zooplankton

Marine Biology ◽

10.1007/s00227-021-03887-y ◽

2021 ◽

Vol 168 (6) ◽

Author(s):

Ann Bucklin ◽

Katja T. C. A. Peijnenburg ◽

Ksenia N. Kosobokova ◽

Todd D. O’Brien ◽

Leocadio Blanco-Bercial ◽

...

Keyword(s):

Species Diversity ◽

Dna Sequences ◽

Reference Sequence ◽

Global Ocean ◽

Reference Database ◽

Data Repositories ◽

Marine Zooplankton ◽

The North ◽

Coi Sequences ◽

Taxonomic Groups

Characterization of species diversity of zooplankton is key to understanding, assessing, and predicting the function and future of pelagic ecosystems throughout the global ocean. The marine zooplankton assemblage, including only metazoans, is highly diverse and taxonomically complex, with an estimated ~28,000 species of 41 major taxonomic groups. This review provides a comprehensive summary of DNA sequences for the barcode region of mitochondrial cytochrome oxidase I (COI) for identified specimens. The foundation of this summary is the MetaZooGene Barcode Atlas and Database (MZGdb), a new open-access data and metadata portal that is linked to NCBI GenBank and BOLD data repositories. The MZGdb provides enhanced quality control and tools for assembling COI reference sequence databases that are specific to selected taxonomic groups and/or ocean regions, with associated metadata (e.g., collection georeferencing, verification of species identification, molecular protocols), and tools for statistical analysis, mapping, and visualization. To date, over 150,000 COI sequences for ~5600 described species of marine metazoan plankton (including holo- and meroplankton) are available via the MZGdb portal. This review uses the MZGdb as a resource for summaries of COI barcode data and metadata for important taxonomic groups of marine zooplankton and selected regions, including the North Atlantic, Arctic, North Pacific, and Southern Oceans. The MZGdb is designed to provide a foundation for analysis of species diversity of marine zooplankton based on DNA barcoding and metabarcoding for assessment of marine ecosystems and rapid detection of the impacts of climate change.
