scholarly journals Fast and sensitive taxonomic assignment to metagenomic contigs

2020 ◽  
Author(s):  
M. Mirdita ◽  
M. Steinegger ◽  
F. Breitwieser ◽  
J. Söding ◽  
E. Levy Karin

SummaryMMseqs2 taxonomy is a new tool to assign taxonomic labels to metagenomic contigs. It extracts all possible protein fragments from each contig, quickly retains those that can contribute to taxonomic annotation, assigns them with robust labels and determines the contig’s taxonomic identity by weighted voting. Its fragment extraction step is suitable for the analysis of all domains of life. MMseqs2 taxonomy is 2-18x faster than state-of-the-art tools and also contains new modules for creating and manipulating taxonomic reference databases as well as reporting and visualizing taxonomic assignments.AvailabilityMMseqs2 taxonomy is part of the MMseqs2 free open-source software package available for Linux, macOS and Windows at https://mmseqs.com.

2020 ◽  
Author(s):  
Teresita M. Porter ◽  
Mehrdad Hajibabaei

AbstractBackgroundMulti-marker metabarcoding is increasingly being used to generate biodiversity information across different domains of life from microbes to fungi to animals such as in ecological and environmental studies. Current popular bioinformatic pipelines support microbial and fungal marker analysis, while ad hoc methods are used to process animal metabarcode markers from the same study. The purpose of this paper is to introduce MetaWorks, a ‘meta’barcode pipeline that does ‘the works’ and supports the bioinformatic processing of various metabarcoding markers including rRNA and their spacers as well as protein coding loci.ResultsMetaWorks provides a Conda environment to quickly gather most of the programs and dependencies for the pipeline. MetaWorks is automated using Snakemake to ensure reproducibility and scalability. We have supplemented existing RDP-trained classifiers for SSU (prokaryotes), ITS (fungi), and LSU (fungi) with trained classifiers for COI (eukaryotes), rbcL (diatoms or eukaryotes), SSU (diatoms or eukaryotes), and 12S (fish). MetaWorks can process rRNA genes, but it can also properly handle ITS spacers by trimming flanking conserved rRNA gene regions, as well as handle protein coding genes by removing obvious pseudogenes.ConclusionsAs far as we are aware, MetaWorks is the first flexible multi-marker metabarcode pipeline that can accommodate rRNA genes, spacer, and protein coding markers in the same pipeline. This is ideal for large-scale, multi-marker studies to provide a harmonized processing environment, pipeline, and taxonomic assignment approach. Updates to MetaWorks will be made as needed to reflect advances in the underlying programs, reference databases, or hidden Markov model (HMM) profiles for pseudogene filtering. Future developments will include support for additional metabarcode markers, RDP trained reference databases, and HMM profiles for pseudogene filtering.


PeerJ ◽  
2021 ◽  
Vol 9 ◽  
pp. e11865
Author(s):  
Dylan Catlett ◽  
Kevin Son ◽  
Connie Liang

Background High-throughput sequencing of phylogenetically informative marker genes is a widely used method to assess the diversity and composition of microbial communities. Taxonomic assignment of sampled marker gene sequences (referred to as amplicon sequence variants, or ASVs) imparts ecological significance to these genetic data. To assign taxonomy to an ASV, a taxonomic assignment algorithm compares the ASV to a collection of reference sequences (a reference database) with known taxonomic affiliations. However, many taxonomic assignment algorithms and reference databases are available, and the optimal algorithm and database for a particular scientific question is often unclear. Here, we present the ensembleTax R package, which provides an efficient framework for integrating taxonomic assignments predicted with any number of taxonomic assignment algorithms and reference databases to determine ensemble taxonomic assignments for ASVs. Methods The ensembleTax R package relies on two core algorithms: taxmapper and assign.ensembleTax. The taxmapper algorithm maps taxonomic assignments derived from one reference database onto the taxonomic nomenclature (a set of taxonomic naming and ranking conventions) of another reference database. The assign.ensembleTax algorithm computes ensemble taxonomic assignments for each ASV in a data set based on any number of taxonomic assignments determined with independent methods. Various parameters allow analysts to prioritize obtaining either more ASVs with more predicted clade names or more robust clade name predictions supported by multiple independent methods in ensemble taxonomic assignments. Results The ensembleTax R package is used to compute two sets of ensemble taxonomic assignments for a collection of protistan ASVs sampled from the coastal ocean. Comparisons of taxonomic assignments predicted by individual methods with those predicted by ensemble methods show that conservative implementations of the ensembleTax package minimize disagreements between taxonomic assignments predicted by individual and ensemble methods, but result in ASVs with fewer ranks assigned taxonomy. Less conservative implementations of the ensembleTax package result in an increased fraction of ASVs classified at all taxonomic ranks, but increase the number of ASVs for which ensemble assignments disagree with those predicted by individual methods. Discussion We discuss how implementation of the ensembleTax R package may be optimized to address specific scientific objectives based on the results of the application of the ensembleTax package to marine protist communities. While further work is required to evaluate the accuracy of ensemble taxonomic assignments relative to taxonomic assignments predicted by individual methods, we also discuss scenarios where ensemble methods are expected to improve the accuracy of taxonomy prediction for ASVs.


2017 ◽  
Author(s):  
Zhemin Zhou ◽  
Nina Luhmann ◽  
Nabil-Fareed Alikhan ◽  
Christopher Quince ◽  
Mark Achtman

AbstractExploring the genetic diversity of microbes within the environment through metagenomic sequencing first requires classifying these reads into taxonomic groups. Current methods compare these sequencing data with existing biased and limited reference databases. Several recent evaluation studies demonstrate that current methods either lack sufficient sensitivity for species-level assignments or suffer from false positives, overestimating the number of species in the metagenome. Both are especially problematic for the identification of low-abundance microbial species, e. g. detecting pathogens in ancient metagenomic samples. We present a new method, SPARSE, which improves taxonomic assignments of metagenomic reads. SPARSE balances existing biased reference databases by grouping reference genomes into similarity-based hierarchical clusters, implemented as an efficient incremental data structure. SPARSE assigns reads to these clusters using a probabilistic model, which specifically penalizes non-specific mappings of reads from unknown sources and hence reduces false-positive assignments. Our evaluation on simulated datasets from two recent evaluation studies demonstrated the improved precision of SPARSE in comparison to other methods for species-level classification. In a third simulation, our method successfully differentiated multiple co-existing Escherichia coli strains from the same sample. In real archaeological datasets, SPARSE identified ancient pathogens with ≤ 0.02% abundance, consistent with published findings that required additional sequencing data. In these datasets, other methods either missed targeted pathogens or reported non-existent ones. SPARSE and all evaluation scripts are available at https://github.com/zheminzhou/SPARSE.


2019 ◽  
Author(s):  
Andreas Bremges ◽  
Adrian Fritz ◽  
Alice C. McHardy

The number of microbial genome sequences is growing exponentially, also thanks to recent advances in recovering complete or near-complete genomes from metagenomes and single cells. Assigning reliable taxon labels to genomes is key and often a prerequisite for downstream analyses. We introduce CAMITAX, a scalable and reproducible workflow for the taxonomic labelling of microbial genomes recovered from isolates, single cells, and metagenomes. CAMI-TAX combines genome distance-, 16S rRNA gene-, and gene homology-based taxonomic assignments with phylogenetic placement. It uses Nextflow to orchestrate reference databases and software containers, and thus combines ease of installation and use with computational re-producibility. We evaluated the method on several hundred metagenome-assembled genomes with high-quality taxonomic annotations from the TARA Oceans project, and show that the ensemble classification method in CAMITAX improved on all individual methods across tested ranks. While we initially developed CAMITAX to aid the Critical Assessment of Metagenome Interpretation (CAMI) initiative, it evolved into a comprehensive software to reliably assign taxon labels to microbial genomes. CAMITAX is available under the Apache License 2.0 at: https://github.com/CAMI-challenge/CAMITAX


2020 ◽  
Vol 36 (18) ◽  
pp. 4699-4705
Author(s):  
Hamid Bagheri ◽  
Andrew J Severin ◽  
Hridesh Rajan

Abstract Motivation As the cost of sequencing decreases, the amount of data being deposited into public repositories is increasing rapidly. Public databases rely on the user to provide metadata for each submission that is prone to user error. Unfortunately, most public databases, such as non-redundant (NR), rely on user input and do not have methods for identifying errors in the provided metadata, leading to the potential for error propagation. Previous research on a small subset of the NR database analyzed misclassification based on sequence similarity. To the best of our knowledge, the amount of misclassification in the entire database has not been quantified. We propose a heuristic method to detect potentially misclassified taxonomic assignments in the NR database. We applied a curation technique and quality control to find the most probable taxonomic assignment. Our method incorporates provenance and frequency of each annotation from manually and computationally created databases and clustering information at 95% similarity. Results We found more than two million potentially taxonomically misclassified proteins in the NR database. Using simulated data, we show a high precision of 97% and a recall of 87% for detecting taxonomically misclassified proteins. The proposed approach and findings could also be applied to other databases. Availability and implementation Source code, dataset, documentation, Jupyter notebooks and Docker container are available at https://github.com/boalang/nr. Supplementary information Supplementary data are available at Bioinformatics online.


2021 ◽  
Vol 22 (S6) ◽  
Author(s):  
Haoran Ma ◽  
Tin Wee Tan ◽  
Kenneth Hon Kim Ban

Abstract Background Taxonomic assignment is a key step in the identification of human viral pathogens. Current tools for taxonomic assignment from sequencing reads based on alignment or alignment-free k-mer approaches may not perform optimally in cases where the sequences diverge significantly from the reference sequences. Furthermore, many tools may not incorporate the genomic coverage of assigned reads as part of overall likelihood of a correct taxonomic assignment for a sample. Results In this paper, we describe the development of a pipeline that incorporates a multi-task learning model based on convolutional neural network (MT-CNN) and a Bayesian ranking approach to identify and rank the most likely human virus from sequence reads. For taxonomic assignment of reads, the MT-CNN model outperformed Kraken 2, Centrifuge, and Bowtie 2 on reads generated from simulated divergent HIV-1 genomes and was more sensitive in identifying SARS as the closest relation in four RNA sequencing datasets for SARS-CoV-2 virus. For genomic region assignment of assigned reads, the MT-CNN model performed competitively compared with Bowtie 2 and the region assignments were used for estimation of genomic coverage that was incorporated into a naïve Bayesian network together with the proportion of taxonomic assignments to rank the likelihood of candidate human viruses from sequence data. Conclusions We have developed a pipeline that combines a novel MT-CNN model that is able to identify viruses with divergent sequences together with assignment of the genomic region, with a Bayesian approach to ranking of taxonomic assignments by taking into account both the number of assigned reads and genomic coverage. The pipeline is available at GitHub via https://github.com/MaHaoran627/CNN_Virus.


2016 ◽  
Author(s):  
Alexey M. Kozlov ◽  
Jiajie Zhang ◽  
Pelin Yilmaz ◽  
Frank Oliver Glöckner ◽  
Alexandros Stamatakis

AbstractMolecular sequences in public databases are mostly annotated by the submitting authors without further validation. This procedure can generate erroneous taxonomic sequence labels. Mislabeled sequences are hard to identify, and they can induce downstream errors because new sequences are typically annotated using existing ones. Furthermore, taxonomic mislabelings in reference sequence databases can bias metagenetic studies which rely on the taxonomy. Despite significant efforts to improve the quality of taxonomic annotations, the curation rate is low because of the labour-intensive manual curation process.Here, we present SATIVA, a phylogeny-aware method to automatically identify taxonomically mislabeled sequences (“mislabels”) using statistical models of evolution. We use the Evolutionary Placement Algorithm (EPA) to detect and score sequences whose taxonomic annotation is not supported by the underlying phylogenetic signal, and automatically propose a corrected taxonomic classification for those. Using simulated data, we show that our method attains high accuracy for identification (96.9% sensitivity / 91.7% precision) as well as correction (94.9% sensitivity / 89.9% precision) of mislabels. Furthermore, an analysis of four widely used microbial 16S reference databases (Greengenes, LTP, RDP and SILVA) indicates that they currently contain between 0.2% and 2.5% mislabels. Finally, we use SATIVA to perform an in-depth evaluation of alternative taxonomies for Cyanobacteria.SATIVA is freely available at https://github.com/amkozlov/sativa.


2017 ◽  
Author(s):  
Jan-Niklas Macher ◽  
Till-Hendrik Macher ◽  
Florian Leese

Metabarcoding and metagenomic approaches are becoming routine techniques in biodiversity assessment and ecological studies. The assignment of taxonomic information to sequences is challenging, as many reference libraries are lacking information on certain taxonomic groups and can contain erroneous sequences. Combining different reference databases is therefore a promising approach for maximizing taxonomic coverage and reliability of results. This tutorial shows how to use the “BOLD_NCBI_Merger” script to combine sequence data obtained from the National Center for Biotechnology Information (NCBI) GenBank and the Barcode of Life Database (BOLD) and prepare it for taxonomic assignment with the software MEGAN.


Author(s):  
Owen S Wangensteen ◽  
Creu Palacín ◽  
Magdalena Guardiola ◽  
Xavier Turon

We developed a metabarcoding method for biodiversity characterization of structurally complex natural marine hard-bottom communities. Novel primer sets for two different molecular markers: the “Leray fragment” of mitochondrial cytochrome c oxidase, COI, and the V7 region of ribosomal RNA 18S were used to analyse eight different marine shallow benthic communities from two National Parks in Spain (one in the Atlantic Ocean and another in the Mediterranean Sea). Samples were sieved into three size fractions from where DNA was extracted separately. Bayesian clustering was used for delimiting molecular operational taxonomic units (MOTUs) and custom reference databases were constructed for taxonomic assignment. We found unexpectedly high values for MOTU richness, suggesting that these communities host a large amount of yet undescribed eukaryotic biodiversity. Significant gaps are still found in sequence reference databases, which currently prevent the complete taxonomic assignation of the detected sequences. Nevertheless, over 90% (in abundance) of the sequenced reads could be successfully assigned to phylum or lower taxonomical level. This identification rate might be significantly improved in the future, as reference databases are updated. Our results show that marine metabarcoding, currently applied mostly to plankton or sediments, can be adapted to structurally complex hard bottom samples, and emerges as a robust, fast, objective and affordable method for comprehensively characterizing the diversity of marine benthic communities dominated by macroscopic seaweeds and colonial or modular sessile metazoans, allowing for standardized biomonitoring of these ecologically important communities. The new universal primers for COI can potentially be used for biodiversity assessment with high taxonomic resolution in a wide array of marine, terrestrial or freshwater eukaryotic communities.


2020 ◽  
Author(s):  
Evangelos A. Dimopoulos ◽  
Alberto Carmagnini ◽  
Irina M. Velsko ◽  
Christina Warinner ◽  
Greger Larson ◽  
...  

Identification of specific species in metagenomic samples is critical for several key applications, yet many tools available require large computational power and are often prone to false positive identifications. Here we describe High-AccuracY and Scalable Taxonomic Assignment of MetagenomiC data (HAYSTAC), which can estimate the probability that a specific taxon is present in a metagenome. HAYSTAC provides a user-friendly tool to construct databases, based on publicly available genomes, that are used for competitive reads mapping. It then uses a novel Bayesian framework to infer the abundance and statistical support for each species identification, and provide per-read species classification. Unlike other methods, HAYSTAC is specifically designed to efficiently handle both ancient and modern DNA data, as well as incomplete reference databases, making it possible to run highly accurate hypothesis-driven analyses (i.e., assessing the presence of a specific species) on variably sized reference databases while dramatically improving processing speeds. We tested the performance and accuracy of HAYSTAC using simulated Illumina libraries, both with and without ancient DNA damage, and compared the results to other currently available methods (i.e., Kraken2/Braken, MALT/HOPS, and Sigma). HAYSTAC identified fewer false positives than both Kraken2/Braken and MALT in all simulations, and fewer than Sigma in simulations of ancient data. It uses less memory than Kraken2/Braken as well as MALT both during database construction and sample analysis. Lastly, we used HAYSTAC to search for specific pathogens in two published ancient metagenomic datasets, demonstrating how it can be applied to empirical datasets. HAYSTAC is available from https://github.com/antonisdim/HAYSTAC


Sign in / Sign up

Export Citation Format

Share Document