Optimizing taxonomic classification of marker-gene amplicon sequences with QIIME 2’s q2-feature-classifier plugin

Microbiome ◽  
2018 ◽  
Vol 6 (1) ◽  
Author(s):  
Nicholas A. Bokulich ◽  
Benjamin D. Kaehler ◽  
Jai Ram Rideout ◽  
Matthew Dillon ◽  
Evan Bolyen ◽  
...  

Background: Taxonomic classification of marker-gene sequences is an important step in microbiome analysis. Results: We present q2-feature-classifier ( https://github.com/qiime2/q2-feature-classifier ), a QIIME 2 plugin containing several novel machine-learning and alignment-based taxonomy classifiers that meet or exceed the accuracy of existing methods for marker-gene amplicon sequence classification. We evaluated and optimized several commonly used taxonomic classification methods (RDP, BLAST, UCLUST) and several new methods (a scikit-learn naive Bayes machine-learning classifier, and alignment-based taxonomy consensus methods of VSEARCH, BLAST+, and SortMeRNA) for classification of marker-gene amplicon sequence data. Conclusions: Our results illustrate the importance of parameter tuning for optimizing classifier performance, and we make recommendations regarding parameter choices for a range of standard operating conditions. q2-feature-classifier and our evaluation framework, tax-credit, are both free, open-source, BSD-licensed packages available on GitHub.
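The alignment-based "taxonomy consensus" idea mentioned in this abstract can be illustrated with a minimal sketch. This is not the q2-feature-classifier implementation: the function name `consensus_taxonomy`, the semicolon-delimited lineage strings, and the `min_consensus=0.51` value are assumptions chosen for illustration. Given the lineages of the top alignment hits for one query, the sketch keeps extending the assigned lineage rank by rank for as long as a sufficient fraction of all hits agree.

```python
from collections import Counter


def consensus_taxonomy(hit_lineages, min_consensus=0.51):
    """Return the deepest lineage supported by >= min_consensus of all hits.

    hit_lineages: list of semicolon-delimited lineage strings (hypothetical
    format), one per alignment hit for a single query sequence.
    """
    if not hit_lineages:
        return ["Unassigned"]

    # Split each lineage into a tuple of rank labels.
    lineages = [tuple(part.strip() for part in lin.split(";"))
                for lin in hit_lineages]
    total = len(lineages)

    consensus = []
    depth = 0
    while True:
        # Only hits that agree with the consensus so far can vote at this rank.
        votes = Counter(
            lin[depth]
            for lin in lineages
            if len(lin) > depth and list(lin[:depth]) == consensus
        )
        if not votes:
            break
        label, count = votes.most_common(1)[0]
        # Stop at the first rank where agreement falls below the threshold.
        if count / total < min_consensus:
            break
        consensus.append(label)
        depth += 1

    return consensus or ["Unassigned"]


hits = [
    "k__Bacteria; p__Firmicutes; g__Lactobacillus",
    "k__Bacteria; p__Firmicutes; g__Lactobacillus",
    "k__Bacteria; p__Firmicutes; g__Bacillus",
    "k__Bacteria; p__Proteobacteria; g__Escherichia",
]
print(consensus_taxonomy(hits))
```

With these four made-up hits, only two of four agree at the genus level, so the assignment stops at the phylum. Lowering `min_consensus` trades confidence for depth, which is the kind of parameter choice the abstract says should be tuned.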


Author(s):  
Nicholas A Bokulich ◽  
Benjamin D Kaehler ◽  
Jai Ram Rideout ◽  
Matthew Dillon ◽  
Evan Bolyen ◽  
...  

Background. Taxonomic classification of marker-gene sequences is an important step in microbiome analysis. Results. We present q2-feature-classifier (https://github.com/qiime2/q2-feature-classifier), a QIIME 2 plugin containing several novel machine-learning and alignment-based taxonomy classifiers that meet or exceed classification accuracy of existing methods. We evaluated and optimized several commonly used taxonomic classification methods (RDP, BLAST, BLAST+, UCLUST) and several new methods (a scikit-learn naive Bayes machine-learning classifier, and VSEARCH and SortMeRNA alignment-based methods). Conclusions. Our results illustrate the importance of parameter tuning for optimizing classifier performance, and we make explicit recommendations regarding parameter choices for a range of standard operating conditions. q2-feature-classifier and our evaluation framework, tax-credit, are both free, open-source, BSD-licensed packages available on GitHub.


2021 ◽  
Author(s):  
Seth Commichaux ◽  
Kiran Javkar ◽  
Harihara Subrahmaniam Muralidharan ◽  
Padmini Ramachandran ◽  
Andrea Ottesen ◽  
...  

Abstract. Background: Microbial eukaryotes are nearly ubiquitous in microbiomes on Earth and contribute to many integral ecological functions. Metagenomics is a proven tool for studying the microbial diversity, functions, and ecology of microbiomes, but has been underutilized for microeukaryotes due to the computational challenges they present. For taxonomic classification, the use of a eukaryotic marker gene database can improve the computational efficiency, precision and sensitivity. However, state-of-the-art tools which use marker gene databases implement universal thresholds for classification rather than dynamically learning the thresholds from the database structure, impacting the accuracy of the classification process. Results: Here we introduce taxaTarget, a method for the taxonomic classification of microeukaryotes in metagenomic data. Using a database of eukaryotic marker genes and a supervised learning approach for training, we learned the discriminatory power and classification thresholds for each 20 amino acid region of each marker gene in our database. This approach provided improved sensitivity and precision compared to other state-of-the-art approaches, with rapid runtimes and low memory usage. Additionally, taxaTarget was better able to detect the presence of multiple closely related species as well as species with no representative sequences in the database. One of the greatest challenges faced during the development of taxaTarget was the general sparsity of available sequences for microeukaryotes. Several algorithms were implemented, including threshold padding, which effectively handled the missing training data and reduced classification errors. Using taxaTarget on metagenomes from human fecal microbiomes, a broader range of genera were detected, including multiple parasites that the other tested tools missed. Conclusion: Data-driven methods for learning classification thresholds from the structure of an input database can provide granular information about the discriminatory power of the sequences and improve the sensitivity and precision of classification. These methods will help facilitate a more comprehensive analysis of metagenomic data and expand our knowledge about the diverse eukaryotes in microbial communities.


Geoderma ◽  
2003 ◽  
Vol 115 (1-2) ◽  
pp. 31-44 ◽  
Author(s):  
Min Zhang ◽  
Li Ma ◽  
Wenqing Li ◽  
Baocheng Chen ◽  
Jiwen Jia

BMC Genomics ◽  
2011 ◽  
Vol 12 (Suppl 4) ◽  
pp. S11 ◽  
Author(s):  
Anderson R Santos ◽  
Marcos A Santos ◽  
Jan Baumbach ◽  
John A McCulloch ◽  
Guilherme C Oliveira ◽  
...  

Genetics ◽  
2020 ◽  
Vol 217 (2) ◽  
Author(s):  
Verónica Mixão ◽  
Ester Saus ◽  
Teun Boekhout ◽  
Toni Gabaldón

Abstract Candida albicans is the most commonly reported species causing candidiasis. The taxonomic classification of C. albicans and related lineages is controversial, with Candida africana (syn. C. albicans var. africana) and Candida stellatoidea (syn. C. albicans var. stellatoidea) being considered different species or C. albicans varieties depending on the authors. Moreover, recent genomic analyses have suggested a shared hybrid origin of C. albicans and C. africana, but the potential parental lineages remain unidentified. Although the genomes of C. albicans and C. africana have been extensively studied, the genome of C. stellatoidea has not been sequenced so far. In order to get a better understanding of the evolution of the C. albicans clade, and to assess whether C. stellatoidea could represent one of the unknown C. albicans parental lineages, we sequenced C. stellatoidea type strain (CBS 1905). This genome was compared to that of C. albicans and of the closely related lineage C. africana. Our results show that, similarly to C. africana, C. stellatoidea descends from the same hybrid ancestor as other C. albicans strains and that it has undergone a parallel massive loss of heterozygosity.


2021 ◽  
Author(s):  
Rajan Saha Raju ◽  
Abdullah Al Nahid ◽  
Preonath Shuvo ◽  
Rashedul Islam

Abstract. Taxonomic classification of viruses is a multi-class hierarchical classification problem, as taxonomic ranks (e.g., order, family and genus) of viruses are hierarchically structured and have multiple classes in each rank. Classification of biological sequences which are hierarchically structured with multiple classes is challenging. Here we developed a machine learning architecture, VirusTaxo, using a multi-class hierarchical classification by k-mer enrichment. VirusTaxo classifies DNA and RNA viruses to their taxonomic ranks using genome sequence. To assign taxonomic ranks, VirusTaxo extracts k-mers from genome sequence and creates bag-of-k-mers for each class in a rank. VirusTaxo uses a top-down hierarchical classification approach and accurately assigns the order, family and genus of a virus from the genome sequence. The average accuracies of VirusTaxo for DNA viruses are 99% (order), 98% (family) and 95% (genus) and for RNA viruses 97% (order), 96% (family) and 82% (genus). VirusTaxo can be used to detect taxonomy of novel viruses using full length genome or contig sequences. Availability: Online version of VirusTaxo is available at https://omics-lab.com/virustaxo/.
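The general shape of a bag-of-k-mers, top-down classifier can be sketched as follows. This is a simplified illustration and not VirusTaxo's code: the helper names (`kmers`, `build_bags`, `classify`), the toy sequences, and the scoring rule (raw count of shared k-mers instead of the enrichment statistic the abstract refers to) are all assumptions. Each class at each rank gets the set of k-mers seen in its training genomes, and a query descends from order to family to genus by picking the child class whose bag overlaps most with the query.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_bags(training, k):
    """Build one bag of k-mers per class at every rank.

    training: list of (sequence, (order, family, genus)) pairs.
    Bags are keyed by lineage prefix, e.g. ("OrderA",) or ("OrderA", "FamA1").
    """
    bags = defaultdict(set)
    for seq, lineage in training:
        seq_kmers = kmers(seq, k)
        for depth in range(1, len(lineage) + 1):
            bags[tuple(lineage[:depth])] |= seq_kmers
    return bags


def classify(seq, bags, k, n_ranks=3):
    """Descend the hierarchy, choosing the best-overlapping child at each rank."""
    query = kmers(seq, k)
    path = ()
    for _ in range(n_ranks):
        # Children are classes one rank below the current path.
        children = [key for key in bags
                    if len(key) == len(path) + 1 and key[:len(path)] == path]
        scored = [(len(query & bags[child]), child) for child in children]
        if not scored:
            break
        best_score, best_child = max(scored)
        if best_score == 0:
            break  # no shared k-mers: stop at the current rank
        path = best_child
    return path


# Toy training data (made-up sequences and class names).
training = [
    ("ACGTACGTACGT", ("OrderA", "FamA1", "GenA1")),
    ("ACGTTTGACCAG", ("OrderA", "FamA2", "GenA2")),
    ("TTTTGGGGCCCC", ("OrderB", "FamB1", "GenB1")),
]
bags = build_bags(training, k=4)
print(classify("GTACGTAC", bags, k=4))
```

Because the search is top-down, a query is only compared against families within the order it was assigned to, which keeps the number of comparisons small but means an error at a higher rank propagates to the lower ranks.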


2018 ◽  
Vol 13 (1) ◽  
Author(s):  
Gamaliel López-Leal ◽  
Fernanda Cornejo-Granados ◽  
Juan Manuel Hurtado-Ramírez ◽  
Alfredo Mendoza-Vargas ◽  
Adrian Ochoa-Leyva
