Optimizing taxonomic classification of marker-gene amplicon sequences with QIIME 2’s q2-feature-classifier plugin

Background: Taxonomic classification of marker-gene sequences is an important step in microbiome analysis. Results: We present q2-feature-classifier (https://github.com/qiime2/q2-feature-classifier), a QIIME 2 plugin containing several novel machine-learning and alignment-based taxonomy classifiers that meet or exceed the accuracy of existing methods for marker-gene amplicon sequence classification. We evaluated and optimized several commonly used taxonomic classification methods (RDP, BLAST, UCLUST) and several new methods (a scikit-learn naive Bayes machine-learning classifier, and alignment-based taxonomy consensus methods of VSEARCH, BLAST+, and SortMeRNA) for classification of marker-gene amplicon sequence data. Conclusions: Our results illustrate the importance of parameter tuning for optimizing classifier performance, and we make recommendations regarding parameter choices for a range of standard operating conditions. q2-feature-classifier and our evaluation framework, tax-credit, are both free, open-source, BSD-licensed packages available on GitHub.

Optimizing taxonomic classification of marker gene sequences

10.7287/peerj.preprints.3208v1 ◽ 2017 ◽ Cited By ~ 4

Author(s): Nicholas A Bokulich ◽ Benjamin D Kaehler ◽ Jai Ram Rideout ◽ Matthew Dillon ◽ Evan Bolyen ◽ ...

Keyword(s): Machine Learning ◽ Marker Gene ◽ Parameter Tuning ◽ Operating Conditions ◽ Evaluation Framework ◽ Taxonomic Classification ◽ Gene Sequences ◽ Learning Classifier ◽ Classifier Performance

Background. Taxonomic classification of marker-gene sequences is an important step in microbiome analysis. Results. We present q2-feature-classifier (https://github.com/qiime2/q2-feature-classifier), a QIIME 2 plugin containing several novel machine-learning and alignment-based taxonomy classifiers that meet or exceed the classification accuracy of existing methods. We evaluated and optimized several commonly used taxonomic classification methods (RDP, BLAST, BLAST+, UCLUST) and several new methods (a scikit-learn naive Bayes machine-learning classifier, and VSEARCH and SortMeRNA alignment-based methods). Conclusions. Our results illustrate the importance of parameter tuning for optimizing classifier performance, and we make explicit recommendations regarding parameter choices for a range of standard operating conditions. q2-feature-classifier and our evaluation framework, tax-credit, are both free, open-source, BSD-licensed packages available on GitHub.

TaxaTarget: Fast, Sensitive, and Precise Classification of Microeukaryotes in Metagenomic Data

10.21203/rs.3.rs-1186624/v1 ◽ 2021

Author(s): Seth Commichaux ◽ Kiran Javkar ◽ Harihara Subrahmaniam Muralidharan ◽ Padmini Ramachandran ◽ Andrea Ottesen ◽ ...

Keyword(s): State Of The Art ◽ Marker Gene ◽ Taxonomic Classification ◽ Discriminatory Power ◽ Training Data ◽ Metagenomic Data ◽ Marker Genes ◽ Amino Acid Region ◽ Database Structure

Background: Microbial eukaryotes are nearly ubiquitous in microbiomes on Earth and contribute to many integral ecological functions. Metagenomics is a proven tool for studying the microbial diversity, functions, and ecology of microbiomes, but has been underutilized for microeukaryotes due to the computational challenges they present. For taxonomic classification, the use of a eukaryotic marker gene database can improve the computational efficiency, precision and sensitivity. However, state-of-the-art tools which use marker gene databases implement universal thresholds for classification rather than dynamically learning the thresholds from the database structure, impacting the accuracy of the classification process. Results: Here we introduce taxaTarget, a method for the taxonomic classification of microeukaryotes in metagenomic data. Using a database of eukaryotic marker genes and a supervised learning approach for training, we learned the discriminatory power and classification thresholds for each 20 amino acid region of each marker gene in our database. This approach provided improved sensitivity and precision compared to other state-of-the-art approaches, with rapid runtimes and low memory usage. Additionally, taxaTarget was better able to detect the presence of multiple closely related species as well as species with no representative sequences in the database. One of the greatest challenges faced during the development of taxaTarget was the general sparsity of available sequences for microeukaryotes. Several algorithms were implemented, including threshold padding, which effectively handled the missing training data and reduced classification errors. Using taxaTarget on metagenomes from human fecal microbiomes, a broader range of genera were detected, including multiple parasites that the other tested tools missed. Conclusion: Data-driven methods for learning classification thresholds from the structure of an input database can provide granular information about the discriminatory power of the sequences and improve the sensitivity and precision of classification. These methods will help facilitate a more comprehensive analysis of metagenomic data and expand our knowledge about the diverse eukaryotes in microbial communities.

Optimizing taxonomic classification of marker gene amplicon sequences

10.7287/peerj.preprints.3208 ◽ 2018

Author(s): Nicholas A Bokulich ◽ Benjamin D Kaehler ◽ Jai Ram Rideout ◽ Matthew Dillon ◽ Evan Bolyen ◽ ...

Keyword(s): Machine Learning ◽ Sequence Data ◽ Marker Gene ◽ Parameter Tuning ◽ Operating Conditions ◽ Evaluation Framework ◽ Taxonomic Classification ◽ Consensus Methods ◽ Learning Classifier

Background: Taxonomic classification of marker-gene sequences is an important step in microbiome analysis. Results: We present q2-feature-classifier (https://github.com/qiime2/q2-feature-classifier), a QIIME 2 plugin containing several novel machine-learning and alignment-based taxonomy classifiers that meet or exceed the accuracy of existing methods for marker-gene amplicon sequence classification. We evaluated and optimized several commonly used taxonomic classification methods (RDP, BLAST, UCLUST) and several new methods (a scikit-learn naive Bayes machine-learning classifier, and alignment-based taxonomy consensus methods of VSEARCH, BLAST+, and SortMeRNA) for classification of marker-gene amplicon sequence data. Conclusions: Our results illustrate the importance of parameter tuning for optimizing classifier performance, and we make recommendations regarding parameter choices for a range of standard operating conditions. q2-feature-classifier and our evaluation framework, tax-credit, are both free, open-source, BSD-licensed packages available on GitHub.

Genetic characteristics and taxonomic classification of Fimic Anthrosols in China

Geoderma ◽ 10.1016/s0016-7061(03)00073-9 ◽ 2003 ◽ Vol 115 (1-2) ◽ pp. 31-44 ◽ Cited By ~ 6

Author(s): Min Zhang ◽ Li Ma ◽ Wenqing Li ◽ Baocheng Chen ◽ Jiwen Jia

Keyword(s): Taxonomic Classification ◽ Genetic Characteristics

Taxonomic classification of metagenomic sequences from Relative Abundance Index profiles using deep learning

Biomedical Signal Processing and Control ◽ 10.1016/j.bspc.2021.102539 ◽ 2021 ◽ Vol 67 ◽ pp. 102539

Author(s): Meryem Altın Karagöz ◽ O. Ufuk Nalbantoglu

Keyword(s): Deep Learning ◽ Relative Abundance ◽ Taxonomic Classification ◽ Abundance Index

A singular value decomposition approach for improved taxonomic classification of biological sequences

BMC Genomics ◽ 10.1186/1471-2164-12-s4-s11 ◽ 2011 ◽ Vol 12 (Suppl 4) ◽ pp. S11 ◽ Cited By ~ 2

Author(s): Anderson R Santos ◽ Marcos A Santos ◽ Jan Baumbach ◽ John A McCulloch ◽ Guilherme C Oliveira ◽ ...

Keyword(s): Singular Value Decomposition ◽ Singular Value ◽ Taxonomic Classification ◽ Biological Sequences ◽ Decomposition Approach ◽ Value Decomposition

Extreme diversification driven by parallel events of massive loss of heterozygosity in the hybrid lineage of Candida albicans

Genetics ◽ 10.1093/genetics/iyaa004 ◽ 2020 ◽ Vol 217 (2) ◽ Cited By ~ 1

Author(s): Verónica Mixão ◽ Ester Saus ◽ Teun Boekhout ◽ Toni Gabaldón

Keyword(s): Candida Albicans ◽ Loss Of Heterozygosity ◽ Type Strain ◽ Taxonomic Classification ◽ Hybrid Origin ◽ Genomic Analyses ◽ Candida Africana

Candida albicans is the most commonly reported species causing candidiasis. The taxonomic classification of C. albicans and related lineages is controversial, with Candida africana (syn. C. albicans var. africana) and Candida stellatoidea (syn. C. albicans var. stellatoidea) being considered different species or C. albicans varieties depending on the authors. Moreover, recent genomic analyses have suggested a shared hybrid origin of C. albicans and C. africana, but the potential parental lineages remain unidentified. Although the genomes of C. albicans and C. africana have been extensively studied, the genome of C. stellatoidea has not been sequenced so far. In order to get a better understanding of the evolution of the C. albicans clade, and to assess whether C. stellatoidea could represent one of the unknown C. albicans parental lineages, we sequenced C. stellatoidea type strain (CBS 1905). This genome was compared to that of C. albicans and of the closely related lineage C. africana. Our results show that, similarly to C. africana, C. stellatoidea descends from the same hybrid ancestor as other C. albicans strains and that it has undergone a parallel massive loss of heterozygosity.

VirusTaxo: Taxonomic classification of virus genome using multi-class hierarchical classification by k-mer enrichment

10.1101/2021.04.29.442004 ◽ 2021

Author(s): Rajan Saha Raju ◽ Abdullah Al Nahid ◽ Preonath Shuvo ◽ Rashedul Islam

Keyword(s): Genome Sequence ◽ RNA Viruses ◽ Hierarchical Classification ◽ Classification Problem ◽ Virus Genome ◽ Taxonomic Classification ◽ DNA Viruses ◽ DNA And RNA ◽ Full Length Genome

Taxonomic classification of viruses is a multi-class hierarchical classification problem, as taxonomic ranks (e.g., order, family and genus) of viruses are hierarchically structured and have multiple classes in each rank. Classification of biological sequences which are hierarchically structured with multiple classes is challenging. Here we developed a machine learning architecture, VirusTaxo, using a multi-class hierarchical classification by k-mer enrichment. VirusTaxo classifies DNA and RNA viruses to their taxonomic ranks using genome sequence. To assign taxonomic ranks, VirusTaxo extracts k-mers from genome sequence and creates bag-of-k-mers for each class in a rank. VirusTaxo uses a top-down hierarchical classification approach and accurately assigns the order, family and genus of a virus from the genome sequence. The average accuracies of VirusTaxo for DNA viruses are 99% (order), 98% (family) and 95% (genus) and for RNA viruses 97% (order), 96% (family) and 82% (genus). VirusTaxo can be used to detect taxonomy of novel viruses using full length genome or contig sequences. Availability: Online version of VirusTaxo is available at https://omics-lab.com/virustaxo/.
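The general idea in this abstract (a bag of k-mers per class, applied top-down through the ranks) can be illustrated with a small Python sketch. This is not the VirusTaxo implementation: the abstract does not define its k-mer enrichment score, so the sketch assumes a plain count of shared k-mers, and the function names, the toy sequences and the lineage labels are all hypothetical.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return the set of distinct k-mers in a sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def train_hierarchy(records, k):
    """Build one bag of k-mers per class, grouped by parent lineage.

    records: iterable of (sequence, lineage), where lineage is a tuple
    such as (order, family, genus). models[()] holds the bags used to
    choose an order, models[(order,)] the bags used to choose a family
    within that order, and so on.
    """
    models = defaultdict(lambda: defaultdict(set))
    for seq, lineage in records:
        bag = kmers(seq, k)
        for depth, taxon in enumerate(lineage):
            models[lineage[:depth]][taxon] |= bag
    return models


def classify_top_down(seq, models, k):
    """Assign ranks from the top down, stopping when nothing matches.

    Assumption: the score is the number of query k-mers found in a
    class bag, a stand-in for the enrichment score in the paper.
    """
    query = kmers(seq, k)
    path = ()
    while query and path in models:
        bags = models[path]
        best = max(bags, key=lambda taxon: len(query & bags[taxon]))
        if not query & bags[best]:
            break  # no shared k-mers at this rank: leave it unassigned
        path += (best,)
    return path


# Hypothetical toy training set: (sequence, (order, family, genus)).
records = [
    ("AAAAACCCCC", ("O1", "F1", "G1")),
    ("AAAAAGGGGG", ("O1", "F2", "G2")),
    ("TTTTTCGCGC", ("O2", "F3", "G3")),
]
models = train_hierarchy(records, k=4)
print(classify_top_down("AAAACCCC", models, k=4))
```

Restricting each decision to the children of the rank already chosen is what makes the approach top-down: a family is only compared against other families in the selected order, so an error at a higher rank propagates downward but the number of candidate classes at each step stays small.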

Functional and taxonomic classification of a greenhouse water drain metagenome

Standards in Genomic Sciences ◽ 10.1186/s40793-018-0326-y ◽ 2018 ◽ Vol 13 (1) ◽ Cited By ~ 1

Author(s): Gamaliel López-Leal ◽ Fernanda Cornejo-Granados ◽ Juan Manuel Hurtado-Ramírez ◽ Alfredo Mendoza-Vargas ◽ Adrian Ochoa-Leyva

Keyword(s): Taxonomic Classification ◽ Water Drain
