A standardized, extensible framework for optimizing classification improves marker-gene taxonomic assignments

10.7287/peerj.preprints.934 ◽

2015 ◽

Cited By ~ 1

Author(s):

Nicholas A Bokulich ◽

Jai Ram Rideout ◽

Evguenia Kopylova ◽

Evan Bolyen ◽

Jessica Patnode ◽

...

Keyword(s):

Marker Gene ◽

Amplicon Sequencing ◽

Evaluation Framework ◽

Taxonomic Resolution ◽

Marker Genes ◽

Classification Methods ◽

New Methods ◽

Taxonomic Assignments ◽

Different Levels

Background: Taxonomic classification of marker-gene (i.e., amplicon) sequences represents an important step for molecular identification of microorganisms. Results: We present three advances in our ability to assign and interpret taxonomic classifications of short marker gene sequences: two new methods for taxonomy assignment, which reduce runtime up to two-fold and achieve high-precision genus-level assignments; an evaluation of classification methods that highlights differences in performance with different marker genes and at different levels of taxonomic resolution; and an extensible framework for evaluating and optimizing new classification methods, which we hope will serve as a model for standardized and reproducible bioinformatics methods evaluations. Conclusions: Our new methods are accessible in QIIME 1.9.0, and our evaluation framework will support ongoing optimization of classification methods to complement rapidly evolving short-amplicon sequencing and bioinformatics technologies. Static versions of all of the analysis notebooks generated with this framework, which contain all code and analysis results, can be viewed at http://bit.ly/srta-012 .
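The abstract above mentions evaluating classifiers at different levels of taxonomic resolution. A minimal sketch of what such a comparison could look like for a mock community of known composition is shown below. It is an illustration only, not the authors' framework: the function names, the semicolon-delimited lineage format, and the choice of precision, recall, and F-measure as metrics are assumptions made for this example.

```python
from typing import Set, Tuple


def truncate(lineage: str, level: int) -> str:
    """Keep only the first `level` ranks of a semicolon-delimited lineage."""
    ranks = [rank.strip() for rank in lineage.split(";")]
    return ";".join(ranks[:level])


def depth(lineage: str) -> int:
    """Number of ranks present in a semicolon-delimited lineage."""
    return len(lineage.split(";"))


def precision_recall_f(
    expected: Set[str], observed: Set[str], level: int
) -> Tuple[float, float, float]:
    """Compare observed and expected taxa after truncating to `level` ranks.

    Lineages that do not reach `level` ranks are ignored at that level
    (assumption: they count as unclassified, not as wrong).
    """
    exp = {truncate(t, level) for t in expected if depth(t) >= level}
    obs = {truncate(t, level) for t in observed if depth(t) >= level}
    true_positives = len(exp & obs)
    precision = true_positives / len(obs) if obs else 0.0
    recall = true_positives / len(exp) if exp else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical mock community and classifier output
expected = {
    "k__Bacteria;p__Firmicutes;g__Lactobacillus",
    "k__Bacteria;p__Proteobacteria;g__Escherichia",
}
observed = {
    "k__Bacteria;p__Firmicutes;g__Lactobacillus",
    "k__Bacteria;p__Proteobacteria;g__Salmonella",
    "k__Bacteria;p__Firmicutes",
}
for level in (2, 3):
    print(level, precision_recall_f(expected, observed, level))
```

Scoring the same assignments at several truncation depths makes it visible when a method is accurate at a shallow rank but loses precision at a deeper one.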


A standardized, extensible framework for optimizing classification improves marker-gene taxonomic assignments

10.7287/peerj.preprints.934v2 ◽

2015 ◽

Cited By ~ 6

Author(s):

Nicholas A Bokulich ◽

Jai Ram Rideout ◽

Evguenia Kopylova ◽

Evan Bolyen ◽

Jessica Patnode ◽

...

Keyword(s):

Marker Gene ◽

Amplicon Sequencing ◽

Evaluation Framework ◽

Taxonomic Resolution ◽

Marker Genes ◽

Classification Methods ◽

New Methods ◽

Taxonomic Assignments ◽

Different Levels

Background: Taxonomic classification of marker-gene (i.e., amplicon) sequences represents an important step for molecular identification of microorganisms. Results: We present three advances in our ability to assign and interpret taxonomic classifications of short marker gene sequences: two new methods for taxonomy assignment, which reduce runtime up to two-fold and achieve high-precision genus-level assignments; an evaluation of classification methods that highlights differences in performance with different marker genes and at different levels of taxonomic resolution; and an extensible framework for evaluating and optimizing new classification methods, which we hope will serve as a model for standardized and reproducible bioinformatics methods evaluations. Conclusions: Our new methods are accessible in QIIME 1.9.0, and our evaluation framework will support ongoing optimization of classification methods to complement rapidly evolving short-amplicon sequencing and bioinformatics technologies. Static versions of all of the analysis notebooks generated with this framework, which contain all code and analysis results, can be viewed at http://bit.ly/srta-012 .


Optimizing taxonomic classification of marker gene amplicon sequences

10.7287/peerj.preprints.3208v2 ◽

2018 ◽

Cited By ~ 4

Author(s):

Nicholas A Bokulich ◽

Benjamin D Kaehler ◽

Jai Ram Rideout ◽

Matthew Dillon ◽

Evan Bolyen ◽

...

Keyword(s):

Machine Learning ◽

Sequence Data ◽

Marker Gene ◽

Parameter Tuning ◽

Operating Conditions ◽

Evaluation Framework ◽

Taxonomic Classification ◽

Consensus Methods ◽

Learning Classifier

Background: Taxonomic classification of marker-gene sequences is an important step in microbiome analysis. Results: We present q2-feature-classifier ( https://github.com/qiime2/q2-feature-classifier ), a QIIME 2 plugin containing several novel machine-learning and alignment-based taxonomy classifiers that meet or exceed the accuracy of existing methods for marker-gene amplicon sequence classification. We evaluated and optimized several commonly used taxonomic classification methods (RDP, BLAST, UCLUST) and several new methods (a scikit-learn naive Bayes machine-learning classifier, and alignment-based taxonomy consensus methods of VSEARCH, BLAST+, and SortMeRNA) for classification of marker-gene amplicon sequence data. Conclusions: Our results illustrate the importance of parameter tuning for optimizing classifier performance, and we make recommendations regarding parameter choices for a range of standard operating conditions. q2-feature-classifier and our evaluation framework, tax-credit, are both free, open-source, BSD-licensed packages available on GitHub.
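The alignment-based consensus methods named in this abstract assign a taxonomy by combining the lineages of several database hits. The sketch below illustrates the general idea under stated assumptions: each hit is already available as a list of rank names, and a rank is accepted only if a minimum fraction of all hits agree on it. The function name and the `min_consensus` parameter are chosen for this example and should not be read as the plugin's actual implementation.

```python
from collections import Counter
from typing import List


def consensus_taxonomy(
    hit_lineages: List[List[str]], min_consensus: float = 0.51
) -> List[str]:
    """Return the deepest lineage prefix supported by enough of the hits.

    A rank is accepted when the most common name at that rank is carried by
    at least `min_consensus` of ALL hits; assignment stops at the first rank
    that fails this test.
    """
    consensus: List[str] = []
    total = len(hit_lineages)
    if total == 0:
        return consensus
    candidates = hit_lineages
    max_depth = max(len(lineage) for lineage in hit_lineages)
    for rank in range(max_depth):
        # Only hits that agree with the consensus so far are still candidates
        counts = Counter(l[rank] for l in candidates if len(l) > rank)
        if not counts:
            break
        name, votes = counts.most_common(1)[0]
        if votes / total < min_consensus:
            break
        consensus.append(name)
        candidates = [l for l in candidates if len(l) > rank and l[rank] == name]
    return consensus


# Hypothetical top hits for one query sequence
hits = [
    ["Bacteria", "Firmicutes", "Lactobacillus"],
    ["Bacteria", "Firmicutes", "Lactobacillus"],
    ["Bacteria", "Firmicutes", "Streptococcus"],
]
print(consensus_taxonomy(hits))
print(consensus_taxonomy(hits, min_consensus=0.9))
```

Raising the agreement threshold trades depth of assignment for confidence, which is one of the parameters such an evaluation would be expected to tune.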


Optimizing taxonomic classification of marker gene sequences

10.7287/peerj.preprints.3208v1 ◽

2017 ◽

Cited By ~ 4

Author(s):

Nicholas A Bokulich ◽

Benjamin D Kaehler ◽

Jai Ram Rideout ◽

Matthew Dillon ◽

Evan Bolyen ◽

...

Keyword(s):

Machine Learning ◽

Marker Gene ◽

Parameter Tuning ◽

Operating Conditions ◽

Evaluation Framework ◽

Taxonomic Classification ◽

Gene Sequences ◽

Learning Classifier ◽

Classifier Performance

Background. Taxonomic classification of marker-gene sequences is an important step in microbiome analysis. Results. We present q2-feature-classifier (https://github.com/qiime2/q2-feature-classifier), a QIIME 2 plugin containing several novel machine-learning and alignment-based taxonomy classifiers that meet or exceed classification accuracy of existing methods. We evaluated and optimized several commonly used taxonomic classification methods (RDP, BLAST, BLAST+, UCLUST) and several new methods (a scikit-learn naive Bayes machine-learning classifier, and VSEARCH and SortMeRNA alignment-based methods). Conclusions. Our results illustrate the importance of parameter tuning for optimizing classifier performance, and we make explicit recommendations regarding parameter choices for a range of standard operating conditions. q2-feature-classifier and our evaluation framework, tax-credit, are both free, open-source, BSD-licensed packages available on GitHub.
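To make the naive Bayes approach mentioned in this abstract concrete, the following is a small standard-library sketch of a multinomial naive Bayes classifier over k-mer counts with a confidence cutoff. It is not the scikit-learn classifier the authors describe; the class name, the values of `k`, `alpha`, and `confidence`, and the toy sequences are all assumptions for illustration.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def kmers(seq: str, k: int) -> List[str]:
    """All overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


class KmerNaiveBayes:
    """Multinomial naive Bayes on k-mer counts with Laplace smoothing."""

    def __init__(self, k: int = 4, alpha: float = 1.0) -> None:
        self.k = k
        self.alpha = alpha
        self.kmer_counts: Dict[str, Counter] = defaultdict(Counter)
        self.class_counts: Counter = Counter()
        self.vocab: Set[str] = set()

    def fit(self, seqs: Iterable[str], labels: Iterable[str]) -> "KmerNaiveBayes":
        for seq, label in zip(seqs, labels):
            self.class_counts[label] += 1
            for km in kmers(seq, self.k):
                self.kmer_counts[label][km] += 1
                self.vocab.add(km)
        return self

    def predict_proba(self, seq: str) -> Dict[str, float]:
        n_train = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        log_post: Dict[str, float] = {}
        for label, n in self.class_counts.items():
            total = sum(self.kmer_counts[label].values())
            lp = math.log(n / n_train)  # class prior
            for km in kmers(seq, self.k):
                count = self.kmer_counts[label][km]  # 0 for unseen k-mers
                lp += math.log((count + self.alpha) / (total + self.alpha * vocab_size))
            log_post[label] = lp
        # Normalize in log space to avoid underflow on long sequences
        top = max(log_post.values())
        norm = sum(math.exp(lp - top) for lp in log_post.values())
        return {label: math.exp(lp - top) / norm for label, lp in log_post.items()}

    def classify(self, seq: str, confidence: float = 0.7) -> Tuple[str, float]:
        """Return (label, posterior), or ("Unassigned", posterior) below the cutoff."""
        probs = self.predict_proba(seq)
        best = max(probs, key=probs.get)
        if probs[best] >= confidence:
            return best, probs[best]
        return "Unassigned", probs[best]


# Toy reference sequences with hypothetical labels
model = KmerNaiveBayes(k=4).fit(["ACGTACGTACGT", "TTTTGGGGTTTT"], ["A", "B"])
print(model.classify("ACGTACGT"))
print(model.classify("CCCCCCCC"))
```

The k-mer length and the confidence cutoff are the kind of parameters whose tuning the abstract identifies as important for classifier performance.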


TaxaTarget: Fast, Sensitive, and Precise Classification of Microeukaryotes in Metagenomic Data

10.21203/rs.3.rs-1186624/v1 ◽

2021 ◽

Author(s):

Seth Commichaux ◽

Kiran Javkar ◽

Harihara Subrahmaniam Muralidharan ◽

Padmini Ramachandran ◽

Andrea Ottesen ◽

...

Keyword(s):

State Of The Art ◽

Marker Gene ◽

Taxonomic Classification ◽

Discriminatory Power ◽

Training Data ◽

Metagenomic Data ◽

Marker Genes ◽

Amino Acid Region ◽

Database Structure

Background: Microbial eukaryotes are nearly ubiquitous in microbiomes on Earth and contribute to many integral ecological functions. Metagenomics is a proven tool for studying the microbial diversity, functions, and ecology of microbiomes, but has been underutilized for microeukaryotes due to the computational challenges they present. For taxonomic classification, the use of a eukaryotic marker gene database can improve the computational efficiency, precision and sensitivity. However, state-of-the-art tools which use marker gene databases implement universal thresholds for classification rather than dynamically learning the thresholds from the database structure, impacting the accuracy of the classification process. Results: Here we introduce taxaTarget, a method for the taxonomic classification of microeukaryotes in metagenomic data. Using a database of eukaryotic marker genes and a supervised learning approach for training, we learned the discriminatory power and classification thresholds for each 20 amino acid region of each marker gene in our database. This approach provided improved sensitivity and precision compared to other state-of-the-art approaches, with rapid runtimes and low memory usage. Additionally, taxaTarget was better able to detect the presence of multiple closely related species as well as species with no representative sequences in the database. One of the greatest challenges faced during the development of taxaTarget was the general sparsity of available sequences for microeukaryotes. Several algorithms were implemented, including threshold padding, which effectively handled the missing training data and reduced classification errors. Using taxaTarget on metagenomes from human fecal microbiomes, a broader range of genera were detected, including multiple parasites that the other tested tools missed. Conclusion: Data-driven methods for learning classification thresholds from the structure of an input database can provide granular information about the discriminatory power of the sequences and improve the sensitivity and precision of classification. These methods will help facilitate a more comprehensive analysis of metagenomic data and expand our knowledge about the diverse eukaryotes in microbial communities.


Characterization of Shallow Whole-Metagenome Shotgun Sequencing as a High-Accuracy and Low-Cost Method by Complicated Mock Microbiomes

Frontiers in Microbiology ◽

10.3389/fmicb.2021.678319 ◽

2021 ◽

Vol 12 ◽

Author(s):

Wenyi Xu ◽

Tianda Chen ◽

Yuwei Pei ◽

Hao Guo ◽

Zhuanyu Li ◽

...

Keyword(s):

Large Scale ◽

Low Cost ◽

Bacterial Species ◽

Amplicon Sequencing ◽

Shotgun Sequencing ◽

Taxonomic Resolution ◽

rRNA Gene ◽

Cost Efficient ◽

Taxonomic Assignments

Characterization of the bacterial composition and functional repertoires of microbiome samples is the most common application of metagenomics. Although deep whole-metagenome shotgun sequencing (WMS) provides high taxonomic resolution, it is generally cost-prohibitive for large longitudinal investigations. Until now, 16S rRNA gene amplicon sequencing (16S) has been the most widely used approach and is usually combined with WMS to achieve cost-efficiency. However, the accuracy of 16S results and their consistency with WMS data have not been fully elucidated, especially using complex microbiomes with defined compositional information. Here, we constructed two complex artificial microbiomes, which comprised more than 60 human gut bacterial species with even or varied abundance. Utilizing real fecal samples and mock communities, we provided solid evidence that 16S results were poorly consistent with WMS data and that their accuracy was not satisfactory. In contrast, shallow whole-metagenome shotgun sequencing (shallow WMS, S-WMS) with a sequencing depth of 1 Gb provided outputs that closely resembled WMS data at both genus and species levels and yielded much more accurate taxonomic assignments and functional predictions than 16S, thereby representing a better and cost-efficient alternative to 16S for large-scale microbiome studies.


Amplicon sequencing of single-copy protein-coding genes reveals accurate diversity for sequence-discrete microbiome populations

10.1101/2021.10.22.465537 ◽

2021 ◽

Author(s):

Chengfeng Yang ◽

Qinzhi Su ◽

Min Tang ◽

Shiqi Luo ◽

Hao Zheng ◽

...

Keyword(s):

Sequence Similarity ◽

Amplicon Sequencing ◽

Single Copy ◽

Small Subunit ◽

Taxonomic Resolution ◽

Ecological Niches ◽

Marker Genes ◽

Protein Coding ◽

Gene Markers ◽

Protein Coding Genes

An in-depth understanding of microbial function and the division of ecological niches requires accurate delineation and identification of microbes at a fine taxonomic resolution. Microbial phylotypes are typically defined using a 97% small subunit (16S) rRNA threshold. However, increasing evidence has demonstrated the ubiquitous presence of taxonomic units of distinct functions within phylotypes. These so-called sequence-discrete populations (SDPs) have mainly been delineated by disjunct sequence similarity at the whole-genome level. However, gene markers that could accurately identify and quantify SDPs are lacking in microbial community studies. Here we developed a pipeline to screen single-copy protein-coding genes that could accurately characterize SDP diversity via amplicon sequencing of microbial communities. Fifteen candidate marker genes were evaluated using three criteria (extent of sequence divergence, phylogenetic accuracy, and conservation of primer regions), and the selected genes were then tested for their efficiency in differentiating SDPs within Gilliamella, a core honeybee gut microbial phylotype, as a proof of concept. The results showed that the 16S V4 region failed to report accurate SDP diversities due to low taxonomic resolution and variable copy numbers. In contrast, the single-copy genes recommended by our pipeline were able to successfully quantify Gilliamella SDPs for both mock samples and honeybee guts, with results highly consistent with those of metagenomics. The pipeline developed in this study is expected to identify single-copy protein-coding genes capable of accurately quantifying diverse bacterial communities at the SDP level.
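The 97% threshold mentioned in this abstract can be illustrated with a short sketch that groups sequences into phylotype-like clusters by pairwise identity. This is a simplified, hypothetical example and not the authors' pipeline: it assumes the sequences are already aligned to equal length, treats every column (including gaps) as a comparable character, and uses a simple greedy centroid scheme.

```python
from typing import List


def identity(a: str, b: str) -> float:
    """Fraction of matching columns between two pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    if not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(seqs: List[str], threshold: float = 0.97) -> List[List[int]]:
    """Assign each sequence to the first cluster whose centroid it matches.

    Each cluster is a list of sequence indices; its first member acts as the
    centroid. A sequence that matches no centroid starts a new cluster.
    """
    clusters: List[List[int]] = []
    for i, seq in enumerate(seqs):
        for cluster in clusters:
            if identity(seqs[cluster[0]], seq) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters


# Toy aligned sequences: 100%, 98%, and 90% identical to the first one
seqs = ["A" * 100, "A" * 98 + "CC", "A" * 90 + "C" * 10]
print(greedy_cluster(seqs, threshold=0.97))
print(greedy_cluster(seqs, threshold=0.99))
```

At the 97% cutoff the first two toy sequences fall into one cluster even though they differ, which mirrors the abstract's point that distinct populations can be hidden inside a single phylotype.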


ASAP 2: a pipeline and web server to analyze marker gene amplicon sequencing data automatically and consistently

BMC Bioinformatics ◽

10.1186/s12859-021-04555-0 ◽

2022 ◽

Vol 23 (1) ◽

Author(s):

Renmao Tian ◽

Behzad Imanian

Keyword(s):

Statistical Tests ◽

Marker Gene ◽

Web Server ◽

Amplicon Sequencing ◽

Marker Genes ◽

Sequence Variant ◽

Complex Data ◽

Sequencing Analysis ◽

Sequencing Data ◽

Link Type

Background: Amplicon sequencing of marker genes such as 16S rDNA has been widely used to survey and characterize microbial communities. However, the complex data analyses have required many manual steps, often leading to inconsistencies in results. Results: Here, we have developed a pipeline, amplicon sequence analysis pipeline 2 (ASAP 2), that automates these processes without the usual manual inspection and user intervention, for instance in the detection of barcode orientation, selection of the high-quality region of reads, and determination of resampling depth. The pipeline integrates all the analytical processes, such as importing data, demultiplexing, summarizing read profiles, quality trimming, denoising, removing chimeric sequences, and making the feature table. The pipeline accepts multiple input formats, including multiplexed or demultiplexed, paired-end or single-end, barcode inside or outside, and raw or intermediate data (e.g., feature table). The outputs include taxonomic classification, alpha/beta diversity, community composition, ordination analysis, and statistical tests. ASAP 2 supports merging multiple sequencing runs, which helps integrate and compare data from different sources (public databases and collaborators). Conclusions: Our pipeline minimizes hands-on intervention and runs amplicon sequence variant (ASV)-based amplicon sequencing analysis automatically and consistently. Our web server assists researchers who have no access to a high-performance computer (HPC) or have limited bioinformatics skills. The pipeline and web server can be accessed at https://github.com/tianrenmaogithub/asap2 and https://hts.iit.edu/asap2, respectively.


Optimizing taxonomic classification of marker gene amplicon sequences

10.7287/peerj.preprints.3208 ◽

2018 ◽

Author(s):

Nicholas A Bokulich ◽

Benjamin D Kaehler ◽

Jai Ram Rideout ◽

Matthew Dillon ◽

Evan Bolyen ◽

...

Keyword(s):

Machine Learning ◽

Sequence Data ◽

Marker Gene ◽

Parameter Tuning ◽

Operating Conditions ◽

Evaluation Framework ◽

Taxonomic Classification ◽

Consensus Methods ◽

Learning Classifier

Background: Taxonomic classification of marker-gene sequences is an important step in microbiome analysis. Results: We present q2-feature-classifier ( https://github.com/qiime2/q2-feature-classifier ), a QIIME 2 plugin containing several novel machine-learning and alignment-based taxonomy classifiers that meet or exceed the accuracy of existing methods for marker-gene amplicon sequence classification. We evaluated and optimized several commonly used taxonomic classification methods (RDP, BLAST, UCLUST) and several new methods (a scikit-learn naive Bayes machine-learning classifier, and alignment-based taxonomy consensus methods of VSEARCH, BLAST+, and SortMeRNA) for classification of marker-gene amplicon sequence data. Conclusions: Our results illustrate the importance of parameter tuning for optimizing classifier performance, and we make recommendations regarding parameter choices for a range of standard operating conditions. q2-feature-classifier and our evaluation framework, tax-credit, are both free, open-source, BSD-licensed packages available on GitHub.


Machine Learning Methods in E-mail Spam Classification

Studia Informatica ◽

10.34739/si.2019.23.04 ◽

2020 ◽

pp. 57-76

Author(s):

Piotr Świtalski ◽

Mateusz Kopówka

Keyword(s):

Machine Learning ◽

Social Media ◽

Classification Problem ◽

The Internet ◽

Classification Methods ◽

Spam Filtering ◽

Machine Learning Methods ◽

New Methods ◽

E-Mail

The increasing number of unwanted e-mails affects users' security on the Internet. Today, spam e-mails can carry potentially malicious messages that can, for example, redirect users to fake sites. Such messages have recently appeared in social media as well. Filtering this content is important for minimizing financial and branding costs. Traditional methods of spam filtering may not be sufficient for present threats, and new methods are required for constructing more dependable and robust antispam filters. Machine learning has recently become a very popular technique in classification and has been used successfully in spam classification. In this paper we present several machine-learning methods for spam detection and introduce ways to solve the spam classification problem. We show that these methods can be useful in classifying malicious messages. We also compare the developed methods and present the results in the experimental section.
