Optimizing taxonomic classification of marker gene sequences

Optimizing taxonomic classification of marker gene amplicon sequences

10.7287/peerj.preprints.3208v2 ◽

2018 ◽

Cited By ~ 4

Author(s):

Nicholas A Bokulich ◽

Benjamin D Kaehler ◽

Jai Ram Rideout ◽

Matthew Dillon ◽

Evan Bolyen ◽

...

Keyword(s):

Machine Learning ◽

Sequence Data ◽

Marker Gene ◽

Parameter Tuning ◽

Operating Conditions ◽

Evaluation Framework ◽

Taxonomic Classification ◽

Consensus Methods ◽

Learning Classifier

Background: Taxonomic classification of marker-gene sequences is an important step in microbiome analysis. Results: We present q2-feature-classifier ( https://github.com/qiime2/q2-feature-classifier ), a QIIME 2 plugin containing several novel machine-learning and alignment-based taxonomy classifiers that meet or exceed the accuracy of existing methods for marker-gene amplicon sequence classification. We evaluated and optimized several commonly used taxonomic classification methods (RDP, BLAST, UCLUST) and several new methods (a scikit-learn naive Bayes machine-learning classifier, and alignment-based taxonomy consensus methods of VSEARCH, BLAST+, and SortMeRNA) for classification of marker-gene amplicon sequence data. Conclusions: Our results illustrate the importance of parameter tuning for optimizing classifier performance, and we make recommendations regarding parameter choices for a range of standard operating conditions. q2-feature-classifier and our evaluation framework, tax-credit, are both free, open-source, BSD-licensed packages available on GitHub.
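As a rough illustration of the kind of naive Bayes classifier the abstract mentions, the sketch below implements a toy multinomial naive Bayes over k-mer counts in plain Python. It is not the q2-feature-classifier implementation, which builds on scikit-learn. The class name `KmerNaiveBayes`, the k-mer length, the smoothing constant, and the confidence cutoff are all assumptions chosen for illustration; they stand in for the kind of parameters that the abstract says benefit from tuning.

```python
import math
from collections import Counter, defaultdict


def kmers(seq, k=4):
    """Return all overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


class KmerNaiveBayes:
    """Toy multinomial naive Bayes over k-mer counts (hypothetical, illustrative only)."""

    def __init__(self, k=4, alpha=1.0):
        self.k = k          # k-mer length (assumed value, a tunable parameter)
        self.alpha = alpha  # Laplace smoothing constant (assumed value)

    def fit(self, seqs, labels):
        self.counts = defaultdict(Counter)  # per-taxon k-mer counts
        self.class_totals = Counter()       # total k-mers seen per taxon
        self.priors = Counter(labels)       # number of training sequences per taxon
        self.vocab = set()
        for seq, label in zip(seqs, labels):
            for km in kmers(seq, self.k):
                self.counts[label][km] += 1
                self.class_totals[label] += 1
                self.vocab.add(km)
        return self

    def predict_proba(self, seq):
        """Return normalized posterior probabilities for each taxon."""
        n = sum(self.priors.values())
        v = len(self.vocab)
        log_post = {}
        for label in self.priors:
            lp = math.log(self.priors[label] / n)
            denom = self.class_totals[label] + self.alpha * v
            for km in kmers(seq, self.k):
                lp += math.log((self.counts[label][km] + self.alpha) / denom)
            log_post[label] = lp
        # Normalize in log space to avoid underflow on long sequences.
        m = max(log_post.values())
        z = sum(math.exp(x - m) for x in log_post.values())
        return {c: math.exp(x - m) / z for c, x in log_post.items()}

    def classify(self, seq, confidence=0.7):
        """Return the best taxon, or "Unassigned" below the confidence cutoff."""
        probs = self.predict_proba(seq)
        best = max(probs, key=probs.get)
        return best if probs[best] >= confidence else "Unassigned"


# Made-up training sequences and labels, for demonstration only.
train_seqs = ["ACGTACGTACGTACGT", "ACGTACGTACGAACGT",
              "TTGGCCAATTGGCCAA", "TTGGCCAATTGGCCTA"]
train_labels = ["GenusA", "GenusA", "GenusB", "GenusB"]
model = KmerNaiveBayes(k=4).fit(train_seqs, train_labels)
print(model.classify("ACGTACGTACGT"))
```

Raising the confidence cutoff trades recall for precision: more queries fall back to "Unassigned" rather than receiving a possibly wrong label.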


Optimizing taxonomic classification of marker gene amplicon sequences

10.7287/peerj.preprints.3208 ◽

2018 ◽

Author(s):

Nicholas A Bokulich ◽

Benjamin D Kaehler ◽

Jai Ram Rideout ◽

Matthew Dillon ◽

Evan Bolyen ◽

...

Keyword(s):

Machine Learning ◽

Sequence Data ◽

Marker Gene ◽

Parameter Tuning ◽

Operating Conditions ◽

Evaluation Framework ◽

Taxonomic Classification ◽

Consensus Methods ◽

Learning Classifier

Background: Taxonomic classification of marker-gene sequences is an important step in microbiome analysis. Results: We present q2-feature-classifier ( https://github.com/qiime2/q2-feature-classifier ), a QIIME 2 plugin containing several novel machine-learning and alignment-based taxonomy classifiers that meet or exceed the accuracy of existing methods for marker-gene amplicon sequence classification. We evaluated and optimized several commonly used taxonomic classification methods (RDP, BLAST, UCLUST) and several new methods (a scikit-learn naive Bayes machine-learning classifier, and alignment-based taxonomy consensus methods of VSEARCH, BLAST+, and SortMeRNA) for classification of marker-gene amplicon sequence data. Conclusions: Our results illustrate the importance of parameter tuning for optimizing classifier performance, and we make recommendations regarding parameter choices for a range of standard operating conditions. q2-feature-classifier and our evaluation framework, tax-credit, are both free, open-source, BSD-licensed packages available on GitHub.


gammaBOriS: Identification and Taxonomic Classification of Origins of Replication in Gammaproteobacteria using Motif-based Machine Learning

Scientific Reports ◽

10.1038/s41598-020-63424-7 ◽

2020 ◽

Vol 10 (1) ◽

Author(s):

Theodor Sperlea ◽

Lea Muth ◽

Roman Martin ◽

Christoph Weigel ◽

Torsten Waldminghaus ◽

...

Keyword(s):

Machine Learning ◽

Taxonomic Classification ◽

Origins Of Replication


Accurate classification of pediatric colonic IBD subtype using a random forest machine learning classifier

Journal of Pediatric Gastroenterology and Nutrition ◽

10.1097/mpg.0000000000002956 ◽

2020 ◽

Vol Publish Ahead of Print ◽

Author(s):

Jasbir Dhaliwal ◽

Lauren Erdman ◽

Erik Drysdal ◽

Firas Rinawi ◽

Jennifer Muir ◽

...

Keyword(s):

Machine Learning ◽

Random Forest ◽

Learning Classifier


Classification of EEG Signals Using a Genetic-Based Machine Learning Classifier

2007 29th Annual International Conference of the IEEE Engineering in Medicine and Biology Society ◽

10.1109/iembs.2007.4352990 ◽

2007 ◽

Cited By ~ 12

Author(s):

B. T. Skinner ◽

H. T. Nguyen ◽

D. K. Liu

Keyword(s):

Machine Learning ◽

Eeg Signals ◽

Learning Classifier


Analysis Of Machine Learning Classifier Performance In Adding Custom Gestures To The Leap Motion

10.15368/theses.2016.132 ◽

2016 ◽

Cited By ~ 1

Author(s):

Eric Yun

Keyword(s):

Machine Learning ◽

Leap Motion ◽

Learning Classifier ◽

Classifier Performance


A standardized, extensible framework for optimizing classification improves marker-gene taxonomic assignments

10.7287/peerj.preprints.934v1 ◽

2015 ◽

Cited By ~ 1

Author(s):

Nicholas A Bokulich ◽

Jai Ram Rideout ◽

Evguenia Kopylova ◽

Evan Bolyen ◽

Jessica Patnode ◽

...

Keyword(s):

Marker Gene ◽

Amplicon Sequencing ◽

Evaluation Framework ◽

Taxonomic Resolution ◽

Marker Genes ◽

Classification Methods ◽

New Methods ◽

Taxonomic Assignments ◽

Different Levels

Background: Taxonomic classification of marker-gene (i.e., amplicon) sequences represents an important step for molecular identification of microorganisms. Results: We present three advances in our ability to assign and interpret taxonomic classifications of short marker gene sequences: two new methods for taxonomy assignment, which reduce runtime up to two-fold and achieve high precision genus-level assignments; an evaluation of classification methods that highlights differences in performance with different marker genes and at different levels of taxonomic resolution; and an extensible framework for evaluating and optimizing new classification methods, which we hope will serve as a model for standardized and reproducible bioinformatics methods evaluations. Conclusions: Our new methods are accessible in QIIME 1.9.0, and our evaluation framework will support ongoing optimization of classification methods to complement rapidly evolving short-amplicon sequencing and bioinformatics technologies. Static versions of all of the analysis notebooks generated with this framework, which contain all code and analysis results, can be viewed at http://bit.ly/srta-010.


Machine learning using intrinsic genomic signatures for rapid classification of novel pathogens: COVID-19 case study

10.1101/2020.02.03.932350 ◽

2020 ◽

Cited By ~ 10

Author(s):

Gurjit S. Randhawa ◽

Maximillian P.M. Soltysiak ◽

Hadi El Roz ◽

Camila P.E. de Souza ◽

Kathleen A. Hill ◽

...

Keyword(s):

Machine Learning ◽

Death Rate ◽

Genomic Sequence ◽

Sequence Data ◽

Rank Correlation ◽

Taxonomic Classification ◽

Supervised Machine Learning ◽

Biological Knowledge ◽

Alignment Free

As of February 20, 2020, the 2019 novel coronavirus (renamed to COVID-19) spread to 30 countries with 2130 deaths and more than 75500 confirmed cases. COVID-19 is being compared to the infamous SARS coronavirus, which resulted, between November 2002 and July 2003, in 8098 confirmed cases worldwide with a 9.6% death rate and 774 deaths. Though COVID-19 has a death rate of 2.8% as of 20 February, the 75752 confirmed cases in a few weeks (December 8, 2019 to February 20, 2020) are alarming, with cases likely being under-reported given the comparatively longer incubation period. Such outbreaks demand elucidation of taxonomic classification and origin of the virus genomic sequence, for strategic planning, containment, and treatment. This paper identifies an intrinsic COVID-19 genomic signature and uses it together with a machine learning-based alignment-free approach for an ultra-fast, scalable, and highly accurate classification of whole COVID-19 genomes. The proposed method combines supervised machine learning with digital signal processing for genome analyses, augmented by a decision tree approach to the machine learning component, and a Spearman’s rank correlation coefficient analysis for result validation. These tools are used to analyze a large dataset of over 5000 unique viral genomic sequences, totalling 61.8 million bp. Our results support a hypothesis of a bat origin and classify COVID-19 as Sarbecovirus, within Betacoronavirus. Our method achieves high levels of classification accuracy and discovers the most relevant relationships among over 5,000 viral genomes within a few minutes, ab initio, using raw DNA sequence data alone, and without any specialized biological knowledge, training, gene or genome annotations. This suggests that, for novel viral and pathogen genome sequences, this alignment-free whole-genome machine-learning approach can provide a reliable real-time option for taxonomic classification.
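To make the idea of an alignment-free comparison validated by Spearman's rank correlation concrete, here is a minimal sketch in plain Python. It is not the authors' pipeline: the k-mer frequency profile is a simple stand-in for the digital-signal-processing representation the abstract describes, and the function names `kmer_profile`, `average_ranks`, and `spearman`, along with the example sequences, are hypothetical.

```python
from collections import Counter
from itertools import product


def kmer_profile(seq, k=3):
    """Count every possible DNA k-mer, in a fixed lexicographic order."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return [counts["".join(p)] for p in product("ACGT", repeat=k)]


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the rank vectors.

    Undefined (division by zero) if either input is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


# Made-up sequences, for demonstration only.
genome_a = "ATGCGATACGCTTGAGGCTAATCGGATCCGATAGCTAGG"
genome_b = "ATGCGATACGCTTGAGGCTTATCGGATCCGTTAGCTAGG"
rho = spearman(kmer_profile(genome_a), kmer_profile(genome_b))
print(round(rho, 3))
```

Because the comparison works on k-mer counts rather than aligned positions, it needs no alignment step, which is the property the abstract emphasizes for speed.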


TaxaTarget: Fast, Sensitive, and Precise Classification of Microeukaryotes in Metagenomic Data

10.21203/rs.3.rs-1186624/v1 ◽

2021 ◽

Author(s):

Seth Commichaux ◽

Kiran Javkar ◽

Harihara Subrahmaniam Muralidharan ◽

Padmini Ramachandran ◽

Andrea Ottesen ◽

...

Keyword(s):

State Of The Art ◽

Marker Gene ◽

Taxonomic Classification ◽

Discriminatory Power ◽

Training Data ◽

Metagenomic Data ◽

Marker Genes ◽

Amino Acid Region ◽

Database Structure

Background: Microbial eukaryotes are nearly ubiquitous in microbiomes on Earth and contribute to many integral ecological functions. Metagenomics is a proven tool for studying the microbial diversity, functions, and ecology of microbiomes, but has been underutilized for microeukaryotes due to the computational challenges they present. For taxonomic classification, the use of a eukaryotic marker gene database can improve the computational efficiency, precision and sensitivity. However, state-of-the-art tools which use marker gene databases implement universal thresholds for classification rather than dynamically learning the thresholds from the database structure, impacting the accuracy of the classification process. Results: Here we introduce taxaTarget, a method for the taxonomic classification of microeukaryotes in metagenomic data. Using a database of eukaryotic marker genes and a supervised learning approach for training, we learned the discriminatory power and classification thresholds for each 20 amino acid region of each marker gene in our database. This approach provided improved sensitivity and precision compared to other state-of-the-art approaches, with rapid runtimes and low memory usage. Additionally, taxaTarget was better able to detect the presence of multiple closely related species as well as species with no representative sequences in the database. One of the greatest challenges faced during the development of taxaTarget was the general sparsity of available sequences for microeukaryotes. Several algorithms were implemented, including threshold padding, which effectively handled the missing training data and reduced classification errors. Using taxaTarget on metagenomes from human fecal microbiomes, a broader range of genera were detected, including multiple parasites that the other tested tools missed. Conclusion: Data-driven methods for learning classification thresholds from the structure of an input database can provide granular information about the discriminatory power of the sequences and improve the sensitivity and precision of classification. These methods will help facilitate a more comprehensive analysis of metagenomic data and expand our knowledge about the diverse eukaryotes in microbial communities.


Machine-Learning Classifiers for Malware Detection Using Data Features

Journal of ICT Research and Applications ◽

10.5614/itbj.ict.res.appl.2021.15.3.5 ◽

2021 ◽

Vol 15 (3) ◽

pp. 265-290

Author(s):

Saleh Abdulaziz Habtor ◽

Ahmed Haidarah Hasan Dahah

Keyword(s):

Machine Learning ◽

Decision Making ◽

Data Processing ◽

Evaluation Framework ◽

Artificial Learning ◽

Processing Module ◽

N Gram ◽

Using Data ◽

Complete Process

The spread of ransomware has risen exponentially over the past decade, causing huge financial damage to multiple organizations. Various anti-ransomware firms have suggested methods for preventing malware threats. The growing pace, scale and sophistication of malware provide the anti-malware industry with more challenges. Recent literature indicates that academics and anti-virus organizations have begun to use artificial learning as well as fundamental modeling techniques for the research and identification of malware. Orthodox signature-based anti-virus programs struggle to identify unfamiliar malware and track new forms of malware. In this study, a malware evaluation framework focused on machine learning was adopted that consists of several modules: dataset compiling in two separate classes (malicious and benign software), file disassembly, data processing, decision making, and updated malware identification. The data processing module uses grey images, functions for importing and Opcode n-gram to remove malware functionality. The decision making module detects malware and recognizes suspected malware. Different classifiers were considered in the research methodology for the detection and classification of malware. Its effectiveness was validated on the basis of the accuracy of the complete process.
