Short k-mer Abundance Profiles Yield Robust Machine Learning Features and Accurate Classifiers for RNA Viruses

Automated Annotation

Abstract

High-throughput sequencing technologies have greatly enabled the study of genomics, transcriptomics and metagenomics. Automated annotation and classification of the vast amounts of generated sequence data have become paramount for facilitating the biological sciences. Genomes of viruses can be radically different from those of all cellular life, both in terms of molecular structure and primary sequence. Alignment-based and profile-based searches are commonly employed to characterize assembled viral contigs from high-throughput sequencing data. Recent attempts have highlighted the use of machine learning models for the task, but these models rely entirely on DNA genomes and, owing to the intrinsic genomic complexity of viruses, RNA viruses have been completely overlooked. Here, we present a novel short k-mer based sequence scoring method that generates robust sequence information for training machine learning classifiers. We trained 18 classifiers for the task of distinguishing viral RNA from human transcripts. We challenged our models with very stringent testing protocols across different species and evaluated performance against BLASTn, BLASTx and HMMER3 searches. For clean sequence data retrieved from curated databases, our models display near-perfect accuracy, outperforming all similar attempts previously reported. On de novo assemblies of raw RNA-Seq data from cells subjected to Ebola virus, the area under the ROC curve varied from 0.6 to 0.86 depending on the software used for assembly. Our classifier was able to properly classify the majority of the false hits generated by BLAST and HMMER3 searches on the same data. The outstanding performance metrics of our model lay the groundwork for robust machine learning methods for the automated annotation of sequence data.

Author Summary

In this age of high-throughput sequencing, proper classification of copious amounts of sequence data remains a daunting challenge. Presently, sequence alignment methods are typically the first tools assigned to the task. Owing to the selection forces of nature, there is considerable homology even between the sequences of different species, which introduces ambiguity into the results of alignment-based searches. Machine learning methods are becoming more reliable for characterizing sequence data, but virus genomes are more variable than those of all other forms of life, and viruses with RNA-based genomes have been overlooked in previous machine learning attempts. We designed a novel short k-mer based scoring criterion whereby a large number of highly robust numerical feature sets can be derived from sequence data. These features were able to accurately distinguish virus RNA from human transcripts with performance scores better than all previous reports. Our models were able to generalize well to distant species of viruses and to mouse transcripts. The model correctly classifies the majority of false hits generated by current standard alignment tools. These findings strongly imply that this k-mer score based computational pipeline forges a highly informative, rich set of numerical machine learning features and that similar pipelines can greatly advance the field of computational biology.
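
The abstract does not spell out how the k-mer scores are computed, so the following is only a minimal sketch of the general idea of a short k-mer abundance profile, not the authors' scoring method. It assumes plain relative k-mer frequencies over the RNA alphabet; the function name `kmer_profile` and the default `k = 3` are hypothetical choices made for illustration.

```python
from collections import Counter
from itertools import product


def kmer_profile(sequence: str, k: int = 3) -> list[float]:
    """Relative frequency of every possible k-mer over ACGU.

    Illustrative assumption: features are plain relative frequencies,
    returned in lexicographic k-mer order. The scoring used in the
    paper may differ.
    """
    # Normalise case and treat DNA-style T as RNA U.
    seq = sequence.upper().replace("T", "U")
    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Fixed vocabulary so every sequence maps to the same feature order.
    vocab = ["".join(p) for p in product("ACGU", repeat=k)]
    # Windows containing ambiguous bases (e.g. N) are not in the
    # vocabulary, so they are excluded from the total.
    total = sum(counts[kmer] for kmer in vocab)
    if total == 0:
        return [0.0] * len(vocab)
    return [counts[kmer] / total for kmer in vocab]


if __name__ == "__main__":
    features = kmer_profile("ACGUACGUGGCA", k=2)
    print(len(features))  # one feature per possible 2-mer
```

Each sequence becomes a fixed-length vector of 4^k values, which is the kind of numerical input that a standard classifier can be trained on regardless of sequence length.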

Comparison of machine learning methods for the classification of cardiovascular disease

Informatics in Medicine Unlocked ◽

10.1016/j.imu.2021.100606 ◽

2021 ◽

pp. 100606

Author(s):

Rachael Hagan ◽

Charles J. Gillan ◽

Fiona Mallett

Keyword(s):

Machine Learning ◽

Cardiovascular Disease ◽

Learning Methods ◽

Machine Learning methods for micro-FTIR imaging classification of human skin tumors

2021 SBFoton International Optics and Photonics Conference (SBFoton IOPC) ◽

10.1109/sbfotoniopc50774.2021.9461969 ◽

2021 ◽

Author(s):

Matheus del Valle ◽

Kleber Stancari ◽

Pedro Arthur Augusto de Castro ◽

Moises Oliveira dos Santos ◽

Denise Maria Zezell

Keyword(s):

Machine Learning ◽

Human Skin ◽

Skin Tumors ◽

Learning Methods ◽

FTIR Imaging ◽

Imaging Classification

Classification of HIV-1 Protease Inhibitors by Machine Learning Methods

ACS Omega ◽

10.1021/acsomega.8b01843 ◽

2018 ◽

Vol 3 (11) ◽

pp. 15837-15849 ◽

Cited By ~ 1

Author(s):

Yang Li ◽

Yujia Tian ◽

Zijian Qin ◽

Aixia Yan

Keyword(s):

Machine Learning ◽

Protease Inhibitors ◽

Learning Methods ◽

HIV-1

Seeing It All: Evaluating Supervised Machine Learning Methods for the Classification of Diverse Otariid Behaviours

PLoS ONE ◽

10.1371/journal.pone.0166898 ◽

2016 ◽

Vol 11 (12) ◽

pp. e0166898 ◽

Cited By ~ 15

Author(s):

Monique A. Ladds ◽

Adam P. Thompson ◽

David J. Slip ◽

David P. Hocking ◽

Robert G. Harcourt

Keyword(s):

Machine Learning ◽

Supervised Machine Learning ◽

Learning Methods ◽

Advances in Intelligent Systems and Computing - Lecture Notes in Computational Intelligence and Decision Making ◽

Binary Classification of Fractal Time Series by Machine Learning Methods

10.1007/978-3-030-26474-1_49 ◽

2019 ◽

pp. 701-711 ◽

Cited By ~ 12

Author(s):

Lyudmyla Kirichenko ◽

Tamara Radivilova ◽

Vitalii Bulakh

Keyword(s):

Machine Learning ◽

Time Series ◽

Binary Classification ◽

Learning Methods ◽

Fractal Time

CLASSIFICATION OF MUNICIPAL ENERGY CONSUMPTION FACILITIES WITH THE USE OF MACHINE LEARNING METHODS

Electromechanical and energy saving systems ◽

10.30929/2072-2052.2020.2.50.43-51 ◽

2020 ◽

Vol 2 (50) ◽

pp. 43-51

Author(s):

A. Perekrest ◽

V. Ogar ◽

O. Vovna ◽

...

Keyword(s):

Machine Learning ◽

Energy Consumption ◽

Learning Methods ◽

Lecture Notes in Electrical Engineering - IAENG Transactions on Engineering Technologies ◽

Classification of Hyperspectral Images Using Machine Learning Methods

10.1007/978-94-007-6818-5_39 ◽

2013 ◽

pp. 555-569

Author(s):

Bolanle Tolulope Abe ◽

Oludayo O. Olugbara ◽

Tshilidzi Marwala

Keyword(s):

Machine Learning ◽

Hyperspectral Images ◽

Learning Methods ◽

Laser-induced breakdown spectroscopy for the classification of wood materials using machine learning methods combined with feature selection

Plasma Science and Technology ◽

10.1088/2058-6272/abf1ac ◽

2021 ◽

Author(s):

Xutai Cui ◽

Qianqian Wang ◽

Kai Wei ◽

Geer Teng ◽

Xiangjun Xu

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Laser Induced Breakdown Spectroscopy ◽

Learning Methods ◽

Breakdown Spectroscopy ◽

Laser Induced Breakdown

Classification of beta‐site amyloid precursor protein cleaving enzyme 1 inhibitors by using machine learning methods

Chemical Biology & Drug Design ◽

10.1111/cbdd.13965 ◽

2021 ◽

Author(s):

Ravi Singh ◽

Ankit Ganeshpurkar ◽

Powsali Ghosh ◽

Ankit Vyankatrao Pokle ◽

Devendra Kumar ◽

...

Keyword(s):

Machine Learning ◽

Amyloid Precursor Protein ◽

Precursor Protein ◽

Learning Methods ◽

Prediction of Compound-Protein Interactions with Machine Learning Methods

Chemoinformatics and Advanced Machine Learning Perspectives ◽

10.4018/978-1-61520-911-8.ch016 ◽

2011 ◽

pp. 304-317

Author(s):

Yoshihiro Yamanishi ◽

Hisashi Kashima

Keyword(s):

Machine Learning ◽

Protein Interactions ◽

Chemical Structure ◽

Genomic Sequence ◽

Sequence Data ◽

Binary Classification ◽

Biological Data ◽

Supervised Machine Learning ◽

Learning Methods ◽

In silico prediction of compound-protein interactions from heterogeneous biological data is critical in the process of drug development. In this chapter the authors review several supervised machine learning methods to predict unknown compound-protein interactions from chemical structure and genomic sequence information simultaneously. The authors review several kernel-based algorithms from two different viewpoints: binary classification and dimension reduction. In the results, they demonstrate the usefulness of the methods on the prediction of drug-target interactions and ligand-protein interactions from chemical structure data and genomic sequence data.