Predicting Antimicrobial Resistance Using Conserved Genes

Mapping Intimacies ◽

10.1101/2020.04.29.068254 ◽

2020 ◽

Cited By ~ 1

Author(s):

Marcus Nguyen ◽

Robert Olson ◽

Maulik Shukla ◽

Margo VanOeffelen ◽

James J. Davis

Keyword(s):

Machine Learning ◽

Antimicrobial Resistance ◽

Sequence Data ◽

Core Gene ◽

Error Rates ◽

Machine Learning Algorithms ◽

Major Error ◽

Building Models ◽

Conserved Genes ◽

Core Genes

Abstract: A growing number of studies have shown that machine learning algorithms can be used to accurately predict antimicrobial resistance (AMR) phenotypes from bacterial sequence data. In these studies, models are typically trained using input features derived from comprehensive sets of known AMR genes or whole genome sequences. However, it can be difficult to determine whether genomes and their corresponding sets of AMR genes are complete when sequencing contaminated or metagenomic samples. In this study, we explore the possibility of using incomplete genome sequence data to predict AMR phenotypes. Machine learning models were built from randomly selected sets of core genes that are held in common among the members of a species, and the AMR-conferring genes were removed based on their protein annotations. For Klebsiella pneumoniae, Mycobacterium tuberculosis, Salmonella enterica, and Staphylococcus aureus, we report that it is possible to classify susceptible and resistant phenotypes with average F1 scores ranging from 0.80-0.89 with as few as 100 conserved non-AMR genes, with very major error rates ranging from 0.11-0.23 and major error rates ranging from 0.10-0.20. Models built from core genes have predictive power in the cases where the primary AMR mechanism results from SNPs or horizontal gene transfer. By randomly sampling non-overlapping sets of core genes for use in these models, we show that F1 scores and error rates are stable and have little variance between replicates. Potential biases from strain-specific SNPs, phylogenetic sampling, and imbalances in the phylogenetic distribution of susceptible and resistant strains do not appear to have an impact on this result. Although these small core gene models have lower accuracies and higher error rates than models built from the corresponding assembled genomes, the results suggest that sufficient variation exists in the core non-AMR genes of a species for predicting AMR phenotypes. Overall, this study suggests that building models from conserved genes may be a potentially useful strategy for predicting AMR phenotypes when genomes are incomplete.
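
The metrics quoted in this abstract can be illustrated with a small sketch. It assumes the conventional definitions: a very major error is a resistant isolate predicted susceptible, and a major error is a susceptible isolate predicted resistant. The function name, the "R"/"S" label encoding, and the choice to compute F1 for the resistant class only are illustrative assumptions, not details taken from the paper.

```python
def amr_error_rates(true_labels, predicted_labels):
    """Return F1 (resistant class), very major error rate, and major error rate.

    Labels are assumed to be "R" (resistant) or "S" (susceptible).
    """
    tp = fn = tn = fp = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == "R" and pred == "R":
            tp += 1  # resistant, correctly called resistant
        elif truth == "R" and pred == "S":
            fn += 1  # very major error: resistant called susceptible
        elif truth == "S" and pred == "S":
            tn += 1  # susceptible, correctly called susceptible
        elif truth == "S" and pred == "R":
            fp += 1  # major error: susceptible called resistant

    very_major = fn / (tp + fn) if (tp + fn) else 0.0
    major = fp / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"f1": f1, "very_major_error": very_major, "major_error": major}


# Hypothetical toy data: 4 resistant and 6 susceptible isolates.
truth = ["R", "R", "R", "R", "S", "S", "S", "S", "S", "S"]
preds = ["R", "R", "R", "S", "S", "S", "S", "S", "S", "R"]
rates = amr_error_rates(truth, preds)
```

With these definitions, the very major error rate is one minus the recall on resistant isolates, which is why it is usually the more clinically sensitive of the two error rates.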


Amino Acid k-mer Feature Extraction for Quantitative Antimicrobial Resistance (AMR) Prediction by Machine Learning and Model Interpretation for Biological Insights

Biology ◽

10.3390/biology9110365 ◽

2020 ◽

Vol 9 (11) ◽

pp. 365

Author(s):

Taha ValizadehAslani ◽

Zhengqiao Zhao ◽

Bahrad A. Sokhansanj ◽

Gail L. Rosen

Keyword(s):

Machine Learning ◽

Feature Extraction ◽

Amino Acid ◽

Computational Complexity ◽

Antimicrobial Resistance ◽

Learning Algorithms ◽

Extraction Methods ◽

Machine Learning Algorithms ◽

Model Interpretation ◽

New Feature

Machine learning algorithms can learn mechanisms of antimicrobial resistance from DNA sequence data without any a priori information. Interpreting a trained machine learning algorithm can be exploited for validating the model and obtaining new information about resistance mechanisms. Different feature extraction methods, such as SNP calling and counting nucleotide k-mers, have been proposed for presenting DNA sequences to the model. However, there are trade-offs between interpretability, computational complexity and accuracy for different feature extraction methods. In this study, we have proposed a new feature extraction method, counting amino acid k-mers or oligopeptides, which provides easier model interpretation compared to counting nucleotide k-mers and reaches the same or even better accuracy in comparison with different methods. Additionally, we have trained machine learning algorithms using different feature extraction methods and compared the results in terms of accuracy, model interpretability and computational complexity. We have built a new feature selection pipeline for extraction of important features so that new AMR determinants can be discovered by analyzing these features. This pipeline allows the construction of models that only use a small number of features and can predict resistance accurately.
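
As a rough illustration of the feature extraction idea described here, the sketch below counts overlapping amino acid k-mers (oligopeptides) in a protein sequence. The function name, the default k of 3, and the toy sequence are assumptions made for the example; the paper's actual pipeline, including translation from DNA and feature selection, is not reproduced.

```python
from collections import Counter


def count_aa_kmers(protein, k=3):
    """Count overlapping amino acid k-mers in a single protein sequence.

    Returns a Counter mapping each k-mer to its number of occurrences.
    Sequences shorter than k yield an empty Counter.
    """
    protein = protein.upper()
    return Counter(protein[i:i + k] for i in range(len(protein) - k + 1))


# Hypothetical toy peptide; real inputs would be predicted protein sequences.
kmer_counts = count_aa_kmers("MKVLMKV", k=3)
```

Each distinct k-mer becomes one feature column, so a k-mer that a model weights heavily can be traced back to a specific peptide, which is the interpretability advantage the abstract describes.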


Review on the Application of Machine Learning Algorithms in the Sequence Data Mining of DNA

Frontiers in Bioengineering and Biotechnology ◽

10.3389/fbioe.2020.01032 ◽

2020 ◽

Vol 8 ◽

Author(s):

Aimin Yang ◽

Wei Zhang ◽

Jiahao Wang ◽

Ke Yang ◽

Yang Han ◽

...

Keyword(s):

Machine Learning ◽

Data Mining ◽

Sequence Data ◽

Learning Algorithms ◽

Machine Learning Algorithms


CNN based Image Classification Model

International Journal of Innovative Technology and Exploring Engineering - Special Issue ◽

10.35940/ijitee.k1225.09811s19 ◽

2019 ◽

Vol 8 (11S) ◽

pp. 1106-1114

Keyword(s):

Neural Network ◽

Machine Learning ◽

Image Classification ◽

Error Rates ◽

Machine Learning Algorithms ◽

Classification Model ◽

The Internet ◽

Network Algorithms ◽

Media Companies ◽

Automatic Tagging

Images are the fastest growing type of content and contribute significantly to the amount of data generated on the internet every day. Image classification is a challenging problem that social media companies work on vigorously to enhance the user’s experience with the interface. Recent advances in the fields of machine learning and computer vision enable personalized suggestions and automatic tagging of images. The convolutional neural network (CNN) is currently an active research topic in the field of machine learning. With the help of the large volume of labelled data available on the internet, the networks can be trained to recognize the differentiating features among images under the same label. New neural network algorithms that outperform state-of-the-art machine learning algorithms are developed frequently. Recent algorithms have managed to produce error rates as low as 3.1%. In this paper, the architectures of important CNN algorithms that have gained attention are discussed, analyzed and compared, and the concept of transfer learning is used to classify different breeds of dogs.


A Hybrid Meta-Learner Technique for Credit Scoring of Banks’ Customers

Engineering, Technology & Applied Science Research ◽

10.48084/etasr.1361 ◽

2017 ◽

Vol 7 (5) ◽

pp. 2073-2082 ◽

Cited By ~ 1

Author(s):

A. G. Armaki ◽

M. F. Fallah ◽

M. Alborzi ◽

A. Mohammadzadeh

Keyword(s):

Machine Learning ◽

Hybrid Model ◽

Credit Scoring ◽

Clustering Algorithms ◽

Real Life ◽

Ensemble Methods ◽

Scoring Systems ◽

Error Rates ◽

Machine Learning Algorithms ◽

Machine Learning Techniques

Financial institutions are exposed to credit risk due to the issuance of consumer loans. Thus, developing reliable credit scoring systems is crucial for them. Since machine learning techniques have demonstrated their applicability and merit, they have been extensively used in the credit scoring literature. Recent studies concentrating on hybrid models that merge various machine learning algorithms have revealed compelling results. There are two types of hybridization methods, namely traditional and ensemble methods. This study combines both of them and comes up with a hybrid meta-learner model. The structure of the model is based on the traditional hybrid model of ‘classification + clustering’, in which the stacking ensemble method is employed in the classification part. Moreover, this paper compares several versions of the proposed hybrid model by using various combinations of classification and clustering algorithms. Hence, it helps us to identify which hybrid model can achieve the best performance for credit scoring purposes. Using four real-life credit datasets, the experimental results show that the model of (KNN-NN-SVMPSO)-(DL)-(DBSCAN) delivers the highest prediction accuracy and the lowest error rates.


iLearn: an integrated platform and meta-learner for feature engineering, machine-learning analysis and modeling of DNA, RNA and protein sequence data

Briefings in Bioinformatics ◽

10.1093/bib/bbz041 ◽

2019 ◽

Vol 21 (3) ◽

pp. 1047-1057 ◽

Cited By ~ 57

Author(s):

Zhen Chen ◽

Pei Zhao ◽

Fuyi Li ◽

Tatiana T Marquez-Lago ◽

André Leier ◽

...

Keyword(s):

Machine Learning ◽

Dimensionality Reduction ◽

Sequence Data ◽

Machine Learning Algorithms ◽

User Friendliness ◽

Data Set ◽

Protein Sequence Data ◽

Learning Analysis ◽

High Throughput Manner ◽

Online Web

Abstract: With the explosive growth of biological sequences generated in the post-genomic era, one of the most challenging problems in bioinformatics and computational biology is to computationally characterize sequences, structures and functions in an efficient, accurate and high-throughput manner. A number of online web servers and stand-alone tools have been developed to address this to date; however, all these tools have their limitations and drawbacks in terms of their effectiveness, user-friendliness and capacity. Here, we present iLearn, a comprehensive and versatile Python-based toolkit, integrating the functionality of feature extraction, clustering, normalization, selection, dimensionality reduction, predictor construction, best descriptor/model selection, ensemble learning and results visualization for DNA, RNA and protein sequences. iLearn was designed for users that only want to upload their data set and select the functions they need calculated from it, while all necessary procedures and optimal settings are completed automatically by the software. iLearn includes a variety of descriptors for DNA, RNA and proteins, and four feature output formats are supported so as to facilitate direct output usage or communication with other computational tools. In total, iLearn encompasses 16 different types of feature clustering, selection, normalization and dimensionality reduction algorithms, and five commonly used machine-learning algorithms, thereby greatly facilitating feature analysis and predictor construction. iLearn is made freely available via an online web server and a stand-alone toolkit.


Assessing the Calibration in Toxicological in Vitro Models with Conformal Prediction

10.21203/rs.3.rs-220364/v1 ◽

2021 ◽

Author(s):

Andrea Morger ◽

Fredrik Svensson ◽

Staffan Arvidsson McShane ◽

Niharika Gauraha ◽

Ulf Norinder ◽

...

Keyword(s):

Machine Learning ◽

Test Data ◽

Cross Validation ◽

In Vitro Models ◽

Error Rates ◽

Machine Learning Algorithms ◽

Calibration Data ◽

Improvement Strategy ◽

Conformal Prediction

Abstract: Machine learning methods are widely used in drug discovery and toxicity prediction. While showing overall good performance in cross-validation studies, their predictive power (often) drops in cases where the query samples have drifted from the training data’s descriptor space. Thus, the assumption for applying machine learning algorithms, that training and test data stem from the same distribution, might not always be fulfilled. In this work, conformal prediction is used to assess the calibration of the models. Deviations from the expected error may indicate that training and test data originate from different distributions. Exemplified on the Tox21 datasets, composed of chronologically released Tox21Train, Tox21Test and Tox21Score subsets, we observed that while internally valid models could be trained using cross-validation on Tox21Train, predictions on the external Tox21Score data resulted in higher error rates than expected. To improve the prediction on the external sets, a strategy exchanging the calibration set with more recent data, such as Tox21Test, has successfully been introduced. We conclude that conformal prediction can be used to diagnose data drifts and other issues relating to model calibration. The proposed improvement strategy (exchanging the calibration data only) is convenient as it does not require retraining of the underlying model.
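
The role of the calibration set can be sketched with a minimal class-conditional (Mondrian) inductive conformal classifier. This is an illustration under stated assumptions, not the authors' implementation: the nonconformity score (one minus the predicted probability of the label), the function names, and the toy probabilities are all hypothetical.

```python
def conformal_p_values(calibration, test_probs):
    """Compute one conformal p-value per candidate label for a test sample.

    calibration: list of (probs, true_label) pairs, where probs maps each
                 label to the underlying model's predicted probability.
    test_probs:  dict mapping each label to the predicted probability
                 for the test sample.
    Nonconformity is assumed to be 1 - P(label); calibration scores are
    grouped by true label (class-conditional calibration).
    """
    scores_by_label = {}
    for probs, label in calibration:
        scores_by_label.setdefault(label, []).append(1.0 - probs[label])

    p_values = {}
    for label, prob in test_probs.items():
        cal_scores = scores_by_label.get(label, [])
        test_score = 1.0 - prob
        # Count calibration examples at least as nonconforming as the test one.
        n_at_least = sum(1 for s in cal_scores if s >= test_score)
        p_values[label] = (n_at_least + 1) / (len(cal_scores) + 1)
    return p_values


def prediction_set(p_values, significance):
    """Keep every label whose p-value exceeds the significance level."""
    return {label for label, p in p_values.items() if p > significance}


# Hypothetical calibration data: (model output, true label).
calibration = [
    ({"active": 0.9, "inactive": 0.1}, "active"),
    ({"active": 0.8, "inactive": 0.2}, "active"),
    ({"active": 0.6, "inactive": 0.4}, "active"),
    ({"active": 0.3, "inactive": 0.7}, "inactive"),
    ({"active": 0.05, "inactive": 0.95}, "inactive"),
]
p_vals = conformal_p_values(calibration, {"active": 0.75, "inactive": 0.25})
```

Because the calibration examples enter only through `conformal_p_values`, swapping in a more recent calibration list changes the p-values without touching the underlying model, which mirrors the recalibration strategy the abstract describes.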


Assessing the calibration in toxicological in vitro models with conformal prediction

Journal of Cheminformatics ◽

10.1186/s13321-021-00511-5 ◽

2021 ◽

Vol 13 (1) ◽

Author(s):

Andrea Morger ◽

Fredrik Svensson ◽

Staffan Arvidsson McShane ◽

Niharika Gauraha ◽

Ulf Norinder ◽

...

Keyword(s):

Machine Learning ◽

Test Data ◽

Cross Validation ◽

In Vitro Models ◽

Error Rates ◽

Machine Learning Algorithms ◽

Calibration Data ◽

Improvement Strategy ◽

Conformal Prediction

Abstract: Machine learning methods are widely used in drug discovery and toxicity prediction. While showing overall good performance in cross-validation studies, their predictive power (often) drops in cases where the query samples have drifted from the training data’s descriptor space. Thus, the assumption for applying machine learning algorithms, that training and test data stem from the same distribution, might not always be fulfilled. In this work, conformal prediction is used to assess the calibration of the models. Deviations from the expected error may indicate that training and test data originate from different distributions. Exemplified on the Tox21 datasets, composed of chronologically released Tox21Train, Tox21Test and Tox21Score subsets, we observed that while internally valid models could be trained using cross-validation on Tox21Train, predictions on the external Tox21Score data resulted in higher error rates than expected. To improve the prediction on the external sets, a strategy exchanging the calibration set with more recent data, such as Tox21Test, has successfully been introduced. We conclude that conformal prediction can be used to diagnose data drifts and other issues related to model calibration. The proposed improvement strategy (exchanging the calibration data only) is convenient as it does not require retraining of the underlying model.


Predicting gene ontology annotations from sequence data using kernel-based machine learning algorithms

Proceedings. 2004 IEEE Computational Systems Bioinformatics Conference, 2004. CSB 2004. ◽

10.1109/csb.2004.1332485 ◽

2004 ◽

Author(s):

J.J. Ward ◽

J.S. Sodhi ◽

B.F. Buxton ◽

D.T. Jones

Keyword(s):

Machine Learning ◽

Gene Ontology ◽

Sequence Data ◽

Learning Algorithms ◽

Machine Learning Algorithms


Single-Cell-Genomics-Facilitated Read Binning of Candidate Phylum EM19 Genomes from Geothermal Spring Metagenomes

Applied and Environmental Microbiology ◽

10.1128/aem.03140-15 ◽

2015 ◽

Vol 82 (4) ◽

pp. 992-1003 ◽

Cited By ~ 16

Author(s):

Eric D. Becraft ◽

Jeremy A. Dodsworth ◽

Senthil K. Murugapiran ◽

J. Ingemar Ohlsson ◽

Brandon R. Briggs ◽

...

Keyword(s):

Machine Learning ◽

Single Cell ◽

Great Basin ◽

Hot Springs ◽

Nitrogen Sources ◽

Core Gene ◽

Machine Learning Algorithms ◽

Metabolic Reconstruction ◽

Data Set ◽

Content Type

Abstract: The vast majority of microbial life remains uncatalogued due to the inability to cultivate these organisms in the laboratory. This “microbial dark matter” represents a substantial portion of the tree of life and of the populations that contribute to chemical cycling in many ecosystems. In this work, we leveraged an existing single-cell genomic data set representing the candidate bacterial phylum “Calescamantes” (EM19) to calibrate machine learning algorithms and define metagenomic bins directly from pyrosequencing reads derived from Great Boiling Spring in the U.S. Great Basin. Compared to other assembly-based methods, taxonomic binning with a read-based machine learning approach yielded final assemblies with the highest predicted genome completeness of any method tested. Read-first binning subsequently was used to extract Calescamantes bins from all metagenomes with abundant Calescamantes populations, including metagenomes from Octopus Spring and Bison Pool in Yellowstone National Park and Gongxiaoshe Spring in Yunnan Province, China. Metabolic reconstruction suggests that Calescamantes are heterotrophic, facultative anaerobes, which can utilize oxidized nitrogen sources as terminal electron acceptors for respiration in the absence of oxygen and use proteins as their primary carbon source. Despite their phylogenetic divergence, the geographically separate Calescamantes populations were highly similar in their predicted metabolic capabilities and core gene content, respiring O2, or oxidized nitrogen species for energy conservation in distant but chemically similar hot springs.


Using machine learning to guide targeted and locally-tailored empiric antibiotic prescribing in a children's hospital in Cambodia

Wellcome Open Research ◽

10.12688/wellcomeopenres.14847.1 ◽

2018 ◽

Vol 3 ◽

pp. 131 ◽

Cited By ~ 15

Author(s):

Mathupanee Oonsivilai ◽

Yin Mo ◽

Nantasit Luangasanatip ◽

Yoel Lubell ◽

Thyl Miliya ◽

...

Keyword(s):

Machine Learning ◽

Random Forest ◽

Antimicrobial Resistance ◽

Blood Culture ◽

Learning Algorithms ◽

Empiric Therapy ◽

Patient Data ◽

Machine Learning Algorithms ◽

Random Forest Method ◽

Empiric Antibiotic

Background: Early and appropriate empiric antibiotic treatment of patients suspected of having sepsis is associated with reduced mortality. The increasing prevalence of antimicrobial resistance reduces the efficacy of empiric therapy guidelines derived from population data. This problem is particularly severe for children in developing country settings. We hypothesized that by applying machine learning approaches to readily collected patient data, it would be possible to obtain individualized predictions for targeted empiric antibiotic choices. Methods and Findings: We analysed blood culture data collected from a 100-bed children's hospital in North-West Cambodia between February 2013 and January 2016. Clinical, demographic and living condition information was captured with 35 independent variables. Using these variables, we applied a suite of machine learning algorithms to predict Gram stains and whether bacterial pathogens could be treated with common empiric antibiotic regimens: i) ampicillin and gentamicin; ii) ceftriaxone; iii) none of the above. 243 patients with bloodstream infections were available for analysis. We found that the random forest method had the best predictive performance overall as assessed by the area under the receiver operating characteristic curve (AUC). The random forest method gave an AUC of 0.80 (95%CI 0.66-0.94) for predicting susceptibility to ceftriaxone, 0.74 (0.59-0.89) for susceptibility to ampicillin and gentamicin, 0.85 (0.70-1.00) for susceptibility to neither, and 0.71 (0.57-0.86) for Gram stain result. The most important variables for predicting susceptibility were time from admission to blood culture, patient age, hospital versus community-acquired infection, and age-adjusted weight score.
Conclusions: Applying machine learning algorithms to patient data that are readily available even in resource-limited hospital settings can provide highly informative predictions on antibiotic susceptibilities to guide appropriate empiric antibiotic therapy. When used as a decision support tool, such approaches have the potential to improve the targeting of empiric therapy and patient outcomes, and to reduce the burden of antimicrobial resistance.
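
The AUC used to compare methods here can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes it directly from that pairwise definition; the function name and the toy labels and scores are hypothetical, and the confidence intervals reported in the abstract are not reproduced.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: iterable of 1 (positive) or 0 (negative).
    scores: iterable of model scores, higher meaning "more likely positive".
    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical toy example: two positive and two negative cases.
auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

The pairwise form is quadratic in the number of samples, which is acceptable for a cohort of a few hundred patients; a rank-based formulation would be preferable for much larger data sets.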
