Assessing Pathogens for Natural versus Laboratory Origins Using Genomic Data and Machine Learning

AbstractPathogen genomic data is increasingly important in investigations of infectious disease outbreaks. The objective of this study is to develop methods for using large-scale genomic data to determine the type of the environment an outbreak pathogen came from. Specifically, this study focuses on assessing whether an outbreak strain came from a natural environment or experienced substantial laboratory culturing. The approach uses phylogenetic analyses and machine learning to identify DNA changes that are characteristic of laboratory culturing. The analysis methods include parallelized sequence read alignment, variant identification, phylogenetic tree construction, ancestral state reconstruction, semi-supervised classification, and random forests. These methods were applied to 902 Salmonella enterica serovar Typhimurium genomes from the NCBI Sequence Read Archive database. The analyses identified candidate signatures of laboratory culturing that are highly consistent with genes identified in published laboratory passage studies. In particular, the analysis identified mutations in rpoS, hfq, rfb genes, acrB, and rbsR as strong signatures of laboratory culturing. In leave-one-out cross-validation, the classifier had an area under the receiver operating characteristic (ROC) curve of 0.89 for strains from two laboratory reference sets collected in the 1940’s and 1980’s. The classifier was also used to assess laboratory culturing in foodborne and laboratory acquired outbreak strains closely related to laboratory reference strain serovar Typhimurium 14028. The classifier detected some evidence of laboratory culturing on the phylogeny branch leading to this clade, suggesting all of these strains may have a common ancestor that experienced laboratory culturing. Together, these results suggest that phylogenetic analysis and machine learning could be used to assess whether pathogens collected from patients are naturally occurring or have been extensively cultured in laboratories. The data analysis methods can be applied to any bacterial pathogen species, and could be adapted to assess viral pathogens and other types of source environments.

Download Full-text

Cross-species prediction of essential genes in insects through machine learning and sequence-based attributes

10.1101/2021.03.15.433440 ◽

2021 ◽

Author(s):

Giovanni Marques de Castro ◽

Zandora Hastenreiter ◽

Thiago Augusto Silva Monteiro ◽

Francisco Pereira Lobo

Keyword(s):

Machine Learning ◽

Large Scale ◽

Phenotypic Diversity ◽

Expression Profiles ◽

Genomic Data ◽

Essential Genes ◽

Insect Species ◽

Rational Approach ◽

Gradient Boosting ◽

Better Than

Background: Insects are organisms with a vast phenotypic diversity and key ecological roles. Several insect species also have medical, agricultural and veterinary importance as parasites and vectors of diseases. Therefore, strategies to identify potential essential genes in insects may reduce the resources needed to find molecular players in central processes of insect biology. Furthermore, the detection of essential genes that occur only in specific groups within insects, such as lineages containing insect pests and vectors, may provide a more rational approach to select essential genes for the development of insecticides with fewer off-target effects. However, most predictors of essential genes in multicellular eukaryotes using machine learning rely on expensive and laborious experimental data to be used as gene features, such as gene expression profiles or protein-protein interactions. This information is not available for the vast majority of insect species, which prevents this strategy to be effectively used to survey genomic data from non-model insect species for candidate essential genes. Here we present a general machine learning strategy to predict essential genes in insects using only sequence-based attributes (statistical and physicochemical data). We validate our strategy using genomic data for the two insect species where large-scale gene essentiality data is available: Drosophila melanogaster (fruit fly, Diptera) and Tribolium castaneum (red flour beetle, Coleoptera). Results: We used publicly available databases plus a thorough literature review to obtain databases of essential and non-essential genes for D. melanogaster and T. castaneum, and proceeded by computing sequence-based attributes that were used to train statistical models (Random Forest and Gradient Boosting Trees) to predict essential genes for each species. We demonstrated that both models are capable of distinguishing essential from non-essential genes significantly better than zero rule classifiers. Furthermore, models trained in one insect species are also capable of predicting essential genes in the other species significantly better than expected by chance. Finally, we demonstrated that the D. melanogaster model can distinguish between essential and non-essential T. castaneum genes with no known homologs in the fly significantly better than a zero-rule model, demonstrating that it is possible to use models trained using fly genes to predict lineage-specific essential genes in a phylogenetically distant insect order. Conclusions: Here we report, to the best of our knowledge, the development and validation of the first general predictor of essential genes in insects using sequence-based attributes that can, in principle, be computed for any insect species where genomic information is available. The code and data used to predict essential genes in insects are freely available at https://github.com/g1o/GeneEssentiality/.

Download Full-text

Identification of stem cells from large cell populations with topological scoring

Molecular Omics ◽

10.1039/d0mo00039f ◽

2021 ◽

Author(s):

Mihaela E. Sardiu ◽

Andrew C. Box ◽

Jeffrey S. Haug ◽

Michael P. Washburn

Keyword(s):

Machine Learning ◽

Stem Cells ◽

Large Scale ◽

Topological Analysis ◽

Large Cell ◽

Cell Populations ◽

Analysis Methods

Machine learning and topological analysis methods are becoming increasingly used on various large-scale omics datasets.

Download Full-text

Large-Scale Data Learning Method for Anomaly Detection using Machine Learning for Monitoring Vibration in Vehicle Equipment

IEEJ Transactions on Industry Applications ◽

10.1541/ieejias.140.480 ◽

2020 ◽

Vol 140 (6) ◽

pp. 480-487

Author(s):

Minoru Kondo

Keyword(s):

Machine Learning ◽

Anomaly Detection ◽

Large Scale ◽

Learning Method ◽

Large Scale Data ◽

Scale Data

Download Full-text

Coded Computing: Mitigating Fundamental Bottlenecks in Large-Scale Distributed Computing and Machine Learning

10.1561/9781680837056 ◽

2020 ◽

Author(s):

Songze Li ◽

Salman Avestimehr

Keyword(s):

Machine Learning ◽

Distributed Computing ◽

Large Scale

Download Full-text

Evolution of Metastable Structures in Bimetallic Catalysts from Microscopy and Machine-Learning Molecular Dynamics

10.26434/chemrxiv.11811660.v1 ◽

2020 ◽

Author(s):

Jin Soo Lim ◽

Jonathan Vandermause ◽

Matthijs A. van Spronsen ◽

Albert Musaelian ◽

Christopher R. O’Connor ◽

...

Keyword(s):

Machine Learning ◽

Molecular Dynamics ◽

Large Scale ◽

Materials Science ◽

Complete Characterization ◽

Layer By Layer ◽

Surface Restructuring ◽

Metastable Structures ◽

Mechanistic Investigation ◽

Underlying Mechanisms

Restructuring of interface plays a crucial role in materials science and heterogeneous catalysis. Bimetallic systems, in particular, often adopt very different composition and morphology at surfaces compared to the bulk. For the first time, we reveal a detailed atomistic picture of the long-timescale restructuring of Pd deposited on Ag, using microscopy, spectroscopy, and novel simulation methods. Encapsulation of Pd by Ag always precedes layer-by-layer dissolution of Pd, resulting in significant Ag migration out of the surface and extensive vacancy pits. These metastable structures are of vital catalytic importance, as Ag-encapsulated Pd remains much more accessible to reactants than bulk-dissolved Pd. The underlying mechanisms are uncovered by performing fast and large-scale machine-learning molecular dynamics, followed by our newly developed method for complete characterization of atomic surface restructuring events. Our approach is broadly applicable to other multimetallic systems of interest and enables the previously impractical mechanistic investigation of restructuring dynamics.

Download Full-text

Epigenetic Target Prediction with Accurate Machine Learning Models

10.26434/chemrxiv.13522313 ◽

2021 ◽

Author(s):

Norberto Sánchez-Cruz ◽

Jose L. Medina-Franco

Keyword(s):

Machine Learning ◽

Small Molecules ◽

Predictive Models ◽

Large Scale ◽

Target Prediction ◽

Quantitative Measure ◽

Learning Models ◽

Discovery Research ◽

Drug Discovery Research ◽

Machine Learning Models

<p>Epigenetic targets are a significant focus for drug discovery research, as demonstrated by the eight approved epigenetic drugs for treatment of cancer and the increasing availability of chemogenomic data related to epigenetics. This data represents a large amount of structure-activity relationships that has not been exploited thus far for the development of predictive models to support medicinal chemistry efforts. Herein, we report the first large-scale study of 26318 compounds with a quantitative measure of biological activity for 55 protein targets with epigenetic activity. Through a systematic comparison of machine learning models trained on molecular fingerprints of different design, we built predictive models with high accuracy for the epigenetic target profiling of small molecules. The models were thoroughly validated showing mean precisions up to 0.952 for the epigenetic target prediction task. Our results indicate that the herein reported models have considerable potential to identify small molecules with epigenetic activity. Therefore, our results were implemented as freely accessible and easy-to-use web application.</p>

Download Full-text

Efficient Image Retrieval approach for Large-scale Chest X Ray data using Hand-Crafted Features and Machine Learning Algorithms

International Journal of Computer Sciences and Engineering ◽

10.26438/ijcse/v6i11.890896 ◽

2018 ◽

Vol 6 (11) ◽

pp. 890-896

Author(s):

Irene Getzi S ◽

D. Christopher Durairaj ◽

V Joseph Raj

Keyword(s):

Machine Learning ◽

Image Retrieval ◽

Large Scale ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

X Ray ◽

Chest X Ray

Download Full-text

Adsorption Isotherm Predictions for Multiple Molecules in MOFs Using the Same Deep Learning Model

10.26434/chemrxiv.9894224.v1 ◽

2019 ◽

Author(s):

Ryther Anderson ◽

Achay Biong ◽

Diego Gómez-Gualdrón

Keyword(s):

Neural Network ◽

Machine Learning ◽

Molecular Simulation ◽

Large Scale ◽

Learning Model ◽

Operating Conditions ◽

Small Subset ◽

Screening Methods ◽

Large Set ◽

Metal Organic

<div>Tailoring the structure and chemistry of metal-organic frameworks (MOFs) enables the manipulation of their adsorption properties to suit specific energy and environmental applications. As there are millions of possible MOFs (with tens of thousands already synthesized), molecular simulation, such as grand canonical Monte Carlo (GCMC), has frequently been used to rapidly evaluate the adsorption performance of a large set of MOFs. This allows subsequent experiments to focus only on a small subset of the most promising MOFs. In many instances, however, even molecular simulation becomes prohibitively time consuming, underscoring the need for alternative screening methods, such as machine learning, to precede molecular simulation efforts. In this study, as a proof of concept, we trained a neural network as the first example of a machine learning model capable of predicting full adsorption isotherms of different molecules not included in the training of the model. To achieve this, we trained our neural network only on alchemical species, represented only by their geometry and force field parameters, and used this neural network to predict the loadings of real adsorbates. We focused on predicting room temperature adsorption of small (one- and two-atom) molecules relevant to chemical separations. Namely, argon, krypton, xenon, methane, ethane, and nitrogen. However, we also observed surprisingly promising predictions for more complex molecules, whose properties are outside the range spanned by the alchemical adsorbates. Prediction accuracies suitable for large-scale screening were achieved using simple MOF (e.g. geometric properties and chemical moieties), and adsorbate (e.g. forcefield parameters and geometry) descriptors. Our results illustrate a new philosophy of training that opens the path towards development of machine learning models that can predict the adsorption loading of any new adsorbate at any new operating conditions in any new MOF.</div>

Download Full-text

Recent Progress in Machine Learning-based Prediction of Peptide Activity for Drug Discovery

Current Topics in Medicinal Chemistry ◽

10.2174/1568026619666190122151634 ◽

2019 ◽

Vol 19 (1) ◽

pp. 4-16 ◽

Cited By ~ 6

Author(s):

Qihui Wu ◽

Hanzhong Ke ◽

Dongli Li ◽

Qi Wang ◽

Jiansong Fang ◽

...

Keyword(s):

Machine Learning ◽

Drug Discovery ◽

Large Scale ◽

Recent Progress ◽

High Specificity ◽

Learning Approaches ◽

Anticancer Peptides ◽

The Past ◽

Traditional Approaches ◽

Large Scale Screening

Over the past decades, peptide as a therapeutic candidate has received increasing attention in drug discovery, especially for antimicrobial peptides (AMPs), anticancer peptides (ACPs) and antiinflammatory peptides (AIPs). It is considered that the peptides can regulate various complex diseases which are previously untouchable. In recent years, the critical problem of antimicrobial resistance drives the pharmaceutical industry to look for new therapeutic agents. Compared to organic small drugs, peptide- based therapy exhibits high specificity and minimal toxicity. Thus, peptides are widely recruited in the design and discovery of new potent drugs. Currently, large-scale screening of peptide activity with traditional approaches is costly, time-consuming and labor-intensive. Hence, in silico methods, mainly machine learning approaches, for their accuracy and effectiveness, have been introduced to predict the peptide activity. In this review, we document the recent progress in machine learning-based prediction of peptides which will be of great benefit to the discovery of potential active AMPs, ACPs and AIPs.

Download Full-text

Statistical and machine learning models for optimizing energy in parallel applications

The International Journal of High Performance Computing Applications ◽

10.1177/1094342019842915 ◽

2019 ◽

Vol 33 (6) ◽

pp. 1079-1097 ◽

Cited By ~ 2

Author(s):

Mark Endrei ◽

Chao Jin ◽

Minh Ngoc Dinh ◽

David Abramson ◽

Heidi Poxon ◽

...

Keyword(s):

Machine Learning ◽

Energy Efficiency ◽

High Performance ◽

Large Scale ◽

Energy Use ◽

Parallel Applications ◽

Learning Models ◽

Trade Off ◽

Time Required ◽

Machine Learning Models

Rising power costs and constraints are driving a growing focus on the energy efficiency of high performance computing systems. The unique characteristics of a particular system and workload and their effect on performance and energy efficiency are typically difficult for application users to assess and to control. Settings for optimum performance and energy efficiency can also diverge, so we need to identify trade-off options that guide a suitable balance between energy use and performance. We present statistical and machine learning models that only require a small number of runs to make accurate Pareto-optimal trade-off predictions using parameters that users can control. We study model training and validation using several parallel kernels and more complex workloads, including Algebraic Multigrid (AMG), Large-scale Atomic Molecular Massively Parallel Simulator, and Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics. We demonstrate that we can train the models using as few as 12 runs, with prediction error of less than 10%. Our AMG results identify trade-off options that provide up to 45% improvement in energy efficiency for around 10% performance loss. We reduce the sample measurement time required for AMG by 90%, from 13 h to 74 min.

Download Full-text