Resolving Protein Conformational Plasticity and Substrate Binding Through the Lens of Machine-Learning

Mapping Intimacies ◽

10.1101/2022.01.07.475334 ◽

2022 ◽

Author(s):

Navjeet Ahalawat ◽

Jagannath Mondal

Keyword(s):

Machine Learning ◽

Ligand Binding ◽

Protein Conformation ◽

A Priori ◽

Specific Protein ◽

Dynamics Simulation ◽

Supervised Machine Learning ◽

Receptor Protein ◽

Recognition Process ◽

Conformational Plasticity

A long-standing target in elucidating the biomolecular recognition process is the identification of binding-competent conformations of the receptor protein. However, protein conformational plasticity and the stochastic nature of the recognition processes often preclude the assignment of a specific protein conformation to an individual ligand-bound pose. In particular, we consider multi-microsecond long Molecular dynamics simulation trajectories of ligand recognition process in solvent-inaccessible cavity of two archtypal systems: L99A mutant of T4 Lysozyme and Cytochrome P450. We first show that if the substrate-recognition occurs via long-lived intermediate, the protein conformations can be automatically classified into substrate-bound and unbound state through an unsupervised dimensionality reduction technique. On the contrary, if the recognition process is mediated by selection of transient protein conformation by the ligand, a clear correspondence between protein conformation and binding-competent macrostates can only be established via a combination of supervised machine learning (ML) and unsupervised dimension reduction approach. In such scenario, we demonstrate that a priori random forest based supervised classification of the simulated trajectories recognition process would help characterize key amino-acid residue-pairs of the protein that are deemed sensitive for ligand binding. A subsequent unsupervised dimensional reduction via time-lagged independent component analysis of the selected residue-pairs would delineate a conformational landscape of protein which is able to demarcate ligand-bound pose from the unbound ones. As a key breakthrough, the ML-based protocol would identify distal protein locations which would be allosterically important for ligand binding and characterise their roles in recognition pathways.

Download Full-text

On the Unfounded Enthusiasm for Soft Selective Sweeps III: The Supervised Machine Learning Algorithm That Isn’t

Genes ◽

10.3390/genes12040527 ◽

2021 ◽

Vol 12 (4) ◽

pp. 527

Author(s):

Eran Elhaik ◽

Dan Graur

Keyword(s):

Machine Learning ◽

Learning Algorithm ◽

A Priori ◽

Neutral Theory ◽

Dominant Mode ◽

Supervised Machine Learning ◽

Training Dataset ◽

Selective Sweeps ◽

Two Factors ◽

Negative Controls

In the last 15 years or so, soft selective sweep mechanisms have been catapulted from a curiosity of little evolutionary importance to a ubiquitous mechanism claimed to explain most adaptive evolution and, in some cases, most evolution. This transformation was aided by a series of articles by Daniel Schrider and Andrew Kern. Within this series, a paper entitled “Soft sweeps are the dominant mode of adaptation in the human genome” (Schrider and Kern, Mol. Biol. Evolut. 2017, 34(8), 1863–1877) attracted a great deal of attention, in particular in conjunction with another paper (Kern and Hahn, Mol. Biol. Evolut. 2018, 35(6), 1366–1371), for purporting to discredit the Neutral Theory of Molecular Evolution (Kimura 1968). Here, we address an alleged novelty in Schrider and Kern’s paper, i.e., the claim that their study involved an artificial intelligence technique called supervised machine learning (SML). SML is predicated upon the existence of a training dataset in which the correspondence between the input and output is known empirically to be true. Curiously, Schrider and Kern did not possess a training dataset of genomic segments known a priori to have evolved either neutrally or through soft or hard selective sweeps. Thus, their claim of using SML is thoroughly and utterly misleading. In the absence of legitimate training datasets, Schrider and Kern used: (1) simulations that employ many manipulatable variables and (2) a system of data cherry-picking rivaling the worst excesses in the literature. These two factors, in addition to the lack of negative controls and the irreproducibility of their results due to incomplete methodological detail, lead us to conclude that all evolutionary inferences derived from so-called SML algorithms (e.g., S/HIC) should be taken with a huge shovel of salt.

Download Full-text

Text Classification and Tagging of United States Army Ground Vehicle Fault Descriptions in Support of Data-Driven Prognostics

Annual Conference of the PHM Society ◽

10.36001/phmconf.2020.v12i1.1154 ◽

2020 ◽

Vol 12 (1) ◽

pp. 8

Author(s):

Brandon Hansen ◽

Cody Coleman ◽

Yi Zhang ◽

Maria Seale

Keyword(s):

Machine Learning ◽

Natural Language ◽

A Priori ◽

Supervised Machine Learning ◽

Data Driven ◽

A Priori Knowledge ◽

Data Sets ◽

Ground Vehicle ◽

Us Army ◽

Priori Knowledge

The manner in which a prognostics problem is framed is critical for enabling its solution by the proper method. Recently, data-driven prognostics techniques have demonstrated enormous potential when used alone, or as part of a hybrid solution in conjunction with physics-based models. Historical maintenance data constitutes a critical element for the use of a data-driven approach to prognostics, such as supervised machine learning. The historical data is used to create training and testing data sets to develop the machine learning model. Categorical classes for prediction are required for machine learning methods; however, faults of interest in US Army Ground Vehicle Maintenance Records appear as natural language text descriptions rather than a finite set of discrete labels. Transforming linguistically complex data into a set of prognostics classes is necessary for utilizing supervised machine learning approaches for prognostics. Manually labeling fault description instances is effective, but extremely time-consuming; thus, an automated approach to labelling is preferred. The approach described in this paper examines key aspects of the fault text relevant to enabling automatic labeling. A method was developed based on the hypothesis that a given fault description could be generalized into a category. This method uses various natural language processing (NLP) techniques and a priori knowledge of ground vehicle faults to assign classes to the maintenance fault descriptions. The core component of the method used in this paper is a Word2Vec word-embedding model. Word embeddings are used in conjunction with a token-oriented rule-based data structure for document classification. This methodology tags text with user-provided classes using a corpus of similar text fields as its training set. With classes of faults reliably assigned to a given description, supervised machine learning with these classes can be applied using related maintenance information that preceded the fault. This method was developed for labeling US Army Ground Vehicle Maintenance Records, but is general enough to be applied to any natural language data sets accompanied with a priori knowledge of its contents for consistent labeling. In addition to applications in machine learning, generated labels are also conducive to general summarization and case-by-case analysis of faults. The maintenance components of interest for this current application are alternators and gaskets, with future development directed towards determining the RUL of these components based on the labeled data.

Download Full-text

A deep learning and novelty detection framework for rapid phenotyping in high-content screening

10.1101/134627 ◽

2017 ◽

Cited By ~ 2

Author(s):

Christoph Sommer ◽

Rudolf Hoefler ◽

Matthias Samwer ◽

Daniel W. Gerlich

Keyword(s):

Machine Learning ◽

Deep Learning ◽

Large Scale ◽

Novelty Detection ◽

A Priori ◽

Mitotic Cell ◽

Supervised Machine Learning ◽

High Content Screening ◽

Data Sets ◽

User Training

AbstractSupervised machine learning is a powerful and widely used method to analyze high-content screening data. Despite its accuracy, efficiency, and versatility, supervised machine learning has drawbacks, most notably its dependence on a priori knowledge of expected phenotypes and time-consuming classifier training. We provide a solution to these limitations with CellCognition Explorer, a generic novelty detection and deep learning framework. Application to several large-scale screening data sets on nuclear and mitotic cell morphologies demonstrates that CellCognition Explorer enables discovery of rare phenotypes without user training, which has broad implications for improved assay development in high-content screening.

Download Full-text

COVID-19-related conspiracy beliefs and their relationship with perceived stress and pre-existing conspiracy beliefs in a Prolific Academic sample: A replication and extension of Georgiou et al. (2020)

10.31234/osf.io/t62s7 ◽

2021 ◽

Author(s):

Mae Braud ◽

Aurore Gaboriaud ◽

Thibaud Ferry ◽

Wassila El Mardi ◽

Léa DA SILVA ◽

...

Keyword(s):

Machine Learning ◽

Perceived Stress ◽

A Priori ◽

Education Level ◽

Conspiracy Theory ◽

Supervised Machine Learning ◽

Attachment Avoidance ◽

Negative Attitudes ◽

Alpha Level ◽

Conspiracy Beliefs

The authors conducted a close replication of a study by Georgiou et al (2020), who found amongst 660 (reported in abstract) or 640 (reported in participant section) participants that 1) Covid-19 related conspiracy theory beliefs were strongly related to broader conspiracy theory beliefs, that 2) Covid-19 related conspiracy beliefs were higher in those with lower levels of education, and that 3) Covid-19 related conspiracy beliefs were positively (although weakly) correlated with more negative attitudes towards different individual items measuring the government’s response. Finally, they find that 4) Covid-19 beliefs were unrelated to self-reported stress. In a pre-registered replication and extension in a study sufficiently well-powered to detect f2 = 0.05, at an alpha level of .05, with an a priori power of .95, and with 5 Predictors in a multiple regression analysis, we do not find the same results. First, we find that education level is unrelated to Covid-19 related conspiracy beliefs, that stress is related to Covid-19 related conspiracy beliefs, but that the government’s response is indeed related to Covid-19 related conspiracy beliefs. We point out measurement problems in measuring conspiracy beliefs, extend the study through supervised machine learning by finding that attachment avoidance and anxiety are important predictors of conspiracy beliefs (Covid-19-related and beyond). Part of the differences between their and our study are likely due to differences in analysis approach; others may be due to the errors in Georgiou et al.’s (2020 reporting.

Download Full-text

On the Inapplicability of Supervised Machine Learning to Evolutionary Studies

10.20944/preprints202012.0214.v1 ◽

2020 ◽

Author(s):

Eran Elhaik ◽

Dan Graur

Keyword(s):

Machine Learning ◽

A Priori ◽

Supervised Machine Learning ◽

Training Dataset ◽

The Bible ◽

Human Genomes ◽

Evolutionary Studies ◽

Two Factors ◽

Negative Controls

Supervised machine learning (SML) is a powerful method for predicting a small number of well-defined output groups (e.g., potential buyers of a certain product) by taking as input a large number of known well-defined measurements (e.g., past purchases, income, ethnicity, gender, credit record, age, favorite color, favorite chewing gum). SML is predicated upon the existence of a training dataset in which the correspondence between the input and output is known to be true. SML has had enormous success in the world of commerce, and this success has prompted a few scientists to employ it in the study of molecular and genome evolution. Here, we list the properties of SML that make it an unsuitable tool in evolutionary studies. In particular, we argue that SML cannot be used in an evolutionary exploratory context for the simple reason that training datasets that are known to be a priori true do not exist. As a case study, we use an SML study in which it was concluded that most human genomes evolve by positive selection through soft selective sweeps (Schrider and Kern 2017). We show that in the absence of legitimate training datasets, Schrider and Kern (2017) used (1) simulations that employ many manipulatable variables and (2) a system of cherry-picking data that would put to shame most modern evangelical exegeses of the Bible. These two factors, in addition to the lack of methodological detail and the lack of either negative controls or corrections for multiple comparisons, lead us to conclude that all evolutionary inferences derived from so-called SML algorithms (e.g., discoal) should be taken with a huge shovel of salt.

Download Full-text

Computational Prediction of Binding Affinity for CDK2-ligand Complexes. A Protein Target for Cancer Drug Discovery

Current Medicinal Chemistry ◽

10.2174/0929867328666210806105810 ◽

2021 ◽

Vol 28 ◽

Author(s):

Martina Veit-Acosta ◽

Walter Filgueira de Azevedo Junior

Keyword(s):

Machine Learning ◽

Ligand Binding ◽

Binding Affinity ◽

Physical Modeling ◽

Predictive Performance ◽

Supervised Machine Learning ◽

Machine Learning Techniques ◽

Scoring Functions ◽

Protein Target ◽

Learning Techniques

Background: CDK2 participates in the control of eukaryotic cell-cycle progression. Due to the great interest in CDK2 for drug development and the relative easiness in crystallizing this enzyme, we have over 400 structural studies focused on this protein target. This structural data is the basis for the development of computational models to estimate CDK2-ligand binding affinity. Objective: This work focuses on the recent developments in the application of supervised machine learning modeling to develop scoring functions to predict the binding affinity of CDK2. Method: We employed the structures available at the protein data bank and the ligand information accessed from the BindingDB, Binding MOAD, and PDBbind to evaluate the predictive performance of machine learning techniques combined with physical modeling used to calculate binding affinity. We compared this hybrid methodology with classical scoring functions available in docking programs. Results: Our comparative analysis of previously published models indicated that a model created using a combination of a mass-spring system and cross-validated Elastic Net to predict the binding affinity of CDK2-inhibitor complexes outperformed classical scoring functions available in AutoDock4 and AutoDock Vina. Conclusion: All studies reviewed here suggest that targeted machine learning models are superior to classical scoring functions to calculate binding affinities. Specifically for CDK2, we see that the combination of physical modeling with supervised machine learning techniques exhibits improved predictive performance to calculate the protein-ligand binding affinity. These results find theoretical support in the application of the concept of scoring function space.

Download Full-text

Predicting lipid and ligand binding sites in TRPV1 channel by molecular dynamics simulation and machine learning

Proteins Structure Function and Bioinformatics ◽

10.1002/prot.26075 ◽

2021 ◽

Author(s):

Wenjun Zheng ◽

Han Wen

Keyword(s):

Machine Learning ◽

Molecular Dynamics ◽

Molecular Dynamics Simulation ◽

Ligand Binding ◽

Binding Sites ◽

Dynamics Simulation ◽

Trpv1 Channel ◽

Ligand Binding Sites

Download Full-text

A deep learning and novelty detection framework for rapid phenotyping in high-content screening

Molecular Biology of the Cell ◽

10.1091/mbc.e17-05-0333 ◽

2017 ◽

Vol 28 (23) ◽

pp. 3428-3436 ◽

Cited By ~ 34

Author(s):

Christoph Sommer ◽

Rudolf Hoefler ◽

Matthias Samwer ◽

Daniel W. Gerlich

Keyword(s):

Machine Learning ◽

Deep Learning ◽

Large Scale ◽

Novelty Detection ◽

A Priori ◽

Mitotic Cell ◽

Supervised Machine Learning ◽

High Content Screening ◽

Data Sets ◽

User Training

Supervised machine learning is a powerful and widely used method for analyzing high-content screening data. Despite its accuracy, efficiency, and versatility, supervised machine learning has drawbacks, most notably its dependence on a priori knowledge of expected phenotypes and time-consuming classifier training. We provide a solution to these limitations with CellCognition Explorer, a generic novelty detection and deep learning framework. Application to several large-scale screening data sets on nuclear and mitotic cell morphologies demonstrates that CellCognition Explorer enables discovery of rare phenotypes without user training, which has broad implications for improved assay development in high-content screening.

Download Full-text

Supervised Machine Learning Methods Applied to Predict Ligand- Binding Affinity

Current Medicinal Chemistry ◽

10.2174/0929867324666170623092503 ◽

2017 ◽

Vol 24 (23) ◽

Cited By ~ 32

Author(s):

Gabriela S. Heck ◽

Val O. Pintro ◽

Richard R. Pereira ◽

Mauricio B. de Ávila ◽

Nayara M.B. Levin ◽

...

Keyword(s):

Machine Learning ◽

Ligand Binding ◽

Binding Affinity ◽

Supervised Machine Learning ◽

Learning Methods ◽

Machine Learning Methods

Download Full-text

On the Inapplicability of Supervised Machine Learning to Studying the Driving Forces of Evolution

10.20944/preprints202012.0214.v2 ◽

2021 ◽

Author(s):

Eran Elhaik ◽

Dan Graur

Keyword(s):

Machine Learning ◽

Driving Forces ◽

A Priori ◽

Supervised Machine Learning ◽

Training Dataset ◽

The Bible ◽

Human Genomes ◽

Two Factors ◽

Negative Controls

Supervised machine learning (SML) is a powerful method for predicting a small number of well-defined output groups (e.g., potential buyers of a certain product) by taking as input a large number of known well-defined measurements (e.g., past purchases, income, ethnicity, gender, credit record, age, favorite color, favorite chewing gum). SML is predicated upon the existence of a training dataset in which the correspondence between the input and output is known to be true. SML has had enormous success in the world of commerce, and this success may have prompted a few scientists to employ it in the study of molecular and genome evolution. Here, we list the properties of SML that make it an unsuitable tool in certain evolutionary studies. In particular, we argue that SML cannot be used in an evolutionary exploratory context for the simple reason that training datasets that are known to be a priori true do not exist. As a case study, we use an SML study in which it was concluded that most human genomes evolve by positive selection through soft selective sweeps (Schrider and Kern 2017). We show that in the absence of legitimate training datasets, Schrider and Kern (2017) used (1) simulations that employ many manipulatable variables and (2) a system of cherry-picking data that would put to shame most modern evangelical exegeses of the Bible. These two factors, in addition to the lack of methodological detail and negative controls, lead us to conclude that all evolutionary inferences derived from so-called SML algorithms (e.g., S-HIC) should be taken with a huge shovel of salt.

Download Full-text