Unsupervised Machine Learning Approach for Identifying Biomechanical Influences on Protein-Ligand Binding Affinity

Keyword(s):

Learning Approach ◽

Drug Candidates ◽

Machine Learning Approach ◽

Novel Drug ◽

Future Work

Abstract: Drug discovery is incredibly time-consuming and expensive, averaging over 10 years and $985 million per drug. Calculating the binding affinity between a target protein and a ligand is critical for discovering viable drugs. Although supervised machine learning (ML) can predict binding affinity accurately, models overfit severely because they cannot identify the informative properties of protein-ligand complexes. This study used unsupervised ML to reveal underlying protein-ligand characteristics that strongly influence binding affinity. Protein-ligand 3D models were collected from the PDBBind database and vectorized into 2422 features per complex. Principal Component Analysis (PCA), t-Distributed Stochastic Neighbor Embedding (t-SNE), K-Means Clustering, and heatmaps were used to identify groups of complexes and the features responsible for the separation. ML benchmarking was used to determine the features’ effect on ML performance. The PCA heatmap revealed groups of complexes with binding affinity of pKd < 6 and pKd > 8, and identified the numbers of CCCH and CCCCCH fragments in the ligand as the features most responsible for the separation. A high correlation of 0.8337, the two features' ability to explain 18% of the binding affinity’s variance, and an error increase of 0.09 in Decision Trees trained without the two features suggest that the fragments exist within a larger ligand substructure that significantly influences binding affinity. This discovery provides a baseline for generating informative ligand representations so that ML models overfit less and can more reliably identify novel drug candidates. Future work will focus on validating the ligand substructure’s presence and discovering more informative intra-ligand relationships.
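The PCA step described in this abstract can be illustrated with a minimal sketch. It is not the study's pipeline: synthetic data stands in for the 2422-feature PDBBind vectors, and the function name `pca_top_features`, the planted feature indices 3 and 7, and the group shift of 6.0 are all assumptions made for illustration. The sketch shows how the loadings of the first principal component can point to the features most responsible for separating two groups of samples.

```python
import numpy as np

def pca_top_features(X, n_top=2):
    """Return the indices of the features with the largest absolute
    loading on the first principal component, and the PC1 scores."""
    Xc = X - X.mean(axis=0)                      # center each feature
    # SVD of the centered data: rows of Vt are the principal axes
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    loadings = Vt[0]                             # first principal axis
    top = np.argsort(np.abs(loadings))[::-1][:n_top]
    scores = Xc @ loadings                       # projection onto PC1
    return top, scores

# Synthetic stand-in data (assumption): 200 samples, 20 features,
# two groups that differ only in hypothetical features 3 and 7.
rng = np.random.default_rng(0)
n_samples, n_features = 200, 20
X = rng.normal(size=(n_samples, n_features))
group = np.repeat([0, 1], n_samples // 2)
X[:, 3] += 6.0 * group
X[:, 7] += 6.0 * group

top_features, pc1_scores = pca_top_features(X)
```

Here `top_features` holds the two planted columns, and the PC1 scores of the two groups are clearly separated. On real data the loadings would be read alongside a heatmap, as the abstract describes, rather than trusted on their own.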

Mol2vec: Unsupervised Machine Learning Approach with Chemical Intuition

10.26434/chemrxiv.5513581.v1 ◽

2017 ◽

Author(s):

Sabrina Jaeger ◽

Simone Fulle ◽

Samo Turk

Keyword(s):

Machine Learning ◽

Language Processing ◽

Learning Approach ◽

Learning Approaches ◽

Unsupervised Machine Learning ◽

Feature Representations ◽

Machine Learning Approach ◽

The Individual ◽

Vector Representations

Inspired by natural language processing techniques, we here introduce Mol2vec, an unsupervised machine learning approach to learn vector representations of molecular substructures. Similar to Word2vec models, where vectors of closely related words are in close proximity in the vector space, Mol2vec learns vector representations of molecular substructures that point in similar directions for chemically related substructures. Compounds can finally be encoded as vectors by summing up the vectors of the individual substructures and, for instance, fed into supervised machine learning approaches to predict compound properties. The underlying substructure vector embeddings are obtained by training an unsupervised machine learning approach on a so-called corpus of compounds that consists of all available chemical matter. The resulting Mol2vec model is pre-trained once, yields dense vector representations, and overcomes drawbacks of common compound feature representations such as sparseness and bit collisions. The prediction capabilities are demonstrated on several compound property and bioactivity data sets and compared with results obtained for Morgan fingerprints as the reference compound representation. Mol2vec can be easily combined with ProtVec, which employs the same Word2vec concept on protein sequences, resulting in a proteochemometric approach that is alignment independent and can thus also be easily used for proteins with low sequence similarities.
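The encoding step described in this abstract, summing substructure vectors into one compound vector, can be sketched as follows. This is an illustration under stated assumptions, not the Mol2vec library's API: the three-dimensional embeddings, the identifiers `sub_a` to `sub_c`, the zero-vector fallback for unknown substructures, and the helper names are all hypothetical. In practice the embeddings would be learned by Word2vec-style training on a corpus of compounds.

```python
import numpy as np

# Hypothetical toy embeddings (assumption); real substructure vectors
# would come from a model trained on a compound corpus.
embeddings = {
    "sub_a": np.array([1.0, 0.0, 2.0]),
    "sub_b": np.array([0.5, 1.5, 0.0]),
    "sub_c": np.array([0.0, -1.0, 1.0]),
}
UNSEEN = np.zeros(3)  # assumed fallback for identifiers not in the table

def compound_vector(substructures, table, unseen=UNSEEN):
    """Encode a compound as the sum of its substructure vectors."""
    total = np.zeros_like(unseen)
    for identifier in substructures:
        total = total + table.get(identifier, unseen)
    return total

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# A compound containing sub_a once, sub_b twice, and one unknown fragment.
vec = compound_vector(["sub_a", "sub_b", "sub_b", "sub_x"], embeddings)
similarity = cosine_similarity(embeddings["sub_a"], 2.0 * embeddings["sub_a"])
```

Because the compound vector is a plain sum, repeated substructures contribute once per occurrence, and the result has the same dimensionality as the substructure embeddings, so it can be passed directly to a supervised model as a dense feature vector.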

International Journal of Computer Science and Information Technology for Education ◽

Coreference resolution of Korean anaphoric zero objects: Towards a supervised machine learning approach

10.21742/ijcsite.2016.1.01 ◽

2016 ◽

Vol 1 (1) ◽

pp. 1-6

Author(s):

Euhee Kim ◽

Myung-Kwan Park ◽

Keyword(s):

Machine Learning ◽

Learning Approach ◽

Coreference Resolution ◽

A Supervised Machine Learning Approach for the Credibility Assessment of User-Generated Content

Wireless Personal Communications ◽

10.1007/s11277-021-08136-5 ◽

2021 ◽

Author(s):

Praphula Kumar Jain ◽

Rajendra Pamula ◽

Sarfraj Ansari

Keyword(s):

Machine Learning ◽

User Generated Content ◽

Learning Approach ◽

Credibility Assessment ◽

Computer Methods in Biomechanics and Biomedical Engineering Imaging & Visualization ◽

Fast and robust supervised machine learning approach for classification and prediction of Parkinson’s disease onset

10.1080/21681163.2021.1941262 ◽

2021 ◽

pp. 1-17

Author(s):

Lavanya Madhuri Bollipo ◽

Kadambari K V

Keyword(s):

Machine Learning ◽

Parkinson's Disease ◽

Disease Onset ◽

Learning Approach ◽

2020 International Conference on System, Computation, Automation and Networking (ICSCAN) ◽

Supervised Machine Learning Approach For The Prediction of Breast Cancer

10.1109/icscan49426.2020.9262403 ◽

2020 ◽

Author(s):

Tarun Jain ◽

Vivek Kumar Verma ◽

Mahek Agarwal ◽

Anju Yadav ◽

Ashish Jain

Keyword(s):

Breast Cancer ◽

Machine Learning ◽

Learning Approach ◽

A supervised machine learning approach to author disambiguation in the Web of Science

Journal of Informetrics ◽

10.1016/j.joi.2021.101166 ◽

2021 ◽

Vol 15 (3) ◽

pp. 101166

Author(s):

Andreas Rehs

Keyword(s):

Machine Learning ◽

Web Of Science ◽

Learning Approach ◽

Machine Learning Approach ◽

Author Disambiguation ◽

The Web

A machine learning approach for the factorization of psychometric data with application to the Delis Kaplan Executive Function System

Scientific Reports ◽

10.1038/s41598-021-96342-3 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

J. A. Camilleri ◽

S. B. Eickhoff ◽

S. Weis ◽

J. Chen ◽

J. Amunts ◽

...

Keyword(s):

Machine Learning ◽

Executive Function ◽

Executive Functioning ◽

Factor Model ◽

Five Factor Model ◽

Principal Component ◽

Function System ◽

Learning Approach ◽

Psychometric Data ◽

Abstract: While a replicability crisis has shaken psychological sciences, the replicability of multivariate approaches for psychometric data factorization has received little attention. In particular, Exploratory Factor Analysis (EFA) is frequently promoted as the gold standard in psychological sciences. However, the application of EFA to executive functioning, a core concept in psychology and cognitive neuroscience, has led to divergent conceptual models. This heterogeneity severely limits the generalizability and replicability of findings. To tackle this issue, in this study, we propose to capitalize on a machine learning approach, OPNMF (Orthonormal Projective Non-Negative Factorization), and leverage internal cross-validation to promote generalizability to an independent dataset. We examined its application to the scores of 334 adults on the Delis–Kaplan Executive Function System (D-KEFS), comparing it to standard EFA and Principal Component Analysis (PCA). We further evaluated the replicability of the derived factorization across specific gender and age subsamples. Overall, OPNMF and PCA both converge towards a two-factor model as the best data-fit model. The derived factorization suggests a division between low-level and high-level executive functioning measures, a model further supported in subsamples. In contrast, EFA highlighted a five-factor model which reflects the segregation of the D-KEFS battery into its main tasks while still clustering higher-level tasks together. However, this model was poorly supported in the subsamples. Thus, the parsimonious two-factor model revealed by OPNMF encompasses the more complex factorization yielded by EFA while enjoying higher generalizability. Hence, OPNMF provides a conceptually meaningful, technically robust, and generalizable factorization for psychometric tools.