Exploration and Augmentation of Pharmacological Space via Adversarial Auto-encoder Model for Facilitating Kinase-centric Drug Development

AbstractPredicting compound–protein interactions (CPIs) is of great importance for drug discovery and repositioning, yet still challenging mainly due to the sparse nature of CPI matrixes, resulting in poor generalization performance. Hence, unlike typical CPI prediction models focused on representation learning or model selection, we propose a deep neural network-based strategy, PCM-AAE, that re-explores and augments the pharmacological space of kinase inhibitors by introducing the adversarial auto-encoder model (AAE) to improve the generalization of the prediction model. To complete the data space, we constructed Ensemble of PCM-AAE (EPA), an ensemble model that quickly and accurately yields quantitative predictions of binding affinity between any human kinase and inhibitor. In rigorous internal validation, EPA showed excellent performance, consistently outperforming the model trained with the imbalanced set, especially for targets with relatively fewer training data points. Improved prediction accuracy of EPA for external datasets enhances its generalization ability, making it possible to gracefully handle previously unseen kinases and inhibitors. EPA showed promising potential when directly applied to virtual screening and off-target prediction, exhibiting its practicality in hit prediction. Our strategy is expected to facilitate kinase-centric drug development, as well as to solve more challenging prediction problems with insufficient data points.

Download Full-text

Protein domain-based prediction of drug/compound–target interactions and experimental validation on LIM kinases

PLoS Computational Biology ◽

10.1371/journal.pcbi.1009171 ◽

2021 ◽

Vol 17 (11) ◽

pp. e1009171

Author(s):

Tunca Doğan ◽

Ece Akhan Güzelcan ◽

Marcus Baumann ◽

Altay Koyas ◽

Heval Atas ◽

...

Keyword(s):

Protein Interactions ◽

Protein Domain ◽

Training Data ◽

Drug Candidate ◽

Domain Pair ◽

Lim Kinase ◽

Target Proteins ◽

Data Points ◽

Drug Compound ◽

Compound Target

Predictive approaches such as virtual screening have been used in drug discovery with the objective of reducing developmental time and costs. Current machine learning and network-based approaches have issues related to generalization, usability, or model interpretability, especially due to the complexity of target proteins’ structure/function, and bias in system training datasets. Here, we propose a new method “DRUIDom” (DRUg Interacting Domain prediction) to identify bio-interactions between drug candidate compounds and targets by utilizing the domain modularity of proteins, to overcome problems associated with current approaches. DRUIDom is composed of two methodological steps. First, ligands/compounds are statistically mapped to structural domains of their target proteins, with the aim of identifying their interactions. As such, other proteins containing the same mapped domain or domain pair become new candidate targets for the corresponding compounds. Next, a million-scale dataset of small molecule compounds, including those mapped to domains in the previous step, are clustered based on their molecular similarities, and their domain associations are propagated to other compounds within the same clusters. Experimentally verified bioactivity data points, obtained from public databases, are meticulously filtered to construct datasets of active/interacting and inactive/non-interacting drug/compound–target pairs (~2.9M data points), and used as training data for calculating parameters of compound–domain mappings, which led to 27,032 high-confidence associations between 250 domains and 8,165 compounds, and a finalized output of ~5 million new compound–protein interactions. DRUIDom is experimentally validated by syntheses and bioactivity analyses of compounds predicted to target LIM-kinase proteins, which play critical roles in the regulation of cell motility, cell cycle progression, and differentiation through actin filament dynamics. We showed that LIMK-inhibitor-2 and its derivatives significantly block the cancer cell migration through inhibition of LIMK phosphorylation and the downstream protein cofilin. One of the derivative compounds (LIMKi-2d) was identified as a promising candidate due to its action on resistant Mahlavu liver cancer cells. The results demonstrated that DRUIDom can be exploited to identify drug candidate compounds for intended targets and to predict new target proteins based on the defined compound–domain relationships. Datasets, results, and the source code of DRUIDom are fully-available at: https://github.com/cansyl/DRUIDom.

Download Full-text

Probabilistic Random Forest improves bioactivity predictions close to the classification threshold by taking into account experimental uncertainty

Journal of Cheminformatics ◽

10.1186/s13321-021-00539-7 ◽

2021 ◽

Vol 13 (1) ◽

Author(s):

Lewis H. Mervin ◽

Maria-Anna Trapotsi ◽

Avid M. Afzal ◽

Ian P. Barrett ◽

Andreas Bender ◽

...

Keyword(s):

Random Forest ◽

Prediction Models ◽

Target Prediction ◽

Training Data ◽

Substantial Part ◽

Decision Threshold ◽

Experimental Uncertainty ◽

Protein Ligand Interactions ◽

Ligand Interactions ◽

Experimental Errors

AbstractMeasurements of protein–ligand interactions have reproducibility limits due to experimental errors. Any model based on such assays will consequentially have such unavoidable errors influencing their performance which should ideally be factored into modelling and output predictions, such as the actual standard deviation of experimental measurements (σ) or the associated comparability of activity values between the aggregated heterogenous activity units (i.e., Ki versus IC50 values) during dataset assimilation. However, experimental errors are usually a neglected aspect of model generation. In order to improve upon the current state-of-the-art, we herein present a novel approach toward predicting protein–ligand interactions using a Probabilistic Random Forest (PRF) classifier. The PRF algorithm was applied toward in silico protein target prediction across ~ 550 tasks from ChEMBL and PubChem. Predictions were evaluated by taking into account various scenarios of experimental standard deviations in both training and test sets and performance was assessed using fivefold stratified shuffled splits for validation. The largest benefit in incorporating the experimental deviation in PRF was observed for data points close to the binary threshold boundary, when such information was not considered in any way in the original RF algorithm. For example, in cases when σ ranged between 0.4–0.6 log units and when ideal probability estimates between 0.4–0.6, the PRF outperformed RF with a median absolute error margin of ~ 17%. In comparison, the baseline RF outperformed PRF for cases with high confidence to belong to the active class (far from the binary decision threshold), although the RF models gave errors smaller than the experimental uncertainty, which could indicate that they were overtrained and/or over-confident. Finally, the PRF models trained with putative inactives decreased the performance compared to PRF models without putative inactives and this could be because putative inactives were not assigned an experimental pXC50 value, and therefore they were considered inactives with a low uncertainty (which in practice might not be true). In conclusion, PRF can be useful for target prediction models in particular for data where class boundaries overlap with the measurement uncertainty, and where a substantial part of the training data is located close to the classification threshold.

Download Full-text

Probabilistic Random Forest Improves Bioactivity Predictions Close to the Classification Threshold by Taking into Account Experimental Uncertainty

10.26434/chemrxiv.14544291.v1 ◽

2021 ◽

Author(s):

Lewis Mervin ◽

Maria-Anna Trapotsi ◽

Avid M. Afzal ◽

Ian Barrett ◽

Andreas Bender ◽

...

Keyword(s):

Random Forest ◽

Experimental Error ◽

Prediction Models ◽

Binary Classification ◽

Target Prediction ◽

Training Data ◽

Substantial Part ◽

Decision Threshold ◽

Experimental Uncertainty ◽

Activity Data

In the context of small molecule property prediction, experimental errors are usually a neglected aspect during model generation. The main caveat to binary classification approaches is that they weight minority cases close to the threshold boundary equivalently in distinguishing between activity classes. For example, a pXC50 activity value of 5.1 or 4.9 are treated equally important in contributing to the opposing activity (e.g., classification threshold of 5), even though experimental error may not afford such discriminatory accuracy. This is detrimental in practice and therefore it is equally important to evaluate the presence of experimental error in databases and apply methodologies to account for variability in experiments and uncertainty near the decision boundary. In order to improve upon this, we herein present a novel approach toward predicting protein-ligand interactions using a Probabilistic Random Forest (PRF) classifier. The PRF comprises a modification to the long-established Random Forest (RF), to take into account uncertainties in the assigned classes (i.e., activity labels). This enables representing the activity in a framework in-between the classification and regression architecture, with philosophical differences from either approach. Compared to classification, this approach enables better representation of factors increasing/decreasing inactivity. Conversely, one can utilize all data (even delimited/operand/censored data far from a cut-off) at the same time as taking into account the granularity around the cut-off, compared to a classical regression framework. The algorithm was applied toward ~550 target prediction tasks from ChEMBL and PubChem. The largest benefit in incorporating the experimental deviation in PRF was observed for data points close to the binary threshold boundary, when such information is not considered in any way in the original RF algorithm. In comparison, the baseline RF outperformed PRF for cases with high confidence to belong to the active class (far from the binary decision threshold). The RF models gave errors smaller than the experimental uncertainty, which could indicate that they are overtrained and/or over-confident. Overall, we show that PRF can be useful for target prediction models in particular for data where class boundaries overlap with the measurement uncertainty, and where a substantial part of the training data is located close to the classification threshold. With this approach, we present, to our knowledge, for the first time an application of probabilistic modelling of activity data for target prediction using the PRF algorithm.

Download Full-text

Protein Domain-Based Prediction of Compound–Target Interactions and Experimental Validation on LIM Kinases

10.1101/2021.06.14.448307 ◽

2021 ◽

Author(s):

Tunca Doğan ◽

Ece Akhan Güzelcan ◽

Marcus Baumann ◽

Altay Koyas ◽

Heval Atas ◽

...

Keyword(s):

Protein Interactions ◽

Computational Method ◽

Protein Domain ◽

Training Data ◽

Drug Candidate ◽

Domain Pair ◽

Lim Kinase ◽

Target Proteins ◽

Data Points ◽

Compound Target

Predictive approaches such as virtual screening have been used in drug discovery with the objective of reducing developmental time and costs. Current machine learning and network-based approaches have issues related to generalization, usability, or model interpretability, especially due to the complexity of target proteins’ structure/function, and bias in system training datasets. Here, we propose a new computational method “DRUIDom” to predict bio-interactions between drug candidate compounds and target proteins by utilizing the domain modularity of proteins, to overcome problems associated with current approaches. DRUIDom is composed of two methodological steps. First, ligands/compounds are statistically mapped to structural domains of their target proteins, with the aim of identifying physical or functional interactions. As such, other proteins containing the mapped domain or domain pair become new candidate targets for the corresponding compounds. Next, a million-scale dataset of small molecule compounds, including the ones mapped to domains in the previous step, are clustered based on their molecular similarities, and their domain associations are propagated to other compounds within the same clusters. Experimentally verified bioactivity data points, obtained from public databases, are meticulously filtered to construct datasets of active/interacting and inactive/non-interacting compound–target pairs (~2.9M data points), and used as training data for calculating parameters of compound–domain mappings, which led to 27,032 high-confidence associations between 250 domains and 8,165 compounds, and a finalized output of ~5 million new compound–protein interactions. DRUIDom is experimentally validated by syntheses and bioactivity analyses of compounds predicted to target LIM-kinase proteins, which play critical roles in the regulation of cell motility, cell cycle progression, and differentiation through actin filament dynamics. We showed that LIMK-inhibitor-2 and its derivatives significantly block the cancer cell migration through inhibition of LIMK phosphorylation and the downstream protein cofilin. One of the derivative compounds (LIMKi-2d) was identified as a promising candidate due to its action on resistant Mahlavu liver cancer cells. The results demonstrated that DRUIDom can be exploited to identify drug candidate compounds for intended targets and to predict new target proteins based on the defined compound–domain relationships. The datasets, results, and the source code of DRUIDom are fully-available at: https://github.com/cansyl/DRUIDom.

Download Full-text

Comprehensive machine-learning-based analysis of microRNA–target interactions reveals variable transferability of interaction rules across species

BMC Bioinformatics ◽

10.1186/s12859-021-04164-x ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Gilad Ben Or ◽

Isana Veksler-Lublinsky

Keyword(s):

Machine Learning ◽

Mirna Target ◽

Data Science ◽

Prediction Models ◽

Target Prediction ◽

Evolutionary Distance ◽

Training Data ◽

Model Organisms ◽

Messenger Rnas ◽

Microrna Target

Abstract Background MicroRNAs (miRNAs) are small non-coding RNAs that regulate gene expression post-transcriptionally via base-pairing with complementary sequences on messenger RNAs (mRNAs). Due to the technical challenges involved in the application of high-throughput experimental methods, datasets of direct bona fide miRNA targets exist only for a few model organisms. Machine learning (ML)-based target prediction models were successfully trained and tested on some of these datasets. There is a need to further apply the trained models to organisms in which experimental training data are unavailable. However, it is largely unknown how the features of miRNA–target interactions evolve and whether some features have remained fixed during evolution, raising questions regarding the general, cross-species applicability of currently available ML methods. Results We examined the evolution of miRNA–target interaction rules and used data science and ML approaches to investigate whether these rules are transferable between species. We analyzed eight datasets of direct miRNA–target interactions in four species (human, mouse, worm, cattle). Using ML classifiers, we achieved high accuracy for intra-dataset classification and found that the most influential features of all datasets overlap significantly. To explore the relationships between datasets, we measured the divergence of their miRNA seed sequences and evaluated the performance of cross-dataset classification. We found that both measures coincide with the evolutionary distance between the compared species. Conclusions The transferability of miRNA–targeting rules between species depends on several factors, the most associated factors being the composition of seed families and evolutionary distance. Furthermore, our feature-importance results suggest that some miRNA–target features have evolved while others remained fixed during the evolution of the species. Our findings lay the foundation for the future development of target prediction tools that could be applied to “non-model” organisms for which minimal experimental data are available. Availability and implementation The code is freely available at https://github.com/gbenor/TPVOD.

Download Full-text

Probabilistic Random Forest Improves Bioactivity Predictions Close to the Classification Threshold by Taking into Account Experimental Uncertainty

10.26434/chemrxiv.14544291 ◽

2021 ◽

Author(s):

Lewis Mervin ◽

Maria-Anna Trapotsi ◽

Avid M. Afzal ◽

Ian Barrett ◽

Andreas Bender ◽

...

Keyword(s):

Random Forest ◽

Experimental Error ◽

Prediction Models ◽

Binary Classification ◽

Target Prediction ◽

Training Data ◽

Substantial Part ◽

Decision Threshold ◽

Experimental Uncertainty ◽

Activity Data

In the context of small molecule property prediction, experimental errors are usually a neglected aspect during model generation. The main caveat to binary classification approaches is that they weight minority cases close to the threshold boundary equivalently in distinguishing between activity classes. For example, a pXC50 activity value of 5.1 or 4.9 are treated equally important in contributing to the opposing activity (e.g., classification threshold of 5), even though experimental error may not afford such discriminatory accuracy. This is detrimental in practice and therefore it is equally important to evaluate the presence of experimental error in databases and apply methodologies to account for variability in experiments and uncertainty near the decision boundary. In order to improve upon this, we herein present a novel approach toward predicting protein-ligand interactions using a Probabilistic Random Forest (PRF) classifier. The PRF comprises a modification to the long-established Random Forest (RF), to take into account uncertainties in the assigned classes (i.e., activity labels). This enables representing the activity in a framework in-between the classification and regression architecture, with philosophical differences from either approach. Compared to classification, this approach enables better representation of factors increasing/decreasing inactivity. Conversely, one can utilize all data (even delimited/operand/censored data far from a cut-off) at the same time as taking into account the granularity around the cut-off, compared to a classical regression framework. The algorithm was applied toward ~550 target prediction tasks from ChEMBL and PubChem. The largest benefit in incorporating the experimental deviation in PRF was observed for data points close to the binary threshold boundary, when such information is not considered in any way in the original RF algorithm. In comparison, the baseline RF outperformed PRF for cases with high confidence to belong to the active class (far from the binary decision threshold). The RF models gave errors smaller than the experimental uncertainty, which could indicate that they are overtrained and/or over-confident. Overall, we show that PRF can be useful for target prediction models in particular for data where class boundaries overlap with the measurement uncertainty, and where a substantial part of the training data is located close to the classification threshold. With this approach, we present, to our knowledge, for the first time an application of probabilistic modelling of activity data for target prediction using the PRF algorithm.

Download Full-text

A Comparison of Scaling Methods to Obtain Calibrated Probabilities of Activity for Ligand-Target Predictions

10.26434/chemrxiv.11526132 ◽

2020 ◽

Author(s):

Lewis Mervin ◽

Avid M. Afzal ◽

Ola Engkvist ◽

Andreas Bender

Keyword(s):

Target Prediction ◽

Support Vector ◽

Machine Learning Method ◽

Learning Method ◽

Protein Target ◽

Bioactivity Prediction ◽

Vector Machines ◽

Scaling Methods ◽

Data Points ◽

Compound Target

In the context of bioactivity prediction, the question of how to calibrate a score produced by a machine learning method into reliable probability of binding to a protein target is not yet satisfactorily addressed. In this study, we compared the performance of three such methods, namely Platt Scaling, Isotonic Regression and Venn-ABERS in calibrating prediction scores for ligand-target prediction comprising the Naïve Bayes, Support Vector Machines and Random Forest algorithms with bioactivity data available at AstraZeneca (40 million data points (compound-target pairs) across 2112 targets). Performance was assessed using Stratified Shuffle Split (SSS) and Leave 20% of Scaffolds Out (L20SO) validation.

Download Full-text

Identifying Physico-Chemical Laws from the Robotically Collected Data

10.26434/chemrxiv.8490149 ◽

2019 ◽

Author(s):

Liwei Cao ◽

Danilo Russo ◽

Vassilios S. Vassiliadis ◽

Alexei Lapkin

Keyword(s):

Experimental Data ◽

Numerical Models ◽

Predictor Variable ◽

Physical Models ◽

Training Data ◽

Mixed Integer ◽

Physico Chemical ◽

Data Points ◽

Future Work ◽

The Relationship

A mixed-integer nonlinear programming (MINLP) formulation for symbolic regression was proposed to identify physical models from noisy experimental data. The formulation was tested using numerical models and was found to be more efficient than the previous literature example with respect to the number of predictor variables and training data points. The globally optimal search was extended to identify physical models and to cope with noise in the experimental data predictor variable. The methodology was coupled with the collection of experimental data in an automated fashion, and was proven to be successful in identifying the correct physical models describing the relationship between the shear stress and shear rate for both Newtonian and non-Newtonian fluids, and simple kinetic laws of reactions. Future work will focus on addressing the limitations of the formulation presented in this work, by extending it to be able to address larger complex physical models.

Download Full-text

Recent Advances on the Semi-Supervised Learning for Long Non-Coding RNA-Protein Interactions Prediction: A Review

Protein and Peptide Letters ◽

10.2174/0929866526666191025104043 ◽

2020 ◽

Vol 27 (5) ◽

pp. 385-391

Author(s):

Lin Zhong ◽

Zhong Ming ◽

Guobo Xie ◽

Chunlong Fan ◽

Xue Piao

Keyword(s):

Supervised Learning ◽

Protein Interactions ◽

Computational Models ◽

Prediction Models ◽

Chromatin Modification ◽

Computational Prediction ◽

Human Diseases ◽

Future Research ◽

Non Coding Rna ◽

Long Non Coding Rna

: In recent years, more and more evidence indicates that long non-coding RNA (lncRNA) plays a significant role in the development of complex biological processes, especially in RNA progressing, chromatin modification, and cell differentiation, as well as many other processes. Surprisingly, lncRNA has an inseparable relationship with human diseases such as cancer. Therefore, only by knowing more about the function of lncRNA can we better solve the problems of human diseases. However, lncRNAs need to bind to proteins to perform their biomedical functions. So we can reveal the lncRNA function by studying the relationship between lncRNA and protein. But due to the limitations of traditional experiments, researchers often use computational prediction models to predict lncRNA protein interactions. In this review, we summarize several computational models of the lncRNA protein interactions prediction base on semi-supervised learning during the past two years, and introduce their advantages and shortcomings briefly. Finally, the future research directions of lncRNA protein interaction prediction are pointed out.

Download Full-text