Random Forest Refinement of Pairwise Potentials for Protein-ligand Decoy Detection

An accurate scoring function is expected to correctly select the most stable structure from a set of pose candidates. One can hypothesize that a scoring function’s ability to identify the most stable structure might be improved by emphasizing the most relevant atom pairwise interactions. However, it is hard to evaluate the relative importance of each atom pair using traditional means. With the introduction of machine learning methods, it has become possible to determine the relative importance of each atom pair present in a scoring function. In this work, we use the Random Forest (RF) method to refine a pair potential developed by our laboratory (GARF) by identifying relevant atom pairs that optimize the performance of the potential on our given task. Our goal is to construct a machine learning (ML) model that can accurately differentiate the native ligand binding pose from candidate poses using a potential refined by RF optimization. We successfully constructed RF models on an unbalanced data set with the ‘comparison’ concept, and the resultant RF models were tested on CASF-2013. In a comparison of the performance of our RF models against 29 scoring functions, we found our models outperformed the other scoring functions in predicting the native pose. In addition, we used two artificially designed potential models to address the importance of the GARF potential in the RF models: (1) a scrambled probability function set, which was obtained by mixing up atom pairs and probability functions in GARF, and (2) a uniform probability function set, which shares the same peak positions with GARF but has fixed peak heights. The accuracy comparison of RF models based on the scrambled, uniform, and original GARF potentials clearly showed that the peak positions in the GARF potential are important while the well depths are not.
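
The two control potentials described in this abstract can be illustrated with a small sketch. It assumes, purely for illustration, that each atom pair's probability function is summarized by a peak position and a peak height. The atom-pair names, the numbers, and the helper functions `scramble` and `make_uniform` are hypothetical and are not taken from GARF.

```python
import random

# Hypothetical toy potential: atom pair -> (peak position in angstroms, peak height).
# Real probability functions are full curves; two numbers stand in for them here.
toy_potential = {
    ("C", "C"): (3.8, 1.2),
    ("N", "O"): (2.9, 2.5),
    ("O", "O"): (2.7, 1.9),
    ("C", "N"): (3.6, 0.8),
}


def scramble(potential, seed=0):
    """Reassign the existing probability functions to atom pairs in shuffled order."""
    rng = random.Random(seed)
    pairs = list(potential)
    functions = list(potential.values())
    rng.shuffle(functions)
    return dict(zip(pairs, functions))


def make_uniform(potential, height=1.0):
    """Keep every peak position but give all atom pairs the same peak height."""
    return {pair: (position, height) for pair, (position, _) in potential.items()}


scrambled = scramble(toy_potential)
uniform = make_uniform(toy_potential)
```

In this toy form, the scrambled set keeps the same collection of functions but breaks their link to the atom pairs, while the uniform set keeps the link to peak positions and discards the height information.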

Learning from Docked Ligands: Ligand-Based Features Rescue Structure-Based Scoring Functions When Trained On Docked Poses

10.26434/chemrxiv.13637756 ◽

2021 ◽

Author(s):

Fergus Boyles ◽

Charlotte M Deane ◽

Garrett Morris

Keyword(s):

Machine Learning ◽

Ligand Binding ◽

Crystal Structures ◽

Binding Affinity ◽

Scoring Function ◽

Scoring Functions ◽

Data Set ◽

Core Sets ◽

Strong Performance

Machine learning scoring functions for protein-ligand binding affinity have been found to consistently outperform classical scoring functions when trained and tested on crystal structures of bound protein-ligand complexes. However, it is less clear how these methods perform when applied to docked poses of complexes. We explore how the use of docked, rather than crystallographic, poses for both training and testing affects the performance of machine learning scoring functions. Using the PDBbind Core Sets as benchmarks, we show that the performance of a structure-based machine learning scoring function trained and tested on docked poses is lower than that of the same scoring function trained and tested on crystallographic poses. We construct a hybrid scoring function by combining both structure-based and ligand-based features, and show that its ability to predict binding affinity using docked poses is comparable to that of purely structure-based scoring functions trained and tested on crystal poses. Despite strong performance on docked poses of the PDBbind Core Sets, we find that our hybrid scoring function fails to generalise to a new data set, demonstrating the need for improved scoring functions and additional validation benchmarks. Code and data to reproduce our results are available from https://github.com/oxpig/learning-from-docked-poses.

Learning from the Ligand: Using Ligand-Based Features to Improve Binding Affinity Prediction

10.26434/chemrxiv.8174525.v1 ◽

2019 ◽

Cited By ~ 1

Author(s):

Fergus Boyles ◽

Charlotte M Deane ◽

Garrett Morris

Keyword(s):

Machine Learning ◽

Random Forest ◽

Binding Affinity ◽

Pearson Correlation ◽

Scoring Function ◽

Scoring Functions ◽

Limited Information ◽

Ligand Complex ◽

Binding Affinity Prediction ◽

Affinity Prediction

Machine learning scoring functions for protein-ligand binding affinity prediction have been found to consistently outperform classical scoring functions. Structure-based scoring functions for universal affinity prediction typically use features describing interactions derived from the protein-ligand complex, with limited information about the chemical or topological properties of the ligand itself. We demonstrate that the performance of machine learning scoring functions is consistently improved by the inclusion of diverse ligand-based features. For example, a Random Forest combining the features of RF-Score v3 with RDKit molecular descriptors achieved Pearson correlation coefficients of up to 0.831, 0.785, and 0.821 on the PDBbind 2007, 2013, and 2016 core sets respectively, compared to 0.790, 0.737, and 0.797 when using the features of RF-Score v3 alone. Excluding proteins and/or ligands that are similar to those in the test sets from the training set has a significant effect on scoring function performance, but does not remove the predictive power of ligand-based features. Furthermore, a Random Forest using only ligand-based features is predictive at a level similar to classical scoring functions, and it appears to be predicting the mean binding affinity of a ligand for its protein targets.
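
The idea of combining structure-based and ligand-based features can be sketched as a simple concatenation of two feature vectors per complex. This is only an illustration: the PDB-style codes, the feature values, and the helper `combine_features` are hypothetical placeholders, not real RF-Score v3 features or RDKit descriptors.

```python
# Hypothetical per-complex feature tables keyed by a PDB-style code.
# The values are placeholders, not real RF-Score v3 features or RDKit descriptors.
structure_features = {
    "1abc": [12.0, 3.0, 0.0],
    "2xyz": [8.0, 5.0, 1.0],
    "3def": [4.0, 1.0, 2.0],
}
ligand_features = {
    "1abc": [310.4, 2.1],
    "2xyz": [452.9, 3.7],
}


def combine_features(structure, ligand):
    """Concatenate both feature vectors for every complex present in both tables."""
    shared = sorted(structure.keys() & ligand.keys())
    return {code: structure[code] + ligand[code] for code in shared}


combined = combine_features(structure_features, ligand_features)
for code, vector in combined.items():
    print(code, vector)
```

The combined vectors would then serve as the input rows for a regressor such as a Random Forest. Complexes missing from either table are dropped here, which is one possible design choice rather than the authors' documented procedure.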

Application of random forest regression to the calculation of gas-phase chemistry within the GEOS-Chem chemistry model v10

Geoscientific Model Development ◽

10.5194/gmd-12-1209-2019 ◽

2019 ◽

Vol 12 (3) ◽

pp. 1209-1225 ◽

Cited By ~ 15

Author(s):

Christoph A. Keller ◽

Mat J. Evans

Keyword(s):

Machine Learning ◽

Random Forest ◽

Gas Phase ◽

Atmospheric Chemistry ◽

Random Forest Regression ◽

Data Set ◽

Gas Phase Chemistry ◽

Chemical Conditions ◽

Phase Chemistry ◽

The Impact

Abstract. Atmospheric chemistry models are a central tool to study the impact of chemical constituents on the environment, vegetation and human health. These models are numerically intense, and previous attempts to reduce the numerical cost of chemistry solvers have not delivered transformative change. We show here the potential of a machine learning (in this case random forest regression) replacement for the gas-phase chemistry in atmospheric chemistry transport models. Our training data consist of 1 month (July 2013) of output of chemical conditions together with the model physical state, produced from the GEOS-Chem chemistry model v10. From this data set we train random forest regression models to predict the concentration of each transported species after the integrator, based on the physical and chemical conditions before the integrator. The choice of prediction type has a strong impact on the skill of the regression model. We find best results from predicting the change in concentration for long-lived species and the absolute concentration for short-lived species. We also find improvements from a simple implementation of chemical families (NOx = NO + NO2). We then implement the trained random forest predictors back into GEOS-Chem to replace the numerical integrator. The machine-learning-driven GEOS-Chem model compares well to the standard simulation. For ozone (O3), errors from using the random forests (compared to the reference simulation) grow slowly and after 5 days the normalized mean bias (NMB), root mean square error (RMSE) and R2 are 4.2 %, 35 % and 0.9, respectively; after 30 days the errors increase to 13 %, 67 % and 0.75, respectively. The biases become largest in remote areas such as the tropical Pacific where errors in the chemistry can accumulate with little balancing influence from emissions or deposition. Over polluted regions the model error is less than 10 % and has significant fidelity in following the time series of the full model. 
Modelled NOx shows similar features, with the most significant errors occurring in remote locations far from recent emissions. For other species such as inorganic bromine species and short-lived nitrogen species, errors become large, with NMB, RMSE and R2 reaching >2100 %, >400 % and <0.1, respectively. This proof-of-concept implementation takes 1.8 times more time than the direct integration of the differential equations, but optimization and software engineering should allow substantial increases in speed. We discuss potential improvements in the implementation, some of its advantages from both a software and hardware perspective, its limitations, and its applicability to operational air quality activities.
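
The choice of prediction type mentioned in this abstract (change in concentration for long-lived species, absolute concentration for short-lived species) can be sketched as follows, together with a normalized mean bias. The function names and numbers are hypothetical, and the NMB formula is a commonly used definition assumed here rather than one quoted from the paper.

```python
def training_target(before, after, long_lived):
    """Target for one species: change in concentration if long-lived, else the absolute value."""
    return after - before if long_lived else after


def concentration_from_prediction(before, prediction, long_lived):
    """Turn a predicted target back into the concentration after the integrator step."""
    return before + prediction if long_lived else prediction


def normalized_mean_bias(model, reference):
    """NMB = sum(model - reference) / sum(reference); an assumed, commonly used definition."""
    return sum(m - r for m, r in zip(model, reference)) / sum(reference)


# Hypothetical concentrations in arbitrary units.
long_lived_target = training_target(before=40.0, after=41.5, long_lived=True)
short_lived_target = training_target(before=1.0, after=0.25, long_lived=False)
restored = concentration_from_prediction(40.0, long_lived_target, long_lived=True)

# Chemical family treated as a single quantity: NOx = NO + NO2.
nox = 0.25 + 0.5

bias = normalized_mean_bias([42.0, 38.0], [40.0, 40.0])
```

Predicting a change keeps small relative updates of slowly varying species well resolved, whereas short-lived species are largely set by the current conditions, so the absolute value is the more natural target in this sketch.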

Estimation of Soil Cohesion Using Machine Learning Method: A Random Forest Approach

Advances in Civil Engineering ◽

10.1155/2021/8873993 ◽

2021 ◽

Vol 2021 ◽

pp. 1-14

Author(s):

Hai-Bang Ly ◽

Thuy-Anh Nguyen ◽

Binh Thai Pham

Keyword(s):

Machine Learning ◽

Random Forest ◽

Soil Properties ◽

Clay Content ◽

Absolute Error ◽

Experimental Methods ◽

Liquid Limit ◽

Support Vector ◽

Data Set ◽

Soil Cohesion

Soil cohesion (C) is one of the critical soil properties and is closely related to basic soil properties such as particle size distribution, pore size, and shear strength. It is mainly determined by experimental methods, which are often time-consuming and costly. Therefore, developing an alternative approach based on machine learning (ML) techniques to solve this problem is highly recommended. In this study, machine learning models, namely, support vector machine (SVM), Gaussian process regression (GPR), and random forest (RF), were built based on a data set of 145 soil samples collected from the Da Nang-Quang Ngai expressway project, Vietnam. The database includes six input parameters, that is, clay content, moisture content, liquid limit, plastic limit, specific gravity, and void ratio. The performance of the models was assessed by three statistical criteria, namely, the correlation coefficient (R), mean absolute error (MAE), and root mean square error (RMSE). The results demonstrated that the proposed RF model could predict soil cohesion with high accuracy (R = 0.891) and low error (RMSE = 3.323 and MAE = 2.511), and that its predictive capability is better than that of SVM and GPR. Therefore, the RF model can be used as a cost-effective approach to predicting soil cohesion forces used in the design and inspection of constructions.
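
The three statistical criteria named in this abstract can be computed as in the following minimal sketch. The observed and predicted values are hypothetical and are not taken from the study's 145 samples; R is taken to be the Pearson correlation coefficient, which is an assumption.

```python
import math


def pearson_r(observed, predicted):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_o = sum(observed) / len(observed)
    mean_p = sum(predicted) / len(predicted)
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_o * var_p)


def mean_absolute_error(observed, predicted):
    """Average of the absolute differences."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def root_mean_square_error(observed, predicted):
    """Square root of the average squared difference."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed))


# Hypothetical cohesion values (same units for both lists).
observed = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 39.0]

r = pearson_r(observed, predicted)
mae = mean_absolute_error(observed, predicted)
rmse = root_mean_square_error(observed, predicted)
```

MAE and RMSE carry the units of the predicted quantity, and RMSE weights large errors more heavily, which is why both are usually reported alongside R.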

Selecting Machine-Learning Scoring Functions for Structure-Based Virtual Screening

10.26434/chemrxiv.12967160 ◽

2020 ◽

Author(s):

Pedro Ballester

Keyword(s):

Machine Learning ◽

Drug Discovery ◽

Virtual Screening ◽

Predictive Accuracy ◽

Scoring Function ◽

3D Models ◽

Large Datasets ◽

Scoring Functions ◽

Discovery Process ◽

Drug Discovery Process

Interest in docking technologies has grown in parallel with the ever-increasing number and diversity of 3D models for macromolecular therapeutic targets. Structure-Based Virtual Screening (SBVS) aims at leveraging these experimental structures to discover the necessary starting points for the drug discovery process. It is now established that Machine Learning (ML) can strongly enhance the predictive accuracy of scoring functions for SBVS by exploiting large datasets from targets, molecules and their associations. However, with greater choice, the question of which ML-based scoring function is the most suitable for prospective use on a given target has gained importance. Here we analyse two approaches to selecting an existing scoring function for the target, along with a third approach consisting of generating a scoring function tailored to the target. These analyses required discussing the limitations of popular SBVS benchmarks, the alternatives for benchmarking scoring functions for SBVS, and how to generate or use them with freely available software.

Random Forest Refinement of the KECSA2 Knowledge-based Scoring Function for Protein Decoy Detection

10.26434/chemrxiv.7231058.v1 ◽

2018 ◽

Author(s):

Jun Pei ◽

Zheng Zheng ◽

Kenneth M. Merz Jr.

Keyword(s):

Random Forest ◽

Scoring Function ◽

Peak Height ◽

Data Sets ◽

Atom Pair ◽

Probability Functions ◽

Knowledge Based ◽

Pair Potentials ◽

Native Proteins ◽

Combined Data

In this work, via the use of the ‘comparison’ concept, Random Forest (RF) models were successfully generated using unbalanced data sets that assign different importance factors to atom pair potentials to enhance their ability to identify native proteins from decoy proteins. Individual and combined data sets consisting of twelve decoy sets were used to test the performance of the RF models. We find that RF models increase the recognition of native structures without affecting their ability to identify the best decoy structures. We also created models using scrambled atom types, which create physically unrealistic probability functions, in order to test the ability of the RF algorithm to create useful models based on scrambled input probability functions. From this test we find that we are unable to create models of similar quality to those based on the unscrambled probability functions. Next, we created uniform probability functions in which the peak positions are the same as in the original, but each interaction has the same peak height. Using these uniform potentials, we were able to recover models as good as the ones using the full potentials, suggesting that all that is important in these models are the experimental peak positions.

Prediction of Lung Cancer Risk using Random Forest Algorithm Based on Kaggle Data Set

International Journal of Recent Technology and Engineering - 2 ◽

10.35940/ijrte.f7879.038620 ◽

2020 ◽

Vol 8 (6) ◽

pp. 1623-1630

Keyword(s):

Machine Learning ◽

Lung Cancer ◽

Random Forest ◽

Naive Bayes ◽

Early Stage ◽

Naïve Bayes ◽

Training Data ◽

Random Forest Algorithm ◽

Data Set ◽

Wide Range

As huge amounts of data accumulate, extracting the required information from what is available has become a challenge. Machine learning contributes to various fields. The fast-growing population has led to the emergence of a wide range of diseases. This in turn has created the need for machine learning models that use patient datasets. Analysis of datasets from different sources shows that cancer is among the most hazardous diseases and may cause the death of the patient. The surveys conducted indicate that cancer can be nearly cured in its initial stages, whereas in later stages it may cause the death of the affected person. One of the major types of cancer is lung cancer. Its prediction depends heavily on past data, and it requires detection at an early stage. The proposed work uses a machine learning algorithm to group individuals' details into categories in order to predict, at an early stage, whether they are at risk of cancer. A random forest algorithm is implemented, reaching an efficiency of 97%, higher than that of KNN and Naive Bayes. Further, the KNN algorithm does not learn anything from the training data but uses it directly for classification, and Naive Bayes gives inaccurate predictions. The proposed system predicts the chance of lung cancer at three levels, namely low, medium, and high. Thus, mortality rates can be reduced significantly.
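
The remark that KNN does not learn a model but uses the training data directly can be illustrated with a minimal nearest-neighbour classifier over the three risk levels. The records, the two features, and the function `knn_predict` are hypothetical and are not taken from the Kaggle data set used in the study.

```python
import math
from collections import Counter

# Hypothetical records: (age, smoking score) -> risk level. Not the study's data.
training_data = [
    ((25.0, 1.0), "low"),
    ((30.0, 2.0), "low"),
    ((45.0, 5.0), "medium"),
    ((50.0, 6.0), "medium"),
    ((65.0, 8.0), "high"),
    ((70.0, 9.0), "high"),
]


def knn_predict(query, data, k=3):
    """Classify by majority vote among the k stored examples nearest to the query."""
    nearest = sorted(data, key=lambda item: math.dist(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


print(knn_predict((68.0, 8.5), training_data))
print(knn_predict((27.0, 1.5), training_data))
```

There is no training step: every prediction searches the stored examples, so the cost and the quality of a prediction both depend directly on the retained data. A random forest, by contrast, fits a set of decision trees once and reuses them for each prediction.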
