Synergy and Complementarity between Focused Machine Learning and Physics-Based Simulation in Affinity Prediction

Author(s):  
Ann E. Cleves ◽  
Stephen R. Johnson ◽  
Ajay N. Jain
2021 ◽  
Vol 7 (1) ◽  
Author(s):  
Taher Hajilounezhad ◽  
Rina Bao ◽  
Kannappan Palaniappan ◽  
Filiz Bunyak ◽  
Prasad Calyam ◽  
...  

Abstract Understanding and controlling the self-assembly of vertically oriented carbon nanotube (CNT) forests is essential for realizing their potential in myriad applications. The governing process–structure–property mechanisms are poorly understood, and the processing parameter space is far too vast to exhaustively explore experimentally. We overcome these limitations by using a physics-based simulation as a high-throughput virtual laboratory and image-based machine learning to relate CNT forest synthesis attributes to their mechanical performance. Using CNTNet, our image-based deep learning classifier module trained with synthetic imagery, combinations of CNT diameter, density, and population growth rate classes were labeled with an accuracy of >91%. The CNTNet regression module predicted CNT forest stiffness and buckling load properties with a lower root-mean-square error than that of a regression predictor based on CNT physical parameters. These results demonstrate that image-based machine learning trained using only simulated imagery can distinguish subtle CNT forest morphological features to predict physical material properties with high accuracy. CNTNet paves the way to incorporate scanning electron microscope imagery for high-throughput material discovery.


2020 ◽  
Author(s):  
Paul Francoeur ◽  
Tomohide Masuda ◽  
David R. Koes

One of the main challenges in drug discovery is predicting protein-ligand binding affinity. Recently, machine learning approaches have made substantial progress on this task. However, current methods of model evaluation are overly optimistic in measuring generalization to new targets, and no standard dataset of sufficient size exists for comparing performance between models. We present a new dataset for structure-based machine learning, the CrossDocked2020 set, with 22.5 million poses of ligands docked into multiple similar binding pockets across the Protein Data Bank, and perform a comprehensive evaluation of grid-based convolutional neural network models on this dataset. We also demonstrate how the partitioning of the training data and test data can impact the results of models trained with the PDBbind dataset, how performance improves by adding more, lower-quality training data, and how training with docked poses imparts pose sensitivity to the predicted affinity of a complex. Our best performing model, an ensemble of 5 densely connected convolutional networks, achieves a root mean squared error of 1.42 and Pearson R of 0.612 on the affinity prediction task, an AUC of 0.956 at binding pose classification, and a 68.4% accuracy at pose selection on the CrossDocked2020 set. By providing data splits for clustered cross-validation and the raw data for the CrossDocked2020 set, we establish the first standardized dataset for training machine learning models to recognize ligands in non-cognate target structures while also greatly expanding the number of poses available for training. To facilitate community adoption of this dataset for benchmarking protein-ligand binding affinity prediction, we provide our models, weights, and the CrossDocked2020 set at https://github.com/gnina/models.
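The two regression metrics quoted in this abstract, root mean squared error and Pearson R, can be sketched in a few lines of Python. This is a minimal illustration of the standard definitions, not the authors' evaluation code; the function names and the sample values are assumptions made up for the example.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between measured and predicted values."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Hypothetical affinities, invented for illustration only.
measured = [5.0, 6.5, 7.2, 8.1]
predicted = [5.4, 6.0, 7.5, 7.6]
print(f"RMSE = {rmse(measured, predicted):.3f}")
print(f"Pearson R = {pearson_r(measured, predicted):.3f}")
```

RMSE penalizes the absolute size of the errors, while Pearson R only measures how well predictions track the ranking and spread of the measurements, which is why the two are usually reported together.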


2020 ◽  
Vol 13 (1) ◽  
Author(s):  
Wajid Arshad Abbasi ◽  
Adiba Yaseen ◽  
Fahad Ul Hassan ◽  
Saiqa Andleeb ◽  
Fayyaz Ul Amir Afsar Minhas

Abstract Background: Determining binding affinity in protein-protein interactions is important in the discovery and design of novel therapeutics and in mutagenesis studies. Determining the binding affinity of proteins in the formation of protein complexes requires sophisticated, expensive and time-consuming experimentation, which can be replaced with computational methods. Most computational prediction techniques require protein structures, which limits their applicability to protein complexes with known structures. In this work, we explore sequence-based protein binding affinity prediction using machine learning. Method: We have used protein sequence information instead of protein structures, along with machine learning techniques, to accurately predict the protein binding affinity. Results: We present our findings that the true generalization performance of even the state-of-the-art sequence-only predictor is far from satisfactory and that the development of machine learning methods for binding affinity prediction with improved generalization performance is still an open problem. We have also proposed a novel sequence-based protein binding affinity predictor called ISLAND, which gives better accuracy than existing methods over the same validation set as well as on an external independent test dataset. A cloud-based webserver implementation of ISLAND and its Python code are available at https://sites.google.com/view/wajidarshad/software. Conclusion: This paper highlights the fact that the true generalization performance of even the state-of-the-art sequence-only predictor of binding affinity is far from satisfactory and that the development of effective and practical methods in this domain is still an open problem.



Author(s):  
Maksym Druchok ◽  
Dzvenymyra Yarish ◽  
Sofiya Garkot ◽  
Tymofii Nikolaienko ◽  
Oleksandr Gurbych
