Remodelling structure-based drug design using machine learning

Author(s):  
Shubhankar Dutta ◽  
Kakoli Bose

To keep up with the pace of rapid discoveries in biomedicine, a plethora of research endeavors have been directed toward Rational Drug Development, which gradually gave way to Structure-Based Drug Design (SBDD). In the past few decades, SBDD has played a pivotal role in identifying novel drug-like molecules capable of altering the structures and/or functions of target macromolecules involved in different disease pathways and networks. Unfortunately, post-delivery drug failures due to adverse drug interactions have constrained the use of SBDD in biomedical applications. However, recent technological advancements, along with a parallel surge in clinical research, have led to the concomitant establishment of other powerful computational techniques such as Artificial Intelligence (AI) and Machine Learning (ML). These leading-edge tools, with their ability to successfully predict side-effects of a wide range of drugs, have steadily taken over the field of drug design. ML, a subset of AI, is a robust computational tool capable of data analysis and analytical model building with minimal human intervention. It is based on powerful algorithms that use large sets of ‘training data’ as inputs to predict new output values, improving iteratively through experience. In this review, along with a brief discussion on the evolution of the drug discovery process, we focus on the methodologies underpinning the technological advancements of machine learning. This review, with specific examples, also emphasises the substantial contributions of ML to the field of biomedicine, while exploring possibilities for future developments.
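
The supervised workflow sketched above, a model fit on labeled ‘training data’ and then used to predict outputs for unseen inputs, can be illustrated with a minimal, hedged Python example. The descriptors and the activity label below are synthetic placeholders, not data from the review.

```python
# Minimal sketch of supervised learning: fit on 'training data', predict on unseen inputs.
# The 8 "molecular descriptors" and the active/inactive label are synthetic placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))                        # hypothetical descriptor vectors
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)        # toy active/inactive label

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)                          # learn from the training data
print("held-out accuracy:", model.score(X_test, y_test))
```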

2019 ◽  
Author(s):  
Mohammad Rezaei ◽  
Yanjun Li ◽  
Xiaolin Li ◽  
Chenglong Li

Introduction: The ability to discriminate among ligands binding to the same protein target in terms of their relative binding affinity lies at the heart of structure-based drug design. Any improvement in the accuracy and reliability of binding affinity prediction methods decreases the discrepancy between experimental and computational results.
Objectives: The primary objectives were to find the most relevant features affecting binding affinity prediction, to minimize manual feature engineering, and to improve the reliability of binding affinity prediction using efficient deep learning models by tuning the model hyperparameters.
Methods: The binding site of each target protein was represented as a grid box around its bound ligand. Both binary and distance-dependent occupancies were examined for how an atom affects its neighboring voxels in this grid. A combination of features including ANOLEA, ligand elements, and Arpeggio atom types was used to represent the input. An efficient convolutional neural network (CNN) architecture, DeepAtom, was developed, trained, and tested on the PDBbind v2016 dataset. Additionally, an extended benchmark dataset was compiled to train and evaluate the models.
Results: The best DeepAtom model showed improved accuracy in binding affinity prediction on the PDBbind core subset (Pearson’s R = 0.83) and outperforms recent state-of-the-art models in this field. In addition, when the DeepAtom model was trained on our proposed benchmark dataset, it yielded a higher correlation compared to the baseline, which confirms the value of our model.
Conclusions: The promising results for the predicted binding affinities are expected to pave the way for embedding deep learning models in virtual screening and rational drug design.
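
As a rough illustration of the grid representation described in the Methods, the hedged sketch below voxelizes a handful of atoms into a box around a binding site using either a binary or a distance-dependent occupancy. The box size, resolution, and Gaussian-style decay are illustrative assumptions, not the authors' DeepAtom implementation.

```python
# Sketch of the two occupancy schemes: each atom contributes to nearby voxels of a grid
# centred on the binding site, either as a hard 0/1 value or as a smooth distance-dependent
# value. Grid size, resolution, and the decay function are illustrative assumptions.
import numpy as np

def voxelize(coords, radii, box_size=24.0, resolution=1.0, binary=False):
    n = int(box_size / resolution)
    grid = np.zeros((n, n, n))
    centers = (np.arange(n) + 0.5) * resolution - box_size / 2.0
    gx, gy, gz = np.meshgrid(centers, centers, centers, indexing="ij")
    for (x, y, z), r in zip(coords, radii):
        d = np.sqrt((gx - x) ** 2 + (gy - y) ** 2 + (gz - z) ** 2)
        if binary:
            grid = np.maximum(grid, (d <= r).astype(float))   # hard occupancy
        else:
            grid = np.maximum(grid, np.exp(-(d / r) ** 2))    # smooth, distance-dependent
    return grid

# two carbon-like atoms near the box centre
occ = voxelize(np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0]]), radii=[1.7, 1.7])
print(occ.shape, occ.max())
```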


2020 ◽  
Vol 26 (42) ◽  
pp. 7623-7640 ◽  
Author(s):  
Cheolhee Kim ◽  
Eunae Kim

Rational drug design is accomplished through the complementary use of structural biology and computational biology of the biological macromolecules involved in disease pathology. Most of the known theoretical approaches for drug design are based on knowledge of the biological targets to which the drug binds. This approach can be used to design drug molecules that restore the balance of a signaling pathway by inhibiting or stimulating biological targets, using molecular modeling procedures as well as molecular dynamics simulations. Type III receptor tyrosine kinases affect most fundamental cellular processes, including the cell cycle, cell migration, cell metabolism, and survival, as well as cell proliferation and differentiation. Many successful inhibitors obtained by rational drug design show that computational techniques can be combined to achieve synergistic effects.


2021 ◽  
Author(s):  
Mehrdad Gharib Shirangi ◽  
Roger Aragall ◽  
Reza Ettehadi ◽  
Roland May ◽  
Edward Furlong ◽  
...  

In this work, we present our advances in developing and applying digital twins for drilling fluids and associated wellbore phenomena during drilling operations. A drilling fluid digital twin is a series of interconnected models that incorporate the learning from past historical data across a wide range of operational settings to determine fluid properties in real-time operations. Among the several drilling fluid functionalities and operational parameters, we describe advancements to improve hole cleaning predictions and high-pressure high-temperature (HPHT) rheological properties monitoring. In the hole cleaning application, we consider the Clark and Bickham (1994) approach, which requires the prediction of the local fluid velocity above the cuttings bed as a function of operating conditions. We develop accurate computational fluid dynamics (CFD) models to capture the effects of rotation, eccentricity, and bed height on local fluid velocities above the cuttings bed. We then run 55,000 CFD simulations for a wide range of operational settings to generate training data for machine learning. For rheology monitoring, thousands of lab experiment records are collected as training data for machine learning. In this case, the HPHT rheological properties are determined from rheological measurements at the American Petroleum Institute (API) condition together with the fluid type and composition data. We compare the results of several machine learning algorithms applied to represent the CFD simulations (for the hole cleaning application) and the lab experiments (for monitoring HPHT rheological properties). A rotating cross-validation method is applied to ensure accurate and robust results. In both cases, models from the Gradient Boosting and Artificial Neural Network algorithms provided the highest accuracy (about 0.95 in terms of R-squared) on the test datasets. With the developments presented in this paper, hole cleaning calculations can be performed more accurately in real time, and the HPHT rheological properties of drilling fluids can be estimated at the rigsite before performing lab experiments. These contributions advance the digital transformation of drilling operations.
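
To make the surrogate-modelling step concrete, the hedged Python sketch below fits a gradient-boosting regressor on tabulated inputs (operational parameters mapped to a target fluid property) and checks it with standard K-fold cross-validation as a stand-in for the paper's rotating cross-validation. The feature names and synthetic data are assumptions for illustration, not the 55,000 CFD runs or lab records used in the study.

```python
# Fit a gradient-boosting surrogate on (operational parameters -> fluid property) records
# and score it with 5-fold cross-validation. Features and data are synthetic placeholders.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(1)
# hypothetical inputs: flow rate, pipe RPM, eccentricity, cuttings-bed height
X = rng.uniform(size=(2000, 4))
y = 2.0 * X[:, 0] + np.sin(3 * X[:, 1]) + 0.5 * X[:, 2] * X[:, 3] + 0.05 * rng.normal(size=2000)

model = GradientBoostingRegressor(n_estimators=300, max_depth=3, learning_rate=0.05)
scores = cross_val_score(model, X, y,
                         cv=KFold(n_splits=5, shuffle=True, random_state=1),
                         scoring="r2")
print("cross-validated R^2:", scores.mean())
```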


2020 ◽  
Author(s):  
Paul Francoeur ◽  
Tomohide Masuda ◽  
David R. Koes

One of the main challenges in drug discovery is predicting protein-ligand binding affinity. Recently, machine learning approaches have made substantial progress on this task. However, current methods of model evaluation are overly optimistic in measuring generalization to new targets, and no standard dataset of sufficient size exists to compare performance between models. We present a new dataset for structure-based machine learning, the CrossDocked2020 set, with 22.5 million poses of ligands docked into multiple similar binding pockets across the Protein Data Bank, and perform a comprehensive evaluation of grid-based convolutional neural network models on this dataset. We also demonstrate how the partitioning of the training and test data can impact the results of models trained with the PDBbind dataset, how performance improves by adding more, lower-quality training data, and how training with docked poses imparts pose sensitivity to the predicted affinity of a complex. Our best-performing model, an ensemble of 5 densely connected convolutional networks, achieves a root mean squared error of 1.42 and a Pearson R of 0.612 on the affinity prediction task, an AUC of 0.956 at binding pose classification, and a 68.4% accuracy at pose selection on the CrossDocked2020 set. By providing data splits for clustered cross-validation and the raw data for the CrossDocked2020 set, we establish the first standardized dataset for training machine learning models to recognize ligands in non-cognate target structures while also greatly expanding the number of poses available for training. To facilitate community adoption of this dataset for benchmarking protein-ligand binding affinity prediction, we provide our models, weights, and the CrossDocked2020 set at https://github.com/gnina/models.
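
For readers reproducing this kind of evaluation, the short sketch below computes the same metrics quoted above (root mean squared error and Pearson R for affinity regression, AUC for pose classification) on placeholder predictions; the numbers it prints come from random data, not CrossDocked2020 results.

```python
# Evaluation metrics for affinity regression (RMSE, Pearson R) and pose classification (AUC),
# applied to synthetic placeholder predictions.
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import mean_squared_error, roc_auc_score

rng = np.random.default_rng(2)
true_affinity = rng.normal(6.0, 1.5, size=200)           # e.g. pK values
pred_affinity = true_affinity + rng.normal(0, 1.4, size=200)

rmse = np.sqrt(mean_squared_error(true_affinity, pred_affinity))
r, _ = pearsonr(true_affinity, pred_affinity)

pose_label = rng.integers(0, 2, size=200)                # 1 = near-native pose
pose_score = pose_label + rng.normal(0, 0.7, size=200)   # placeholder classifier output
auc = roc_auc_score(pose_label, pose_score)

print(f"RMSE={rmse:.2f}  Pearson R={r:.2f}  pose AUC={auc:.2f}")
```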


2021 ◽  
Author(s):  
Haibin Di ◽  
Chakib Kada Kloucha ◽  
Cen Li ◽  
Aria Abubakar ◽  
Zhun Li ◽  
...  

Delineating seismic stratigraphic features and depositional facies is important to successful reservoir mapping and identification in the subsurface. Robust seismic stratigraphy interpretation is confronted with two major challenges. The first is to maximally automate the process, particularly given the increasing size of seismic data and the complexity of target stratigraphies, while the second is to efficiently incorporate available structures into stratigraphy model building. Machine learning, particularly the convolutional neural network (CNN), has been introduced to assist seismic stratigraphy interpretation through supervised learning. However, the small amount of available expert labels greatly restricts the performance of such supervised CNNs. Moreover, most of the existing CNN implementations are based only on amplitude, which fails to use necessary structural information such as faults for constraining the machine learning. To resolve both challenges, this paper presents a semi-supervised learning workflow for fault-guided seismic stratigraphy interpretation, which consists of two components. The first component is seismic feature engineering (SFE), which aims at learning the provided seismic and fault data through an unsupervised convolutional autoencoder (CAE), while the second is stratigraphy model building (SMB), which aims at building an optimal mapping function between the features extracted by the SFE CAE and the target stratigraphic labels provided by an experienced interpreter through a supervised CNN. The two components are connected by embedding the encoder of the SFE CAE into the SMB CNN, which forces the SMB learning to rely on features common to the entire study area rather than those present only in the limited training data; correspondingly, the risk of overfitting is greatly reduced. More innovatively, the fault constraint is introduced by customizing the SMB CNN with two output branches, one to match the target stratigraphies and the other to reconstruct the input fault, so that the fault continues to contribute to the SMB learning process. The performance of this fault-guided seismic stratigraphy interpretation is validated by an application to a real seismic dataset, and the machine prediction not only matches the manual interpretation accurately but also clearly illustrates the depositional process in the study area.
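
A hedged Keras sketch of the two-component architecture described above follows: a convolutional autoencoder learns features from unlabeled seismic-plus-fault patches, and its encoder is then embedded in a supervised CNN with two output branches, one predicting stratigraphy classes and one reconstructing the input fault. The patch size, channel count, layer widths, and number of facies classes are illustrative assumptions rather than the authors' actual network.

```python
# SFE: unsupervised convolutional autoencoder on 2-channel (amplitude + fault) patches.
# SMB: supervised CNN that reuses the trained encoder and has two output branches
# (stratigraphy classes and fault reconstruction). All sizes are illustrative assumptions.
from tensorflow.keras import layers, Model

def build_encoder(inp):
    x = layers.Conv2D(32, 3, activation="relu", padding="same")(inp)
    x = layers.MaxPooling2D(2)(x)
    x = layers.Conv2D(64, 3, activation="relu", padding="same")(x)
    return layers.MaxPooling2D(2)(x)

# --- SFE: convolutional autoencoder, fit on unlabeled patches ---
cae_in = layers.Input(shape=(64, 64, 2))
code = build_encoder(cae_in)
x = layers.Conv2DTranspose(64, 3, strides=2, activation="relu", padding="same")(code)
x = layers.Conv2DTranspose(32, 3, strides=2, activation="relu", padding="same")(x)
cae_out = layers.Conv2D(2, 3, activation="linear", padding="same")(x)
cae = Model(cae_in, cae_out)
cae.compile(optimizer="adam", loss="mse")

# --- SMB: supervised CNN built on the shared encoder, with two branches ---
encoder = Model(cae_in, code)                       # embed the (trained) SFE encoder
smb_in = layers.Input(shape=(64, 64, 2))
feat = encoder(smb_in)
strat = layers.Conv2DTranspose(32, 3, strides=4, activation="relu", padding="same")(feat)
strat_out = layers.Conv2D(6, 1, activation="softmax", name="stratigraphy")(strat)  # e.g. 6 facies
fault = layers.Conv2DTranspose(16, 3, strides=4, activation="relu", padding="same")(feat)
fault_out = layers.Conv2D(1, 1, activation="sigmoid", name="fault")(fault)
smb = Model(smb_in, [strat_out, fault_out])
smb.compile(optimizer="adam",
            loss={"stratigraphy": "sparse_categorical_crossentropy",
                  "fault": "binary_crossentropy"})
```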


Author(s):  
Khaled H. Barakat ◽  
Michael Houghton ◽  
D. Lorne Tyrrel ◽  
Jack A. Tuszynski

For the past three decades, rational drug design (RDD) has been developing as an innovative, rapid, and successful way to discover new drug candidates. Many strategies have been followed and several targets with diverse structures and different biological roles have been investigated. Despite the variety of computational tools available, they can broadly be divided into two major classes that can be adopted either separately or in combination. The first class involves structure-based drug design, used when the target's 3-dimensional structure is available or can be computationally generated using homology modeling. On the other hand, when only a set of active molecules is available and the structure of the target is unknown, ligand-based drug design tools are usually used. This review describes some recent advances in rational drug design, summarizes a number of their practical applications, and discusses both the advantages and shortcomings of the various techniques used.


Science ◽  
2021 ◽  
Vol 371 (6535) ◽  
pp. eabe8628
Author(s):  
Marshall Burke ◽  
Anne Driscoll ◽  
David B. Lobell ◽  
Stefano Ermon

Accurate and comprehensive measurements of a range of sustainable development outcomes are fundamental inputs into both research and policy. We synthesize the growing literature that uses satellite imagery to understand these outcomes, with a focus on approaches that combine imagery with machine learning. We quantify the paucity of ground data on key human-related outcomes and the growing abundance and improving resolution (spatial, temporal, and spectral) of satellite imagery. We then review recent machine learning approaches to model-building in the context of scarce and noisy training data, highlighting how this noise often leads to incorrect assessment of model performance. We quantify recent model performance across multiple sustainable development domains, discuss research and policy applications, explore constraints to future progress, and highlight research directions for the field.


2020 ◽  
Vol 8 (6) ◽  
pp. 1623-1630

As huge amounts of data accumulate, methods are needed to extract the required information from what is available, and machine learning contributes to this across various fields. The fast-growing population has caused the evolution of a wide range of diseases, which in turn created the need for machine learning models that use patients' datasets. Analyses of datasets from different sources indicate that cancer is among the most hazardous diseases and may cause the death of the patient. Surveys show that cancer can often be nearly cured when detected in its initial stages, whereas it may cause the death of an affected person in later stages. One of the major types of cancer is lung cancer, which strongly depends on past data and requires detection in the early stages. The recommended work is based on a machine learning algorithm that groups individual records into categories to predict whether a person is likely to be exposed to cancer at an early stage. A random forest algorithm is implemented, achieving an efficiency of 97%, higher than KNN and Naive Bayes. Further, the KNN algorithm does not learn a model from the training data but uses it directly for classification, and Naive Bayes yields less accurate predictions. The proposed system predicts the chance of lung cancer at three levels, namely low, medium, and high. Thus, mortality rates can be reduced significantly.
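
A hedged scikit-learn sketch of the classification setup described above follows: a random forest trained on patient-style records to predict a three-level risk (low, medium, high), compared against KNN and Naive Bayes. The four features and the synthetic records are placeholders, not the study's dataset, so the printed accuracies will not match the reported 97%.

```python
# Random forest vs. KNN vs. Naive Bayes on a three-level risk label (0=low, 1=medium, 2=high).
# The four "patient" features and the synthetic records are illustrative placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
# hypothetical features: age, smoking index, pollution exposure, chronic cough score
X = rng.uniform(size=(1000, 4))
y = np.digitize(X[:, 1] + 0.7 * X[:, 3] + 0.1 * rng.normal(size=1000), [0.6, 1.2])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=3)
for name, clf in [("Random forest", RandomForestClassifier(n_estimators=200, random_state=3)),
                  ("KNN", KNeighborsClassifier()),
                  ("Naive Bayes", GaussianNB())]:
    clf.fit(X_tr, y_tr)
    print(name, "accuracy:", round(clf.score(X_te, y_te), 3))
```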


Sentiment analysis using SNS data can capture the thoughts of a wide range of people. An analysis using SNS can therefore predict social problems and more accurately identify their complex causes. In addition, big data technology can process SNS information as it is generated in real time, allowing a wide range of opinions to be understood without delay, which can supplement traditional opinion surveys. The incumbent government mainly uses SNS to promote its policies; however, measures are needed to actively reflect SNS in the process of carrying out those policies. Therefore, this paper develops a sentiment classifier that can identify public feelings about climate change on SNS. To that end, based on a dictionary formulated on the theme of climate change, we collected climate-change-related SNS data for learning and tagged it with seven sentiments. Sentiment classifier models were then developed from the training data using machine learning. The analysis showed that the Bi-LSTM model performed better than the shallow models. It achieved the highest accuracy (85.10%) across the seven sentiment classes, outperforming traditional machine learning (Naive Bayes and SVM) by approximately 34.53%p and 7.14%p, respectively. These findings substantiate the applicability of the proposed Bi-LSTM-based sentiment classifier to the analysis of sentiments relevant to diverse climate change issues.
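
A hedged Keras sketch of a Bi-LSTM sentiment classifier of the kind described above is shown below: tokenized SNS posts are embedded, passed through a bidirectional LSTM, and classified into seven sentiment categories. The vocabulary size, sequence length, and layer widths are illustrative assumptions, not the study's configuration.

```python
# Bi-LSTM classifier for seven sentiment classes on tokenized text sequences.
# Vocabulary size, sequence length, and layer widths are illustrative assumptions.
from tensorflow.keras import layers, models

VOCAB_SIZE, MAX_LEN, NUM_SENTIMENTS = 20000, 60, 7

model = models.Sequential([
    layers.Input(shape=(MAX_LEN,)),                       # integer-encoded token IDs
    layers.Embedding(VOCAB_SIZE, 128),
    layers.Bidirectional(layers.LSTM(64)),
    layers.Dropout(0.3),
    layers.Dense(64, activation="relu"),
    layers.Dense(NUM_SENTIMENTS, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()
# model.fit(train_sequences, train_labels, validation_split=0.1, epochs=5)  # with tokenized SNS data
```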

