V-Dock: Fast Generation of Novel Drug-Like Molecules Using Machine-Learning-Based Docking Score and Molecular Optimization

2021 ◽  
Vol 22 (21) ◽  
pp. 11635
Author(s):  
Jieun Choi ◽  
Juyong Lee

We propose a computational workflow to design novel drug-like molecules by combining the global optimization of molecular properties and protein-ligand docking with machine learning. Most existing methods depend heavily on experimental data, and many targets do not have sufficient data to train reliable activity prediction models. To overcome this limitation, protein-ligand docking calculations must be performed when only limited data are available. Such docking calculations during molecular generation require considerable computational time, preventing extensive exploration of the chemical space. To address this problem, we trained a machine-learning-based model that predicts the docking energy from SMILES to accelerate the molecular generation process. Docking scores could be accurately predicted using only a SMILES string. We combined this docking score prediction model with the global molecular property optimization approach, MolFinder, to find novel molecules exhibiting the desired properties with high predicted docking scores. We named this design approach V-dock. Using V-dock, we efficiently generated many novel molecules with high docking scores for a target protein, similarity to a reference molecule, and desirable drug-like and bespoke properties, such as QED. The predicted docking scores of the generated molecules were verified by correlating them with the actual docking scores.

2021 ◽  
Author(s):  
Jieun Choi ◽  
Juyong Lee

In this work, we propose a novel drug-like molecular design workflow that combines efficient global molecular property optimization, protein-ligand molecular docking, and machine learning. Computational drug design algorithms aim to find novel molecules that satisfy various drug-like properties and bind strongly to a target protein. To accomplish this goal, various computational molecular generation methods have been developed, driven by recent advances in deep learning and the growth of biological data. However, most existing methods depend heavily on experimental activity data, which are not available for many targets. Thus, when the amount of available activity data is limited, protein-ligand docking calculations should be used. However, performing docking calculations on the fly during molecular generation requires considerable computational resources. To address this problem, we used machine-learning models that predict docking energy to accelerate the molecular generation process. We combined this ML-assisted docking score prediction model with the efficient global molecular property optimization approach, MolFinder. We call this design approach V-dock. Using the V-dock approach, we quickly generated many molecules with high docking scores for a target protein and desirable drug-like and bespoke properties, such as similarity to a reference molecule.


ADMET & DMPK ◽  
2020 ◽  
Vol 8 (1) ◽  
pp. 29-77 ◽  
Author(s):  
Alex Avdeef

The accurate prediction of the solubility of drugs is still problematic. It was thought for a long time that shortfalls had been due to the lack of high-quality solubility data from the chemical space of drugs. This study considers the quality of solubility data, particularly of ionizable drugs. A database is described, comprising 6355 entries of intrinsic solubility for 3014 different molecules, drawing on 1325 citations. In an earlier publication, many factors affecting the quality of the measurement had been discussed, and suggestions were offered to improve ways of extracting more reliable information from legacy data. Many of the suggestions have been implemented in this study. By correcting solubility for ionization (i.e., deriving intrinsic solubility, S0) and by normalizing temperature (by transforming measurements performed in the range 10-50 °C to 25 °C), it can now be estimated that the average interlaboratory reproducibility is 0.17 log unit. Empirical methods to predict solubility at best have hovered around a root mean square error (RMSE) of 0.6 log unit. Three prediction methods are compared here: (a) Yalkowsky’s general solubility equation (GSE), (b) the Abraham solvation equation (ABSOLV), and (c) Random Forest regression (RFR) statistical machine learning. The latter two methods were trained using the new database. The RFR method outperforms the other two models, as anticipated. However, the ability to predict the solubility of drugs to the level of the quality of data is still out of reach. The data quality is not the limiting factor in prediction. The statistical machine learning methodologies are probably up to the task. Possibly what is missing are solubility data from a few sparsely covered regions of the chemical space of drugs (particularly of research compounds). Also, new descriptors that can better differentiate the factors affecting solubility between molecules could be critical for narrowing the gap between the accuracy of the prediction models and that of the experimental data.
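
The general solubility equation named above is commonly written as log S0 = 0.5 - 0.01 (mp - 25) - log P, with mp the melting point in °C and log P the octanol-water partition coefficient. A minimal Python sketch of that formula and of the RMSE metric follows; the function names and the example compound are illustrative assumptions, not values from the study.

```python
import math


def gse_log_s0(melting_point_c, log_p):
    """General solubility equation: log S0 = 0.5 - 0.01 * (mp - 25) - log P."""
    # The melting-point term is conventionally dropped for compounds
    # that are liquid at 25 °C.
    mp_term = max(melting_point_c - 25.0, 0.0)
    return 0.5 - 0.01 * mp_term - log_p


def rmse(predicted, observed):
    """Root mean square error between predicted and observed log S0 values."""
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical compound: melting point 125 °C, log P = 2.0
example_log_s0 = gse_log_s0(125.0, 2.0)
print(f"estimated log S0 = {example_log_s0:.2f}")
```

Comparing `rmse` for a set of such estimates against measured log S0 values gives the kind of figure the abstract quotes (about 0.6 log unit for empirical methods versus 0.17 log unit interlaboratory reproducibility).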


2021 ◽  
Author(s):  
Sarvesh Mehta ◽  
Siddhartha Laghuvarapu ◽  
Yashaswi Pathak ◽  
Aaftaab Sethi ◽  
Mallika Alvala ◽  
...  

In drug discovery applications, high-throughput virtual screening exercises are routinely performed to determine an initial set of candidate molecules referred to as "hits". In such an experiment, each molecule from a large small-molecule drug library is evaluated for a physical property such as the binding affinity (docking score) against a target receptor. In real-life drug discovery experiments, the drug libraries are extremely large but still a minor representation of the essentially infinite chemical space, and evaluating the physical property for each molecule in the library is not computationally feasible. In the current study, a novel machine learning framework, "MEMES", based on Bayesian optimization is proposed for efficient sampling of chemical space. The proposed framework is demonstrated to identify 90% of the top-1000 molecules from a molecular library of about 100 million molecules while calculating the docking score for only about 6% of the complete library. We believe that such a framework would tremendously help to reduce the computational hours and resources required not only in drug discovery but also in other areas that require such high-throughput experiments.
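
The sampling idea can be illustrated with a much simplified loop: evaluate a small random subset, fit a cheap surrogate, and spend the remaining evaluations on the molecules the surrogate ranks best. The sketch below is not the MEMES implementation and is not full Bayesian optimization (it has no uncertainty estimate or acquisition function); the one-dimensional library, `expensive_score`, and the nearest-neighbour surrogate are all hypothetical stand-ins.

```python
import random


def expensive_score(x):
    # Hypothetical stand-in for a docking calculation; lower is better.
    return (x - 0.7) ** 2


def predict(x, evaluated, k=3):
    # Surrogate: inverse-distance-weighted mean of the k nearest evaluated points.
    nearest = sorted(evaluated.items(), key=lambda item: abs(item[0] - x))[:k]
    weights = [1.0 / (abs(xe - x) + 1e-9) for xe, _ in nearest]
    return sum(w * s for w, (_, s) in zip(weights, nearest)) / sum(weights)


def guided_search(library, n_initial=20, batch_size=20, n_rounds=5, seed=0):
    rng = random.Random(seed)
    # Start from a random subset scored with the expensive function.
    evaluated = {x: expensive_score(x) for x in rng.sample(library, n_initial)}
    for _ in range(n_rounds):
        # Rank unevaluated items by surrogate prediction and score the best batch.
        candidates = [x for x in library if x not in evaluated]
        candidates.sort(key=lambda x: predict(x, evaluated))
        for x in candidates[:batch_size]:
            evaluated[x] = expensive_score(x)
    return evaluated


library = [i / 999 for i in range(1000)]  # toy one-descriptor "molecules"
evaluated = guided_search(library)
top_50 = set(sorted(library, key=expensive_score)[:50])
recall = len(top_50 & set(evaluated)) / len(top_50)
print(f"evaluated {len(evaluated)} of {len(library)}; top-50 recall = {recall:.2f}")
```

The quantity printed at the end mirrors the abstract's headline metric: the fraction of the true top-ranked items recovered after scoring only part of the library.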


2019 ◽  
Vol 141 (11) ◽  
Author(s):  
Matthew E. Lynch ◽  
Soumalya Sarkar ◽  
Kurt Maute

Recent advances in design optimization have significant potential to improve the function of mechanical components and systems. Coupled with additive manufacturing, topology optimization is one category of numerical methods used to produce algorithmically generated optimized designs that are making a difference in the mechanical design of hardware currently being introduced to the market. Unfortunately, many of these algorithms require extensive manual setup and control, particularly of tuning parameters that control algorithmic function and convergence. This paper introduces a framework based on machine learning approaches that recommends tuning parameters to a user in order to avoid the costly trial and error involved in manual tuning. The algorithm reads tuning parameters from a repository of prior problems, judged to be similar using a dissimilarity metric based on problem metadata, and refines them for the current problem using a Bayesian optimization approach. The approach is demonstrated for a simple topology optimization problem, first with the objective of achieving good topology optimization solution quality and then with the additional objective of finding an optimal “trade” between solution quality and required computational time. The goal is to reduce the total number of “wasted” tuning runs that would be required for purely manual tuning. With more development, the framework may ultimately be useful on an enterprise level for analysis and optimization problems and may enable these types of advanced approaches to be used more efficiently. Topology optimization is one example, but the framework is also applicable to other optimization problems, such as shape and sizing, and to high-fidelity physics-based analysis models.


Molecules ◽  
2020 ◽  
Vol 25 (9) ◽  
pp. 2198
Author(s):  
Ozren Jović ◽  
Tomislav Šmuc

Novel machine learning and molecular modelling filtering procedures for drug repurposing have been carried out for the recognition of the novel fungicide targets of Cyp51 and Erg2. Classification and regression approaches on molecular descriptors have been performed using stepwise multilinear regression (FS-MLR), uninformative-variable elimination partial-least square regression, and a non-linear method called Forward Stepwise Limited Correlation Random Forest (FS-LM-RF). Altogether, 112 prediction models from two different approaches have been built for the descriptor recognition of fungicide hit compounds. Aiming at the fungal targets of sterol biosynthesis in membranes, antifungal hit compounds have been selected for docking experiments from the Drugbank database using the Autodock4 molecular docking program. The results were verified by Gold Protein-Ligand Docking Software. The best-docked conformation, for each high-scored ligand considered, was submitted to quantum mechanics/molecular mechanics (QM/MM) gradient optimization with final single point calculations taking into account both the basis set superposition error and thermal corrections (with frequency calculations). Finally, seven Drugbank lead compounds were selected based on their high QM/MM scores for the Cyp51 target, and three were selected for the Erg2 target. These lead compounds could be recommended for further in vitro studies.


2019 ◽  
Author(s):  
Seoin Back ◽  
Kevin Tran ◽  
Zachary Ulissi

Developing active and stable oxygen evolution catalysts is key to enabling various future energy technologies, and the state-of-the-art catalysts are Ir-containing oxide materials. Understanding oxygen chemistry on oxide materials is significantly more complicated than studying transition metal catalysts for two reasons: the most stable surface coverage under reaction conditions is extremely important but difficult to understand without many detailed calculations, and there are many possible active sites and configurations on O* or OH* covered surfaces. We have developed an automated, high-throughput approach to solve this problem and predict OER overpotentials for arbitrary oxide surfaces. We demonstrate this for a number of previously unstudied IrO2 and IrO3 polymorphs and their facets. We discovered that low-index surfaces of IrO2 other than rutile (110) are more active than the most stable rutile (110), and we identified promising active sites of IrO2 and IrO3 that outperform rutile (110) by 0.2 V in theoretical overpotential. Based on findings from DFT calculations, we provide catalyst design strategies to improve the catalytic activity of Ir-based catalysts and demonstrate a machine learning model capable of predicting surface coverages and site activity. This work highlights the importance of investigating unexplored chemical space to design promising catalysts.


2019 ◽  
Author(s):  
Oskar Flygare ◽  
Jesper Enander ◽  
Erik Andersson ◽  
Brjánn Ljótsson ◽  
Volen Z Ivanov ◽  
...  

**Background:** Previous attempts to identify predictors of treatment outcomes in body dysmorphic disorder (BDD) have yielded inconsistent findings. One way to increase precision and clinical utility could be to use machine learning methods, which can incorporate multiple non-linear associations in prediction models. **Methods:** This study used a random forests machine learning approach to test if it is possible to reliably predict remission from BDD in a sample of 88 individuals that had received internet-delivered cognitive behavioral therapy for BDD. The random forest models were compared to traditional logistic regression analyses. **Results:** Random forests correctly identified 78% of participants as remitters or non-remitters at post-treatment. The accuracy of prediction was lower in subsequent follow-ups (68%, 66% and 61% correctly classified at 3-, 12- and 24-month follow-ups, respectively). Depressive symptoms, treatment credibility, working alliance, and initial severity of BDD were among the most important predictors at the beginning of treatment. By contrast, the logistic regression models did not identify consistent and strong predictors of remission from BDD. **Conclusions:** The results provide initial support for the clinical utility of machine learning approaches in the prediction of outcomes of patients with BDD. **Trial registration:** ClinicalTrials.gov ID: NCT02010619.


2020 ◽  
Author(s):  
Sina Faizollahzadeh Ardabili ◽  
Amir Mosavi ◽  
Pedram Ghamisi ◽  
Filip Ferdinand ◽  
Annamaria R. Varkonyi-Koczy ◽  
...  

Several outbreak prediction models for COVID-19 are being used by officials around the world to make informed decisions and enforce relevant control measures. Among the standard models for COVID-19 global pandemic prediction, simple epidemiological and statistical models have received more attention from authorities, and they are popular in the media. Due to a high level of uncertainty and a lack of essential data, standard models have shown low accuracy for long-term prediction. Although the literature includes several attempts to address this issue, the essential generalization and robustness abilities of existing models need to be improved. This paper presents a comparative analysis of machine learning and soft computing models to predict the COVID-19 outbreak as an alternative to SIR and SEIR models. Among a wide range of machine learning models investigated, two models showed promising results: the multi-layered perceptron (MLP) and the adaptive network-based fuzzy inference system (ANFIS). Based on the results reported here, and due to the highly complex nature of the COVID-19 outbreak and the variation in its behavior from nation to nation, this study suggests machine learning as an effective tool to model the outbreak. This paper provides an initial benchmarking to demonstrate the potential of machine learning for future research. The paper further suggests that real novelty in outbreak prediction can be realized through integrating machine learning and SEIR models.
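
For context, the SIR model mentioned above splits a population into susceptible, infected, and recovered compartments governed by dS/dt = -βSI/N, dI/dt = βSI/N - γI, and dR/dt = γI. The following is a minimal sketch using forward-Euler integration; the parameter values are illustrative assumptions and are not fitted to any COVID-19 data or taken from the paper.

```python
def simulate_sir(population, initial_infected, beta, gamma, days, dt=0.1):
    """Integrate the SIR equations with forward-Euler steps; returns daily (t, S, I, R)."""
    s = float(population - initial_infected)
    i = float(initial_infected)
    r = 0.0
    history = [(0.0, s, i, r)]
    steps_per_day = round(1 / dt)
    for day in range(1, days + 1):
        for _ in range(steps_per_day):
            new_infections = beta * s * i / population * dt  # S -> I
            new_recoveries = gamma * i * dt                  # I -> R
            s -= new_infections
            i += new_infections - new_recoveries
            r += new_recoveries
        history.append((float(day), s, i, r))
    return history


# Hypothetical parameters: beta = 0.3 per day, gamma = 0.1 per day
history = simulate_sir(population=1000, initial_infected=1, beta=0.3, gamma=0.1, days=100)
peak_day, _, peak_infected, _ = max(history, key=lambda row: row[2])
print(f"peak of about {peak_infected:.0f} infected on day {peak_day:.0f}")
```

An SEIR model adds an exposed compartment between S and I; the fixed rates β and γ in such compartment models are one reason they can struggle with long-term prediction when behaviour and interventions change.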


2019 ◽  
Vol 21 (9) ◽  
pp. 662-669 ◽  
Author(s):  
Junnan Zhao ◽  
Lu Zhu ◽  
Weineng Zhou ◽  
Lingfeng Yin ◽  
Yuchen Wang ◽  
...  

Background: Thrombin is the central protease of the vertebrate blood coagulation cascade, which is closely related to cardiovascular diseases. The inhibitory constant Ki is the most significant property of thrombin inhibitors. Method: This study was carried out to predict Ki values of thrombin inhibitors based on a large data set by using machine learning methods. Because it can find non-intuitive regularities in high-dimensional datasets, machine learning can be used to build effective predictive models. A total of 6554 descriptors for each compound were collected, and an efficient descriptor selection method was chosen to find the appropriate descriptors. Four different methods, including multiple linear regression (MLR), K Nearest Neighbors (KNN), Gradient Boosting Regression Tree (GBRT), and Support Vector Machine (SVM), were implemented to build prediction models with these selected descriptors. Results: The SVM model was the best among these methods, with R² = 0.84 and MSE = 0.55 for the training set and R² = 0.83 and MSE = 0.56 for the test set. Several validation methods, such as the y-randomization test and applicability domain evaluation, were adopted to assess the robustness and generalization ability of the model. The final model shows excellent stability and predictive ability and can be employed for rapid estimation of the inhibitory constant, which is helpful for designing novel thrombin inhibitors.
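
The reported metrics and the y-randomization check can be sketched in a few lines of Python. This is an illustration under stated assumptions only: a single-descriptor least-squares line stands in for the study's SVM model, and the data are synthetic. In y-randomization the response values are shuffled and the model is refitted; if the shuffled fits score far below the real fit, the original correlation is unlikely to be a chance artefact.

```python
import random


def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def y_randomization(x, y, n_rounds=100, seed=0):
    """Refit on shuffled responses and return the R² of each shuffled fit."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_rounds):
        shuffled = list(y)
        rng.shuffle(shuffled)
        slope, intercept = fit_line(x, shuffled)
        scores.append(r_squared(shuffled, [slope * xi + intercept for xi in x]))
    return scores


# Synthetic data: one descriptor with a small alternating perturbation
x = [float(i) for i in range(20)]
y = [2.0 * xi + 1.0 + 0.5 * (-1) ** i for i, xi in enumerate(x)]
slope, intercept = fit_line(x, y)
observed_r2 = r_squared(y, [slope * xi + intercept for xi in x])
random_r2 = y_randomization(x, y)
mean_random_r2 = sum(random_r2) / len(random_r2)
print(f"observed R2 = {observed_r2:.3f}, mean shuffled R2 = {mean_random_r2:.3f}")
```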

