Machine learning classification can reduce false positives in structure-based virtual screening

2020 ◽  
Vol 117 (31) ◽  
pp. 18477-18488 ◽  
Author(s):  
Yusuf O. Adeshina ◽  
Eric J. Deeds ◽  
John Karanicolas

With the recent explosion in the size of libraries available for screening, virtual screening is positioned to assume a more prominent role in early drug discovery’s search for active chemical matter. In typical virtual screens, however, only about 12% of the top-scoring compounds actually show activity when tested in biochemical assays. We argue that most scoring functions used for this task have been developed with insufficient thoughtfulness into the datasets on which they are trained and tested, leading to overly simplistic models and/or overtraining. These problems are compounded in the literature because studies reporting new scoring methods have not validated their models prospectively within the same study. Here, we report a strategy for building a training dataset (D-COID) that aims to generate highly compelling decoy complexes that are individually matched to available active complexes. Using this dataset, we train a general-purpose classifier for virtual screening (vScreenML) that is built on the XGBoost framework. In retrospective benchmarks, our classifier shows outstanding performance relative to other scoring functions. In a prospective context, nearly all candidate inhibitors from a screen against acetylcholinesterase show detectable activity; beyond this, 10 of 23 compounds have an IC50 better than 50 μM. Without any medicinal chemistry optimization, the most potent hit has an IC50 of 280 nM, corresponding to a Ki of 173 nM. These results support using the D-COID strategy for training classifiers in other computational biology tasks, and for vScreenML in virtual screening campaigns against other protein targets. Both D-COID and vScreenML are freely distributed to facilitate such efforts.
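
The abstract reports both an IC50 (280 nM) and a Ki (173 nM) for the most potent hit but does not say how one was derived from the other. A minimal sketch, assuming competitive inhibition and the Cheng-Prusoff relation Ki = IC50 / (1 + [S]/Km), shows which substrate-to-Km ratio those two numbers would imply. The function names are illustrative, and the actual assay conditions are not given in the abstract.

```python
def cheng_prusoff_ki(ic50_nm: float, substrate_over_km: float) -> float:
    """Ki for a competitive inhibitor: Ki = IC50 / (1 + [S]/Km)."""
    if substrate_over_km < 0:
        raise ValueError("[S]/Km must be non-negative")
    return ic50_nm / (1.0 + substrate_over_km)


def implied_substrate_over_km(ic50_nm: float, ki_nm: float) -> float:
    """Invert the relation: [S]/Km = IC50/Ki - 1."""
    return ic50_nm / ki_nm - 1.0


# Values reported in the abstract for the most potent hit.
# The competitive-inhibition model itself is an assumption.
ratio = implied_substrate_over_km(280.0, 173.0)
ki_check = cheng_prusoff_ki(280.0, ratio)
print(f"implied [S]/Km = {ratio:.3f}, Ki = {ki_check:.0f} nM")
```

Under this assumption the reported pair of values is consistent with a substrate concentration of roughly 0.6 times Km.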

Author(s):  
Yusuf Adeshina ◽  
Eric Deeds ◽  
John Karanicolas

With the recent explosion in the size of libraries available for screening, virtual screening is positioned to assume a more prominent role in early drug discovery’s search for active chemical matter. Modern virtual screening methods are still, however, plagued with high false positive rates: typically, only about 12% of the top-scoring compounds actually show activity when tested in biochemical assays. We argue that most scoring functions used for this task have been developed with insufficient thoughtfulness into the datasets on which they are trained and tested, leading to overly simplistic models and/or overtraining. These problems are compounded in the literature because none of the studies reporting new scoring methods have validated their model prospectively within the same study. Here, we report a new strategy for building a training dataset (D-COID) that aims to generate highly compelling decoy complexes that are individually matched to available active complexes. Using this dataset, we train a general-purpose classifier for virtual screening (vScreenML) that is built on the XGBoost framework of gradient-boosted decision trees. In retrospective benchmarks, our new classifier shows outstanding performance relative to other scoring functions. We additionally evaluate the classifier in a prospective context, by screening for new acetylcholinesterase inhibitors. Remarkably, we find that nearly all compounds selected by vScreenML show detectable activity at 50 µM, with 10 of 23 providing greater than 50% inhibition at this concentration. Without any medicinal chemistry optimization, the most potent hit from this initial screen has an IC50 of 280 nM, corresponding to a Ki value of 173 nM. These results support using the D-COID strategy for training classifiers in other computational biology tasks, and for vScreenML in virtual screening campaigns against other protein targets. Both D-COID and vScreenML are freely distributed to facilitate such efforts.


2020 ◽  
Vol 21 (14) ◽  
pp. 5152 ◽  
Author(s):  
Silvia Gervasoni ◽  
Giulio Vistoli ◽  
Carmine Talarico ◽  
Candida Manelfi ◽  
Andrea R. Beccari ◽  
...  

(1) Background: Virtual screening studies on the therapeutically relevant proteins of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) require a detailed characterization of their druggable binding sites and, more generally, convenient pocket mapping is a key step in structure-based in silico studies; (2) Methods: Along with a careful literature search on SARS-CoV-2 protein targets, the study presents a novel strategy for pocket mapping that combines pocket searches (as performed by the well-known FPocket tool) with docking searches (as performed by the PLANTS or AutoDock/Vina engines); the approach is implemented in the Pockets 2.0 plug-in for the VEGA ZZ suite of programs; (3) Results: The literature analysis identified 16 promising binding cavities within the SARS-CoV-2 proteins, and the approach proposed here recognized them with clearly better performance than pocket detection alone; and (4) Conclusions: Although the presented strategy requires more extensive validation, it proved successful in precisely characterizing a set of SARS-CoV-2 druggable binding pockets, including both orthosteric and allosteric sites, which are clearly amenable to virtual screening campaigns and drug repurposing studies. All results generated by the study and the Pockets 2.0 plug-in are available for download.


2021 ◽  
Author(s):  
Oscar Méndez-Lucio ◽  
Mazen Ahmad ◽  
Ehecatl Antonio del Rio-Chanona ◽  
Jörg Kurt Wegner

Understanding the interactions formed between a ligand and its molecular target is key to guiding the optimization of molecules. Different experimental and computational methods have been key to better understanding these intermolecular interactions. Herein, we report a method based on geometric deep learning that is capable of predicting the binding conformations of ligands to protein targets. Concretely, the model learns a statistical potential based on distance likelihood, which is tailor-made for each ligand-target pair. This potential can be coupled with global optimization algorithms to reproduce experimental binding conformations of ligands. We show that the potential based on distance likelihood described in this paper performs similarly to or better than well-established scoring functions for docking and screening tasks. Overall, this method represents an example of how artificial intelligence can be used to improve structure-based drug design.
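
One way to read "a statistical potential based on distance likelihood" is as a sum of negative log-likelihoods over ligand-protein atom-pair distances. The sketch below assumes each pair's distance distribution is a one-dimensional Gaussian mixture whose parameters would come from the trained network. The mixture form, the parameter values, and all names here are illustrative assumptions, not the authors' implementation.

```python
import math
from typing import Sequence, Tuple

# (weight, mean in angstroms, standard deviation in angstroms)
Component = Tuple[float, float, float]


def mixture_pdf(distance: float, components: Sequence[Component]) -> float:
    """Density of a 1-D Gaussian mixture evaluated at `distance`."""
    total = 0.0
    for weight, mean, sigma in components:
        z = (distance - mean) / sigma
        total += weight * math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))
    return total


def pose_score(distances: Sequence[float],
               pair_mixtures: Sequence[Sequence[Component]]) -> float:
    """Negative log-likelihood summed over atom pairs (lower is better)."""
    if len(distances) != len(pair_mixtures):
        raise ValueError("one mixture is required per atom pair")
    eps = 1e-300  # guards against log(0) for very unlikely distances
    return -sum(math.log(mixture_pdf(d, mix) + eps)
                for d, mix in zip(distances, pair_mixtures))


# Hypothetical per-pair mixtures, standing in for network output.
mixtures = [
    [(0.7, 3.0, 0.3), (0.3, 5.0, 0.8)],
    [(1.0, 4.0, 0.5)],
]
near_native = pose_score([3.1, 4.0], mixtures)
displaced = pose_score([6.5, 7.0], mixtures)
print(near_native < displaced)
```

A global optimizer would then search over ligand poses for the distances that minimize `pose_score`.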


2017 ◽  
Vol 22 (8) ◽  
pp. 995-1006 ◽  
Author(s):  
Dante A. Pertusi ◽  
Gregory O’Donnell ◽  
Michelle F. Homsher ◽  
Kelli Solly ◽  
Amita Patel ◽  
...  

High-throughput screening (HTS) is a widespread method in early drug discovery for identifying promising chemical matter that modulates a target or phenotype of interest. Because HTS campaigns involve screening millions of compounds, it is often desirable to initiate screening with a subset of the full collection. Subsequently, virtual screening methods prioritize likely active compounds in the remaining collection in an iterative process. With this approach, orthogonal virtual screening methods are often applied, necessitating the prioritization of hits from different approaches. Here, we introduce a novel method of fusing these prioritizations and benchmark it prospectively on 17 screening campaigns using virtual screening methods in three descriptor spaces. We found that the fusion approach retrieves 15% to 65% more active chemical series than any single machine-learning method and that appropriately weighting contributions of similarity and machine-learning scoring techniques can increase enrichment by 1% to 19%. We also use fusion scoring to evaluate the tradeoff between screening more chemical matter initially and running replicate samples to prevent false positives, and find that the former option leads to the retrieval of more active chemical series. These results represent guidelines that can increase the rate of identification of promising active compounds in future iterative screens.
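
The abstract does not describe the fusion method itself. As a generic illustration of combining prioritizations from orthogonal methods, the sketch below uses weighted reciprocal-rank fusion, a standard technique swapped in here as an assumption. The method names, compound IDs, weights, and the constant `k` are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def fuse_rankings(rankings: Dict[str, List[str]],
                  weights: Dict[str, float],
                  k: int = 60) -> List[str]:
    """Weighted reciprocal-rank fusion.

    Each method contributes weight / (k + rank) to every compound it
    ranks (rank starts at 1); compounds are returned best-first.
    """
    scores: Dict[str, float] = defaultdict(float)
    for method, ranked in rankings.items():
        w = weights.get(method, 1.0)
        for rank, compound in enumerate(ranked, start=1):
            scores[compound] += w / (k + rank)
    # Sort by fused score (descending), then by ID for deterministic ties.
    return sorted(scores, key=lambda c: (-scores[c], c))


# Hypothetical prioritizations from two orthogonal methods.
rankings = {
    "similarity": ["cpd_A", "cpd_B", "cpd_C", "cpd_D"],
    "ml_model": ["cpd_C", "cpd_A", "cpd_D", "cpd_B"],
}
weights = {"similarity": 1.0, "ml_model": 2.0}
fused = fuse_rankings(rankings, weights)
print(fused)
```

Changing the entries in `weights` is one simple way to explore how the balance between similarity-based and machine-learning scores affects the fused order.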


2020 ◽  
Vol 10 (1) ◽  
Author(s):  
Steven P. D. Harborne ◽  
Jannik Strauss ◽  
Jessica C. Boakes ◽  
Danielle L. Wright ◽  
James G. Henderson ◽  
...  

Identifying stabilising variants of membrane protein targets is often required for structure determination. Our new computational pipeline, the Integral Membrane Protein Stability Selector (IMPROvER), provides a rational approach to variant selection by employing three independent approaches: deep-sequence, model-based and data-driven. In silico tests using known stability data, and in vitro tests using three membrane protein targets with 7, 11 and 16 transmembrane helices provided measures of success. In vitro, individual approaches alone all identified stabilising variants at a rate better than expected by random selection. Low numbers of overlapping predictions between approaches meant a greater success rate was achieved (fourfold better than random) when approaches were combined and selections restricted to the highest ranked sites. The mix of information IMPROvER uses can be extracted for any helical membrane protein. We have developed the first general-purpose tool for selecting stabilising variants of α-helical membrane proteins, increasing efficiency and reducing workload. IMPROvER can be accessed at http://improver.ddns.net/IMPROvER/.


2018 ◽  
Vol 2018 ◽  
pp. 1-10 ◽  
Author(s):  
Jamal Shamsara

Rescoring is a simple approach that could, in theory, improve the original docking results. In this study, AutoDock Vina was used as the docking engine, and three other scoring functions besides the original Vina scoring function, as well as their combinations as consensus scoring functions, were employed to explore the effect of rescoring on virtual screenings performed on diverse targets. Rescoring by DrugScore produced the largest number of cases with significant changes in screening power. Thus, the DrugScore results were used to build a simple model based on two binding site descriptors that could predict possible improvement by DrugScore rescoring. Furthermore, the screening power of all rescoring approaches, as well as that of the original AutoDock Vina docking results, generally correlated with the Maximum Theoretical Shape Complementarity (MTSC) and the Maximum Distance from Center of Mass and all Alpha spheres (MDCMA). Therefore, it was suggested that, with a more complete set of binding site descriptors, it could be possible to find a robust relationship between binding site descriptors and response to certain molecular docking programs and scoring functions. The results could be helpful for future research aiming to perform virtual screening using AutoDock Vina and/or rescoring using DrugScore.
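
Screening power is commonly summarized by an enrichment factor: the fraction of actives recovered in the top-ranked slice divided by the fraction expected at random. The abstract does not state which metric the study used, so the sketch below is an assumption. The scores and labels are made up, and lower scores are taken as better, as with docking energies.

```python
from typing import Sequence


def enrichment_factor(scores: Sequence[float],
                      is_active: Sequence[bool],
                      top_fraction: float = 0.1) -> float:
    """EF = (actives in top slice / slice size) / (total actives / N)."""
    if len(scores) != len(is_active) or not scores:
        raise ValueError("scores and labels must be non-empty and aligned")
    if not 0.0 < top_fraction <= 1.0:
        raise ValueError("top_fraction must be in (0, 1]")
    total_actives = sum(is_active)
    if total_actives == 0:
        raise ValueError("at least one active is required")
    n = len(scores)
    n_top = max(1, round(n * top_fraction))
    # Lower (more negative) score = better predicted binding.
    order = sorted(range(n), key=lambda i: scores[i])
    hits = sum(is_active[i] for i in order[:n_top])
    return (hits / n_top) / (total_actives / n)


# Toy screen (invented data): 10 compounds, 2 actives.
original = [-9.1, -8.7, -8.5, -8.0, -7.9, -7.5, -7.2, -7.0, -6.8, -6.1]
rescored = [-7.0, -8.7, -8.5, -9.5, -7.9, -7.5, -7.2, -9.9, -6.8, -6.1]
labels = [False, False, False, True, False, False, False, True, False, False]

ef_original = enrichment_factor(original, labels, top_fraction=0.2)
ef_rescored = enrichment_factor(rescored, labels, top_fraction=0.2)
print(ef_original, ef_rescored)
```

Comparing the enrichment factor before and after rescoring, target by target, is one way to quantify whether rescoring changed screening power.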


2019 ◽  
Vol 24 (34) ◽  
pp. 4013-4022 ◽  
Author(s):  
Xiang Cheng ◽  
Xuan Xiao ◽  
Kuo-Chen Chou

Knowledge of protein subcellular localization is vitally important for both basic research and drug development. With the avalanche of protein sequences emerging in the post-genomic age, it is highly desirable to develop computational tools for identifying their subcellular localization in a timely and effective manner based on the sequence information alone. Recently, a predictor called “pLoc-mPlant” was developed for identifying the subcellular localization of plant proteins. Its performance is overwhelmingly better than that of the other predictors for the same purpose, particularly in dealing with multi-label systems in which some proteins, called “multiplex proteins”, may simultaneously occur in two or more subcellular locations. Although it is indeed a very powerful predictor, more efforts are definitely needed to further improve it. This is because pLoc-mPlant was trained on an extremely skewed dataset in which some subsets (i.e., the protein numbers for some subcellular locations) were more than 10 times larger than the others. Accordingly, it cannot avoid the bias caused by such an uneven training dataset. To overcome this bias, we have developed a new and bias-free predictor called pLoc_bal-mPlant by balancing the training dataset. Cross-validation tests on exactly the same experiment-confirmed dataset have indicated that the proposed new predictor is remarkably superior to pLoc-mPlant, the existing state-of-the-art predictor in identifying the subcellular localization of plant proteins. To maximize the convenience for the majority of experimental scientists, a user-friendly web-server for the new predictor has been established at http://www.jci-bioinfo.cn/pLoc_bal-mPlant/, by which users can easily get their desired results without the need to go through the detailed mathematics.
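
The abstract does not say how the training set was balanced. One simple option, shown purely as an assumption, is random oversampling: each under-represented location is resampled with replacement until every class matches the largest one. The location names and counts below are invented for illustration, and the sketch treats each protein as single-label, which ignores the multiplex proteins mentioned above.

```python
import random
from collections import Counter, defaultdict
from typing import Hashable, List, Sequence, Tuple

Sample = Tuple[str, Hashable]  # (protein identifier, subcellular location)


def oversample_to_balance(samples: Sequence[Sample], seed: int = 0) -> List[Sample]:
    """Duplicate minority-class samples at random until all classes
    have as many samples as the largest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample in samples:
        by_class[sample[1]].append(sample)
    target = max(len(group) for group in by_class.values())
    balanced: List[Sample] = []
    for group in by_class.values():
        balanced.extend(group)
        # Draw the shortfall with replacement from the same class.
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced


# Hypothetical skewed set: one location is 10x larger than another.
skewed = ([(f"P{i}", "chloroplast") for i in range(50)]
          + [(f"Q{i}", "nucleus") for i in range(20)]
          + [(f"R{i}", "peroxisome") for i in range(5)])
balanced = oversample_to_balance(skewed)
print(Counter(label for _, label in balanced))
```

Oversampling should be applied only to the training folds during cross-validation; duplicating samples before the split would leak copies into the test folds.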


2019 ◽  
Vol 15 (5) ◽  
pp. 472-485 ◽  
Author(s):  
Kuo-Chen Chou ◽  
Xiang Cheng ◽  
Xuan Xiao

Background/Objective: Information on protein subcellular localization is crucially important for both basic research and drug development. With the explosive growth of protein sequences discovered in the post-genomic age, there is high demand for powerful bioinformatics tools that can identify their subcellular localization in a timely and effective manner based on sequence information alone. Recently, a predictor called “pLoc-mEuk” was developed for identifying the subcellular localization of eukaryotic proteins. Its performance is overwhelmingly better than that of the other predictors for the same purpose, particularly in dealing with multi-label systems where many proteins, called “multiplex proteins”, may simultaneously occur in two or more subcellular locations. Although it is indeed a very powerful predictor, more efforts are definitely needed to further improve it. This is because pLoc-mEuk was trained on an extremely skewed dataset in which some subsets were about 200 times the size of others. Accordingly, it cannot avoid the bias caused by such an uneven training dataset. Methods: To alleviate this bias, we have developed a new predictor called pLoc_bal-mEuk by quasi-balancing the training dataset. Cross-validation tests on exactly the same experiment-confirmed dataset have indicated that the proposed new predictor is remarkably superior to pLoc-mEuk, the existing state-of-the-art predictor in identifying the subcellular localization of eukaryotic proteins. It has not escaped our notice that the quasi-balancing treatment can also be used to deal with many other biological systems. Results: To maximize the convenience for most experimental scientists, a user-friendly web-server for the new predictor has been established at http://www.jci-bioinfo.cn/pLoc_bal-mEuk/. Conclusion: It is anticipated that the pLoc_bal-mEuk predictor holds very high potential to become a useful high-throughput tool for identifying the subcellular localization of eukaryotic proteins, particularly for finding multi-target drugs, which is currently a very hot trend in drug development.


Molecules ◽  
2021 ◽  
Vol 26 (9) ◽  
pp. 2600
Author(s):  
Fábio G. Martins ◽  
André Melo ◽  
Sérgio F. Sousa

Biofilms are aggregates of microorganisms anchored to a surface and embedded in a self-produced matrix of extracellular polymeric substances and have been associated with 80% of all bacterial infections in humans. Because bacteria in biofilms are less amenable to antibiotic treatment, biofilms have been associated with the development of antibiotic resistance, a problem that calls for new therapeutic options and approaches. Interfering with quorum-sensing (QS), an important process of cell-to-cell communication by bacteria in biofilms, is a promising strategy to inhibit biofilm formation and development. Here we describe and apply an in silico protocol for identifying novel potential inhibitors of quorum-sensing, using CviR, the quorum-sensing receptor from Chromobacterium violaceum, as a model target. This in silico approach combines protein-ligand docking (with 7 different docking programs/scoring functions), receptor-based virtual screening, molecular dynamics simulations, and free energy calculations. Particular emphasis was placed on optimizing the ability to discriminate between active and inactive molecules in virtual screening tests using a target-specific training set. Overall, the optimized protocol was used to evaluate 66,461 molecules, including those in the ZINC/FDA-Approved database and the Mu.Ta.Lig Virtual Chemotheca. Multiple promising compounds were identified, yielding good prospects for future experimental validation and for drug repurposing towards QS inhibition.

