Best-First Beam Search

Decoding for many NLP tasks requires an effective heuristic algorithm for approximating exact search, because searching the full output space is often intractable or impractical. The default algorithm for this job is beam search, a pruned version of breadth-first search. Quite surprisingly, beam search often returns better results than exact inference due to a beneficial search bias for NLP tasks. In this work, we show that the standard implementation of beam search can be made up to 10x faster in practice. Our method assumes that the scoring function is monotonic in the sequence length, which allows us to safely prune, early on, hypotheses that cannot be in the final set of hypotheses. We devise effective monotonic approximations to popular nonmonotonic scoring functions, including length normalization and mutual information decoding. Lastly, we propose a memory-reduced variant of best-first beam search, which has a similar beneficial search bias in terms of downstream performance, but runs in a fraction of the time.
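The pruning idea in this abstract can be illustrated with a minimal sketch. It assumes a log-probability score, which can only decrease as a sequence grows, and it is a simplified illustration rather than the authors' algorithm. The names `step_logprobs`, `toy_model`, and the `"</s>"` end token are hypothetical.

```python
import heapq
import math


def best_first_beam_search(step_logprobs, beam_size, eos, max_len):
    """Return (tokens, score) of the first complete hypothesis popped.

    step_logprobs(prefix) must return {token: log_prob} with log_prob <= 0,
    so a hypothesis score never increases as the sequence gets longer.
    """
    # heapq is a min-heap, so scores are stored negated: best score pops first.
    heap = [(0.0, ())]
    popped_at_len = {}  # number of hypotheses already expanded per length
    while heap:
        neg_score, prefix = heapq.heappop(heap)
        n = len(prefix)
        # Beam constraint: expand at most beam_size hypotheses of each length.
        if popped_at_len.get(n, 0) >= beam_size:
            continue
        popped_at_len[n] = popped_at_len.get(n, 0) + 1
        finished = (n > 0 and prefix[-1] == eos) or n == max_len
        if finished:
            # With a monotonic score, nothing left in the heap can beat this.
            return prefix, -neg_score
        for token, logp in step_logprobs(prefix).items():
            heapq.heappush(heap, (neg_score - logp, prefix + (token,)))
    return None, -math.inf


def toy_model(prefix):
    """Hypothetical three-token model used only to exercise the search."""
    if not prefix:
        return {"a": math.log(0.6), "b": math.log(0.4)}
    if prefix[-1] == "a":
        return {"</s>": math.log(0.9), "b": math.log(0.1)}
    return {"</s>": math.log(0.5), "a": math.log(0.5)}


best, score = best_first_beam_search(toy_model, beam_size=2, eos="</s>", max_len=5)
```

In this toy run the hypothesis starting with `b` is never expanded, because a complete hypothesis with a higher score is popped first. That is the kind of saving a monotonic score makes possible.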


Random Forest Refinement of Pairwise Potentials for Protein-ligand Decoy Detection

10.26434/chemrxiv.8047820.v1 ◽

2019 ◽

Cited By ~ 1

Author(s):

Jun Pei ◽

Zheng Zheng ◽

Hyunji Kim ◽

Lin Song ◽

Sarah Walworth ◽

...

Keyword(s):

Machine Learning ◽

Random Forest ◽

Probability Function ◽

Pair Potential ◽

Scoring Function ◽

Stable Structure ◽

Scoring Functions ◽

Atom Pair ◽

Data Set ◽

Atom Pairs

An accurate scoring function is expected to correctly select the most stable structure from a set of pose candidates. One can hypothesize that a scoring function’s ability to identify the most stable structure might be improved by emphasizing the most relevant atom pairwise interactions. However, it is hard to evaluate the relative importance of each atom pair using traditional means. With the introduction of machine learning methods, it has become possible to determine the relative importance of each atom pair present in a scoring function. In this work, we use the Random Forest (RF) method to refine a pair potential developed by our laboratory (GARF6) by identifying relevant atom pairs that optimize the performance of the potential on our given task. Our goal is to construct a machine learning (ML) model that can accurately differentiate the native ligand binding pose from candidate poses using a potential refined by RF optimization. We successfully constructed RF models on an unbalanced data set with the ‘comparison’ concept, and the resultant RF models were tested on CASF-2013. In a comparison of the performance of our RF models against 29 scoring functions, we found our models outperformed the other scoring functions in predicting the native pose. In addition, we used two artificially designed potential models to address the importance of the GARF potential in the RF models: (1) a scrambled probability function set, which was obtained by mixing up atom pairs and probability functions in GARF, and (2) a uniform probability function set, which shares the same peak positions with GARF but has fixed peak heights. A comparison of the accuracy of RF models based on the scrambled, uniform, and original GARF potentials clearly showed that the peak positions in the GARF potential are important while the well depths are not.


ASFP (Artificial Intelligence based Scoring Function Platform): a web server for the development of customized scoring functions

Journal of Cheminformatics ◽

10.1186/s13321-021-00486-3 ◽

2021 ◽

Vol 13 (1) ◽

Author(s):

Xujun Zhang ◽

Chao Shen ◽

Xueying Guo ◽

Zhe Wang ◽

Gaoqi Weng ◽

...

Keyword(s):

High Efficiency ◽

Low Cost ◽

Pearson Correlation ◽

Scoring Function ◽

Web Server ◽

Scoring Functions ◽

Protein Ligand Interactions ◽

Prediction Module ◽

Ligand Interactions ◽

Benchmark Datasets

Virtual screening (VS) based on molecular docking has emerged as one of the mainstream technologies of drug discovery due to its low cost and high efficiency. However, the scoring functions (SFs) implemented in most docking programs are not always accurate enough, and improving their prediction accuracy is still a big challenge. Here, we propose an integrated platform called ASFP, a web server for the development of customized SFs for structure-based VS. There are three main modules in ASFP: (1) the descriptor generation module that can generate up to 3437 descriptors for the modelling of protein–ligand interactions; (2) the AI-based SF construction module that can establish target-specific SFs based on the pre-generated descriptors through three machine learning (ML) techniques; (3) the online prediction module that provides some well-constructed target-specific SFs for VS and an additional generic SF for binding affinity prediction. Our methodology has been validated on several benchmark datasets. The target-specific SFs can achieve an average ROC AUC of 0.973 across 32 targets, and the generic SF can achieve a Pearson correlation coefficient of 0.81 on the PDBbind version 2016 core set. To sum up, the ASFP server is a powerful tool for structure-based VS.
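The two metrics quoted in this abstract, ROC AUC and the Pearson correlation coefficient, can be computed as in the sketch below. It is a plain-Python illustration on made-up numbers, not the ASFP server's code, and the function names are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between predicted and experimental values."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def roc_auc(scores, labels):
    """ROC AUC as the fraction of (active, inactive) pairs ranked correctly.

    labels use 1 for actives and 0 for inactives; ties count as half a win.
    """
    actives = [s for s, label in zip(scores, labels) if label == 1]
    inactives = [s for s, label in zip(scores, labels) if label == 0]
    wins = sum(
        1.0 if a > d else 0.5 if a == d else 0.0
        for a in actives
        for d in inactives
    )
    return wins / (len(actives) * len(inactives))


# Made-up example values, for illustration only.
r = pearson_r([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
auc = roc_auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```

The pairwise AUC formulation is quadratic in the number of compounds. It is used here because it makes the meaning of the metric explicit, and a rank-based formulation would scale better on large screening libraries.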


Incorporating structural similarity into a scoring function to enhance the prediction of binding affinities

Journal of Cheminformatics ◽

10.1186/s13321-021-00493-4 ◽

2021 ◽

Vol 13 (1) ◽

Author(s):

Beihong Ji ◽

Xibing He ◽

Yuzhao Zhang ◽

Jingchen Zhai ◽

Viet Hoang Man ◽

...

Keyword(s):

Computational Cost ◽

Scoring Function ◽

Structural Similarity ◽

Scoring Functions ◽

Binding Affinities ◽

Autodock Vina ◽

Predictive Index ◽

Drug Lead ◽

Screening Performance ◽

Calibration Algorithm

In this study, we developed a novel algorithm to improve the screening performance of an arbitrary docking scoring function by recalibrating the docking score of a query compound based on its structural similarity with a set of training compounds, at negligible extra computational cost. Two popular docking methods, Glide and AutoDock Vina, were adopted as the original scoring functions to be processed with our new algorithm, and similar performance improvements were achieved. Predicted binding affinities were compared against experimental data from the ChEMBL and DUD-E databases. Eleven representative drug receptors from diverse drug target categories were used to evaluate the hybrid scoring function. The effects of four different fingerprints (FP2, FP3, FP4, and MACCS) and four different compound similarity effect (CSE) functions were explored. Encouragingly, the screening performance was significantly improved for all 11 drug targets, especially when CSE = S^4 (where S is the Tanimoto structural similarity) and the FP2 fingerprint were applied. The average predictive index (PI) values increased from 0.34 to 0.66 and from 0.39 to 0.71 for the Glide and AutoDock Vina scoring functions, respectively. To evaluate the performance of the calibration algorithm in drug lead identification, we also imposed an upper limit on the structural similarity to mimic the real scenario of screening diverse libraries, in which query ligands are general-purpose screening compounds that are not necessarily structurally similar to reference ligands. Encouragingly, we found our hybrid scoring function still outperformed the original docking scoring function. The hybrid scoring function was further evaluated using external datasets for two systems, and we found the PI values increased from 0.24 to 0.46 and from 0.14 to 0.42 for the A2AR and CFX systems, respectively. In conclusion, our calibration algorithm can significantly improve the virtual screening performance in both the drug lead optimization and identification phases with negligible computational cost.
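The similarity term used in this abstract can be sketched as follows. The sketch computes the Tanimoto similarity S between two fingerprints, represented here as sets of "on" bit positions, and then raises it to the fourth power. That power is an assumed reading of the abstract's best-performing CSE setting. The recalibration formula itself is not given in the abstract and is not reproduced, and the fingerprints below are made up.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity: shared on-bits divided by total distinct on-bits."""
    a, b = set(fp_a), set(fp_b)
    union = len(a | b)
    # Two empty fingerprints share nothing; treat their similarity as 0.
    return len(a & b) / union if union else 0.0


def cse_fourth_power(similarity):
    """Assumed form of the compound similarity effect: S raised to the 4th power."""
    return similarity ** 4


# Hypothetical fingerprints given as sets of on-bit indices.
query_fp = {1, 2, 3}
reference_fp = {2, 3, 4}
s = tanimoto(query_fp, reference_fp)
weight = cse_fourth_power(s)
```

Raising S to a high power shrinks the weight quickly as similarity drops. In this sketch, only reference compounds that closely resemble the query would contribute much to any score correction built on it.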


Learning from Docked Ligands: Ligand-Based Features Rescue Structure-Based Scoring Functions When Trained On Docked Poses

10.26434/chemrxiv.13637756 ◽

2021 ◽

Author(s):

Fergus Boyles ◽

Charlotte M Deane ◽

Garrett Morris

Keyword(s):

Machine Learning ◽

Ligand Binding ◽

Crystal Structures ◽

Binding Affinity ◽

Scoring Function ◽

Scoring Functions ◽

Data Set ◽

Core Sets ◽

Strong Performance

Machine learning scoring functions for protein-ligand binding affinity have been found to consistently outperform classical scoring functions when trained and tested on crystal structures of bound protein-ligand complexes. However, it is less clear how these methods perform when applied to docked poses of complexes. We explore how the use of docked, rather than crystallographic, poses for both training and testing affects the performance of machine learning scoring functions. Using the PDBbind Core Sets as benchmarks, we show that the performance of a structure-based machine learning scoring function trained and tested on docked poses is lower than that of the same scoring function trained and tested on crystallographic poses. We construct a hybrid scoring function by combining both structure-based and ligand-based features, and show that its ability to predict binding affinity using docked poses is comparable to that of purely structure-based scoring functions trained and tested on crystal poses. Despite strong performance on docked poses of the PDBbind Core Sets, we find that our hybrid scoring function fails to generalise to a new data set, demonstrating the need for improved scoring functions and additional validation benchmarks. Code and data to reproduce our results are available from https://github.com/oxpig/learning-from-docked-poses.


Selecting Machine-Learning Scoring Functions for Structure-Based Virtual Screening

10.26434/chemrxiv.12967160 ◽

2020 ◽

Author(s):

Pedro Ballester

Keyword(s):

Machine Learning ◽

Drug Discovery ◽

Virtual Screening ◽

Predictive Accuracy ◽

Scoring Function ◽

3D Models ◽

Large Datasets ◽

Scoring Functions ◽

Discovery Process ◽

Drug Discovery Process

Interest in docking technologies has grown in parallel with the ever-increasing number and diversity of 3D models for macromolecular therapeutic targets. Structure-Based Virtual Screening (SBVS) aims at leveraging these experimental structures to discover the necessary starting points for the drug discovery process. It is now established that Machine Learning (ML) can strongly enhance the predictive accuracy of scoring functions for SBVS by exploiting large datasets from targets, molecules and their associations. However, with greater choice, the question of which ML-based scoring function is the most suitable for prospective use on a given target has gained importance. Here we analyse two approaches to selecting an existing scoring function for the target, along with a third approach that consists of generating a scoring function tailored to the target. These analyses required discussing the limitations of popular SBVS benchmarks, the alternative ways to benchmark scoring functions for SBVS, and how to generate or use them with freely available software.


Learning Bayesian Networks Based on a Mutual Information Scoring Function and EMI Method

Advances in Neural Networks – ISNN 2007 - Lecture Notes in Computer Science ◽

10.1007/978-3-540-72393-6_50 ◽

2007 ◽

pp. 414-423 ◽

Cited By ~ 1

Author(s):

Fengzhan Tian ◽

Haisheng Li ◽

Zhihai Wang ◽

Jian Yu

Keyword(s):

Mutual Information ◽

Bayesian Networks ◽

Scoring Function


Assessing Molecular Docking Tools to Guide Targeted Drug Discovery of CD38 Inhibitors

International Journal of Molecular Sciences ◽

10.3390/ijms21155183 ◽

2020 ◽

Vol 21 (15) ◽

pp. 5183 ◽

Cited By ~ 1

Author(s):

Eric D. Boittier ◽

Yat Yin Tang ◽

McKenna E. Buckley ◽

Zachariah P. Schuurs ◽

Derek J. Richard ◽

...

Keyword(s):

Scoring Function ◽

Pose Prediction ◽

Scoring Functions ◽

Molecular Fingerprints ◽

Biologically Relevant ◽

Protein Ligand Interactions ◽

Molecular Features ◽

Ligand Interactions ◽

Model Protein ◽

Scoring Accuracy

A promising protein target for computational drug development, the human cluster of differentiation 38 (CD38), plays a crucial role in many physiological and pathological processes, primarily through the upstream regulation of factors that control cytoplasmic Ca²⁺ concentrations. Recently, a small-molecule inhibitor of CD38 was shown to slow down pathways relating to aging and DNA damage. We examined the performance of seven docking programs for their ability to model protein-ligand interactions with CD38. A test set of twelve CD38 crystal structures, containing crystallized biologically relevant substrates, was used to assess pose prediction. The rankings for each program based on the median RMSD between the native and predicted poses were Vina, AD4 > PLANTS, Gold, Glide, Molegro > rDock. Forty-two compounds with known affinities were docked to assess the accuracy of the programs at affinity/ranking predictions. The rankings based on scoring power were: Vina, PLANTS > Glide, Gold > Molegro >> AutoDock 4 >> rDock. Out of the top four performing programs, Glide had the only scoring function that did not appear to show bias towards overpredicting the affinity of the ligand based on its size. Factors that affect the reliability of pose prediction and scoring are discussed. General limitations and known biases of scoring functions are examined, aided in part by using molecular fingerprints and Random Forest classifiers. This machine learning approach may be used to systematically diagnose molecular features that are correlated with poor scoring accuracy.
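The pose-prediction metric in this abstract, the median RMSD between native and predicted poses, can be sketched as below. The sketch assumes both poses list the same atoms in the same order and share the receptor's coordinate frame, so no superposition is performed. Symmetry-equivalent atoms, which docking tools often correct for, are ignored, and the coordinates are made up.

```python
import math
import statistics


def rmsd(pose_a, pose_b):
    """Root-mean-square deviation between two poses with matched atom order."""
    if len(pose_a) != len(pose_b) or not pose_a:
        raise ValueError("poses must be non-empty and have the same atom count")
    total = sum(math.dist(a, b) ** 2 for a, b in zip(pose_a, pose_b))
    return math.sqrt(total / len(pose_a))


def median_rmsd(native_poses, predicted_poses):
    """Median RMSD over a test set of (native, predicted) pose pairs."""
    return statistics.median(
        rmsd(native, predicted)
        for native, predicted in zip(native_poses, predicted_poses)
    )


# Hypothetical two-atom ligand shifted by 1 unit along z.
native = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
predicted = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
shift_rmsd = rmsd(native, predicted)
```

The median is less sensitive than the mean to a few badly failed dockings, which is one plausible reason to rank programs by it on a small test set.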


MTD-PLS and docking study for a series of substituted 2-phenylindole derivatives with oestrogenic activity

Chemical Papers ◽

10.2478/s11696-011-0040-3 ◽

2011 ◽

Vol 65 (4) ◽

Author(s):

Edward Seclaman ◽

Alina Bora ◽

Sorin Avram ◽

Zeno Simon ◽

Ludovic Kurunczi

Keyword(s):

Oestrogen Receptor ◽

Scoring Function ◽

Docking Study ◽

Scoring Functions ◽

X Ray Diffraction ◽

X Ray ◽

X Ray Crystallography ◽

Receptor Complexes ◽

Test Sets ◽

Latent Structures

A series of 36 substituted 2-phenylindoles was analysed using minimal topological difference-projections in latent structures variant (MTD-PLS) and molecular docking, using fast rigid exhaustive docking (FRED) and AutoDock Vina programs. For quantitative structure activity relationships (QSAR) validation, a sphere exclusion algorithm in the multi-dimensional descriptor space was used to construct training and test sets. Docking procedures were based on X-ray crystallography studies using the human alpha oestrogen receptor-17β-oestradiol complex. The ranking abilities of the different scoring functions of the FRED package were presented, and the most suitable scoring function (Chemgauss3) for the oestrogen receptor was chosen. Although the series studied contains only a limited number of compounds, the MTD-PLS method and the docking procedure provided coherent results in concordance with the X-ray diffraction data for different ligand-oestrogen receptor complexes.
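The sphere exclusion split mentioned in this abstract can be sketched in one common form: pick a compound as a training-set centre, send every remaining compound within a fixed radius of it to the test set, and repeat on what is left. The abstract does not say which variant, radius, or descriptors the authors used, so the function name, the first-come choice of centres, and the 2D points below are all assumptions.

```python
import math


def sphere_exclusion_split(points, radius):
    """Split point indices into (train, test) by greedy sphere exclusion."""
    remaining = list(range(len(points)))
    train, test = [], []
    while remaining:
        # Assumption: the next unassigned compound becomes a sphere centre.
        centre = remaining.pop(0)
        train.append(centre)
        inside = [
            i for i in remaining
            if math.dist(points[centre], points[i]) <= radius
        ]
        # Neighbours inside the sphere go to the test set and are excluded.
        test.extend(inside)
        remaining = [i for i in remaining if i not in inside]
    return train, test


# Hypothetical descriptor vectors (2D for readability).
descriptors = [(0.0, 0.0), (0.5, 0.0), (3.0, 0.0), (3.2, 0.0), (10.0, 10.0)]
train_idx, test_idx = sphere_exclusion_split(descriptors, radius=1.0)
```

In this form, every test compound lies within the chosen radius of at least one training compound, so the test set stays inside the region of descriptor space that the training set covers.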


Products of weighted logic programs

Theory and Practice of Logic Programming ◽

10.1017/s1471068410000529 ◽

2011 ◽

Vol 11 (2-3) ◽

pp. 263-296 ◽

Cited By ~ 1

Author(s):

SHAY B. COHEN ◽

ROBERT J. SIMMONS ◽

NOAH A. SMITH

Keyword(s):

Machine Learning ◽

Dynamic Programming ◽

Logic Programming ◽

Scoring Function ◽

Logic Programs ◽

Information Theoretic ◽

Optimal Score ◽

Leibler Divergence ◽

Programming Algorithms ◽

Output Space

Weighted logic programming, a generalization of bottom-up logic programming, is a well-suited framework for specifying dynamic programming algorithms. In this setting, proofs correspond to the algorithm's output space, such as a path through a graph or a grammatical derivation, and are given a real-valued score (often interpreted as a probability) that depends on the real weights of the base axioms used in the proof. The desired output is a function over all possible proofs, such as a sum of scores or an optimal score. We describe the product transformation, which can merge two weighted logic programs into a new one. The resulting program optimizes a product of proof scores from the original programs, constituting a scoring function known in machine learning as a “product of experts.” Through the addition of intuitive constraining side conditions, we show that several important dynamic programming algorithms can be derived by applying product to weighted logic programs corresponding to simpler weighted logic programs. In addition, we show how the computation of Kullback–Leibler divergence, an information-theoretic measure, can be interpreted using product.
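Two quantities named in this abstract, a product-of-experts score and the Kullback–Leibler divergence, can be illustrated on small explicit distributions. This sketch works on Python dictionaries over a shared outcome set. It does not implement weighted logic programs or the paper's product transformation, and the example distributions are made up.

```python
import math


def product_of_experts(p, q):
    """Multiply two distributions pointwise and renormalise the result."""
    unnormalised = {x: p[x] * q[x] for x in p if x in q}
    z = sum(unnormalised.values())
    return {x: value / z for x, value in unnormalised.items()}


def kl_divergence(p, q):
    """KL(p || q) in nats; assumes q[x] > 0 wherever p[x] > 0."""
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)


# Hypothetical two-outcome distributions.
uniform = {"a": 0.5, "b": 0.5}
skewed = {"a": 0.9, "b": 0.1}
combined = product_of_experts(uniform, skewed)
divergence = kl_divergence(uniform, skewed)
```

Multiplying by a uniform expert leaves the other expert's distribution unchanged after renormalisation, which gives a quick sanity check for the product.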


CSCORE: A SIMPLE YET EFFECTIVE SCORING FUNCTION FOR PROTEIN–LIGAND BINDING AFFINITY PREDICTION USING MODIFIED CMAC LEARNING ARCHITECTURE

Journal of Bioinformatics and Computational Biology ◽

10.1142/s021972001100577x ◽

2011 ◽

Vol 09 (supp01) ◽

pp. 1-14 ◽

Cited By ~ 20

Author(s):

XUCHANG OUYANG ◽

STEPHANUS DANIEL HANDOKO ◽

CHEE KEONG KWOH

Keyword(s):

Binding Affinity ◽

Scoring Function ◽

Binding Mode ◽

Computational Method ◽

Data Driven ◽

Machine Learning Techniques ◽

Ligand Docking ◽

Scoring Functions ◽

Binding Affinity Prediction ◽

Affinity Prediction

Protein–ligand docking is a computational method to identify the binding mode of a ligand and a target protein, and to predict the corresponding binding affinity using a scoring function. This method has great value in drug design. After decades of development, scoring functions can now typically identify the true binding mode, but the prediction of binding affinity remains a major problem. Here we present CScore, a data-driven scoring function using a modified Cerebellar Model Articulation Controller (CMAC) learning architecture, for accurate binding affinity prediction. The performance of CScore in terms of correlation between predicted and experimental binding affinities is benchmarked under different validation approaches. CScore achieves a prediction with R = 0.7668 and RMSE = 1.4540 when tested on an independent dataset. To the best of our knowledge, this result outperforms other scoring functions tested on the same dataset. The performance of CScore varies across clusters under the leave-cluster-out validation approach, but it still achieves competitive results. Lastly, the target-specified CScore achieves an even better result with R = 0.8237 and RMSE = 1.0872, trained on a much smaller but more relevant dataset for each target. The large dataset of structural information on protein–ligand complexes and advances in machine learning techniques enable the data-driven approach to binding affinity prediction. CScore is capable of accurate binding affinity prediction. It is also shown that CScore performs better if sufficient and relevant data are presented. As publicly available structural data grow, further improvement of this scoring scheme can be expected.
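The leave-cluster-out validation mentioned in this abstract can be sketched as a fold generator: each cluster of complexes is held out in turn as the test set, and the model is trained on the remaining clusters. How CScore defines its clusters is not stated in the abstract, so the cluster labels below are hypothetical.

```python
def leave_cluster_out(cluster_ids):
    """Build (held_out_cluster, train_indices, test_indices) folds.

    cluster_ids[i] is the cluster label of sample i, for example a protein
    family. Every sample of the held-out cluster lands in the test set.
    """
    folds = []
    for held_out in sorted(set(cluster_ids)):
        test = [i for i, c in enumerate(cluster_ids) if c == held_out]
        train = [i for i, c in enumerate(cluster_ids) if c != held_out]
        folds.append((held_out, train, test))
    return folds


# Hypothetical cluster labels for four complexes.
clusters = ["A", "B", "A", "C"]
folds = leave_cluster_out(clusters)
```

Because no member of the held-out cluster is seen during training, this split probes how well a scoring function transfers to unfamiliar targets. That is consistent with the abstract's remark that performance varies from cluster to cluster.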
