MetaScore: A novel machine-learning based approach to improve traditional scoring functions for scoring protein-protein docking conformations

10.1101/2021.10.06.463442 ◽

2021 ◽

Author(s):

Yong Jung ◽

Cunliang Geng ◽

Alexandre M. J. J. Bonvin ◽

Li C Xue ◽

Vasant G Honavar

Keyword(s):

Machine Learning ◽

Protein Interactions ◽

Chemical Properties ◽

Protein Docking ◽

Structural Basis ◽

Large Set ◽

Scoring Functions ◽

Computational Docking ◽

3D Structures ◽

Over The Top

Protein-protein interactions play a ubiquitous role in biological function. Knowledge of the three-dimensional (3D) structures of the complexes they form is essential for understanding the structural basis of those interactions and how they orchestrate key cellular processes. Computational docking has become an indispensable alternative to the expensive and time-consuming experimental approaches for determining 3D structures of protein complexes. Despite recent progress, identifying near-native models from a large set of conformations sampled by docking (the so-called scoring problem) still has considerable room for improvement. We present here MetaScore, a new machine-learning based approach to improve the scoring of docked conformations. MetaScore utilizes a random forest (RF) classifier trained to distinguish near-native from non-native conformations using a rich set of features extracted from the respective protein-protein interfaces. These include physico-chemical properties, energy terms, interaction propensity-based features, geometric properties, interface topology features, evolutionary conservation and also scores produced by traditional scoring functions (SFs). MetaScore scores docked conformations by simply averaging the score produced by the RF classifier with that produced by any traditional SF. We demonstrate that (i) MetaScore consistently outperforms each of nine traditional SFs included in this work in terms of success rate and hit rate evaluated over the top 10 predicted conformations; and (ii) an ensemble method, MetaScore-Ensemble, which combines 10 variants of MetaScore obtained by combining the RF score with each of the traditional SFs, outperforms each of the MetaScore variants. We conclude that the performance of traditional SFs can be improved upon by judiciously leveraging machine learning.
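
The averaging step and the top-10 evaluation described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the min-max rescaling, the assumption that the traditional score is energy-like (lower is better), and all function names are hypothetical choices made here for illustration.

```python
from typing import Dict, List


def min_max(values: List[float]) -> List[float]:
    """Rescale values to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def combined_scores(rf_prob: List[float], sf_energy: List[float]) -> List[float]:
    """Average the RF probability with a rescaled traditional score.

    Assumption: the traditional score is energy-like (lower is better), so it
    is flipped after rescaling to make "higher is better" hold for both inputs.
    """
    sf_scaled = [1.0 - s for s in min_max(sf_energy)]
    return [(p + s) / 2.0 for p, s in zip(rf_prob, sf_scaled)]


def top_n_hits(scores: List[float], is_near_native: List[bool], n: int = 10) -> int:
    """Count near-native models among the n best-scored models of one case."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sum(is_near_native[i] for i in order[:n])


def success_rate(hits_per_case: Dict[str, int]) -> float:
    """Fraction of docking cases with at least one hit in the top n."""
    successes = sum(1 for hits in hits_per_case.values() if hits > 0)
    return successes / len(hits_per_case)


# Toy example with four docked models of a single (hypothetical) case.
rf = [0.9, 0.2, 0.6, 0.1]            # RF probability of being near-native
sf = [-50.0, -10.0, -30.0, -40.0]    # energy-like traditional score
labels = [True, False, False, False]
scores = combined_scores(rf, sf)
print(top_n_hits(scores, labels, n=1))
```

Rescaling before averaging matters because an RF probability lies in [0, 1] while traditional scores can span arbitrary ranges; without it the traditional score would dominate the mean.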

iScore: A novel graph kernel-based function for scoring protein-protein docking models

10.1101/498584 ◽

2018 ◽

Cited By ~ 2

Author(s):

Cunliang Geng ◽

Yong Jung ◽

Nicolas Renaud ◽

Vasant Honavar ◽

Alexandre M.J.J. Bonvin ◽

...

Keyword(s):

Protein Complexes ◽

Protein Docking ◽

Graph Representation ◽

Structural Basis ◽

Data Sets ◽

Scoring Functions ◽

Graph Kernel ◽

Computational Docking ◽

3D Structures ◽

Protein Interfaces

Abstract. Protein complexes play a central role in many aspects of biological function. Knowledge of the three-dimensional (3D) structures of protein complexes is critical for gaining insights into the structural basis of interactions and their roles in the biomolecular pathways that orchestrate key cellular processes. Because of the expense and effort associated with experimental determination of 3D structures of protein complexes, computational docking has evolved as a valuable tool to predict the 3D structures of biomolecular complexes. Despite recent progress, reliably distinguishing near-native docking conformations from a large number of candidate conformations, the so-called scoring problem, remains a major challenge. Here we present iScore, a novel approach to scoring docked conformations that combines HADDOCK energy terms with a score obtained using a graph representation of the protein-protein interfaces and a measure of evolutionary conservation. It achieves a scoring performance competitive with, or superior to, that of state-of-the-art scoring functions on independent data sets consisting of docking software-specific data sets and the CAPRI score set built from a wide variety of docking approaches. iScore ranks among the top scoring approaches on the CAPRI score set (13 targets) when compared with the 37 scoring groups in CAPRI. The results demonstrate the utility of combining evolutionary, topological, and physicochemical information for scoring docked conformations. This work represents the first successful application of graph kernels to protein interfaces for effective discrimination of near-native and non-native conformations of protein complexes. It paves the way for the further development of computational methods for predicting the structure of protein complexes.

iScore: a novel graph kernel-based function for scoring protein–protein docking models

Bioinformatics ◽

10.1093/bioinformatics/btz496 ◽

2019 ◽

Vol 36 (1) ◽

pp. 112-121 ◽

Cited By ~ 9

Author(s):

Cunliang Geng ◽

Yong Jung ◽

Nicolas Renaud ◽

Vasant Honavar ◽

Alexandre M J J Bonvin ◽

...

Keyword(s):

Protein Complexes ◽

Three Dimensional ◽

Protein Docking ◽

Graph Representation ◽

Supplementary Information ◽

Scoring Functions ◽

Computational Docking ◽

3D Structures ◽

Novel Approach ◽

Protein Interfaces

Abstract. Motivation: Protein complexes play critical roles in many aspects of biological functions. Three-dimensional (3D) structures of protein complexes are critical for gaining insights into structural bases of interactions and their roles in the biomolecular pathways that orchestrate key cellular processes. Because of the expense and effort associated with experimental determinations of 3D protein complex structures, computational docking has evolved as a valuable tool to predict 3D structures of biomolecular complexes. Despite recent progress, reliably distinguishing near-native docking conformations from a large number of candidate conformations, the so-called scoring problem, remains a major challenge. Results: Here we present iScore, a novel approach to scoring docked conformations that combines HADDOCK energy terms with a score obtained using a graph representation of the protein–protein interfaces and a measure of evolutionary conservation. It achieves a scoring performance competitive with, or superior to, that of state-of-the-art scoring functions on two independent datasets: (i) docking software-specific models and (ii) the CAPRI score set generated by a wide variety of docking approaches (i.e. docking software-non-specific). iScore ranks among the top scoring approaches on the CAPRI score set (13 targets) when compared with the 37 scoring groups in CAPRI. The results demonstrate the utility of combining evolutionary, topological and energetic information for scoring docked conformations. This work represents the first successful application of graph kernels to protein interfaces for effective discrimination of near-native and non-native conformations of protein complexes. Availability and implementation: The iScore code is freely available from GitHub: https://github.com/DeepRank/iScore (DOI: 10.5281/zenodo.2630567). The docking models used are available from SBGrid: https://data.sbgrid.org/dataset/684.
Supplementary information: Supplementary data are available at Bioinformatics online.
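
To make the phrase "graph representation of the protein–protein interfaces" concrete, the following sketch builds a simple contact graph between residues of two chains. It is only an illustration under stated assumptions: one representative coordinate per residue, a hypothetical 6.0 Å distance cutoff, and hypothetical names throughout. The abstract does not specify how iScore constructs its graphs, and the graph kernel step is not shown.

```python
import math
from typing import List, Set, Tuple

# (chain id, residue number, representative xyz coordinate), all hypothetical
Residue = Tuple[str, int, Tuple[float, float, float]]
Node = Tuple[str, int]


def interface_graph(
    residues: List[Residue], cutoff: float = 6.0
) -> Tuple[List[Node], List[Tuple[Node, Node]]]:
    """Connect residues from different chains that lie within `cutoff`."""
    nodes: Set[Node] = set()
    edges: List[Tuple[Node, Node]] = []
    for i, (chain_a, num_a, xyz_a) in enumerate(residues):
        for chain_b, num_b, xyz_b in residues[i + 1:]:
            if chain_a == chain_b:
                continue  # only inter-chain contacts define the interface
            if math.dist(xyz_a, xyz_b) <= cutoff:
                a, b = (chain_a, num_a), (chain_b, num_b)
                nodes.update((a, b))
                edges.append((a, b))
    return sorted(nodes), edges


# Toy complex: two residues per chain placed along the x axis.
toy = [
    ("A", 1, (0.0, 0.0, 0.0)),
    ("A", 2, (3.0, 0.0, 0.0)),
    ("B", 1, (4.0, 0.0, 0.0)),
    ("B", 2, (20.0, 0.0, 0.0)),
]
nodes, edges = interface_graph(toy)
print(nodes)
print(edges)
```

In a scoring setting, node labels (for example residue type or a conservation profile) would be attached to such a graph before two graphs are compared.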

Quality Assessment of Protein Docking Models Based on Graph Neural Network

Frontiers in Bioinformatics ◽

10.3389/fbinf.2021.693211 ◽

2021 ◽

Vol 1 ◽

Author(s):

Ye Han ◽

Fei He ◽

Yongbing Chen ◽

Wenyuan Qin ◽

Helong Yu ◽

...

Keyword(s):

Neural Network ◽

Quality Assessment ◽

Chemical Properties ◽

Protein Docking ◽

Structural Basis ◽

Docking Model ◽

Testing Dataset ◽

Independent Testing Dataset

Protein docking provides a structural basis for the design of drugs and vaccines. Within the docking process, quality assessment (QA) is used to pick near-native models from numerous candidate conformations, and it directly determines the final docking results. Although extensive efforts have been made to improve QA accuracy, it is still the bottleneck of current protein docking systems. In this paper, we present a Deep Graph Attention Neural Network (DGANN) to evaluate and rank protein docking candidate models. DGANN learns inter-residue physicochemical properties and structural fitness across the two protein monomers in a docking model and outputs the probability that the model is near-native. On the ZDOCK decoy benchmark, DGANN outperformed the ranking provided by ZDOCK in terms of ranking good models into the top selections. Furthermore, we conducted comparative experiments on an independent testing dataset, and the results also demonstrated the superiority and generalization of our proposed method.

New machine learning and physics-based scoring functions for drug discovery

Scientific Reports ◽

10.1038/s41598-021-82410-1 ◽

2021 ◽

Vol 11 (1) ◽

Cited By ~ 1

Author(s):

Isabella A. Guedes ◽

André M. S. Barreto ◽

Diogo Marinho ◽

Eduardo Krempser ◽

Mélaine A. Kuenemann ◽

...

Keyword(s):

Machine Learning ◽

Drug Discovery ◽

Protein Interactions ◽

In Silico ◽

Drug Targets ◽

Support Vector ◽

Scoring Functions ◽

Protein Protein Interactions ◽

Energy Prediction ◽

Original Class

Abstract. Scoring functions are essential for modern in silico drug discovery. However, the accurate prediction of binding affinity by scoring functions remains a challenging task. The performance of scoring functions is very heterogeneous across different target classes. Scoring functions based on precise physics-based descriptors that better represent the protein–ligand recognition process are strongly needed. We developed a set of new empirical scoring functions, named DockTScore, by explicitly accounting for physics-based terms combined with machine learning. Target-specific scoring functions were developed for two important drug targets, proteases and protein–protein interactions, representing an original class of molecules for drug discovery. Multiple linear regression (MLR), support vector machine and random forest algorithms were employed to derive general and target-specific scoring functions involving optimized MMFF94S force-field terms, solvation and lipophilic interaction terms, and an improved term accounting for the ligand torsional entropy contribution to ligand binding. The DockTScore scoring functions proved competitive with the current best-evaluated scoring functions in terms of binding energy prediction and ranking on four DUD-E datasets and will be useful for in silico drug design for diverse proteins as well as for specific targets such as proteases and protein–protein interactions. Currently, the MLR DockTScore is available at www.dockthor.lncc.br.

A knowledge–based scoring function to assess the stability of quaternary protein assemblies

10.1101/562520 ◽

2019 ◽

Cited By ~ 3

Author(s):

Abhilesh S. Dhawanjewar ◽

Ankit Roy ◽

M.S. Madhusudhan

Keyword(s):

Protein Interactions ◽

Binary Classification ◽

Scoring Function ◽

Protein Docking ◽

Scoring Functions ◽

Protein Protein Interactions ◽

Residue Contact ◽

Knowledge Based ◽

Cellular Biochemistry ◽

The Stability

Abstract. Motivation: Elucidation of protein-protein interactions is a necessary step towards understanding the complete repertoire of cellular biochemistry. Given the enormity of the problem and the expense and limitations of experimental methods, it is imperative that this problem be tackled computationally. In silico predictions of protein interactions entail sampling different conformations of the purported complex and then scoring these to assess interaction viability. In this study we have devised a new scheme for scoring protein-protein interactions. Results: Our method, PIZSA (Protein Interaction Z Score Assessment), is a binary classification scheme for identification of stable protein quaternary assemblies (binders/non-binders) based on statistical potentials. The scoring scheme incorporates residue-residue contact preference on the interface with per residue-pair atomic contributions and accounts for clashes. PIZSA can accurately discriminate between native and non-native structural conformations from protein docking experiments and outperforms other recently published scoring functions, as demonstrated through testing on a benchmark set and the CAPRI Score_set. Though not explicitly trained for this purpose, PIZSA potentials can identify spurious interactions that are artefacts of the crystallization process. Availability: PIZSA is implemented as a web server at http://cospi.iiserpune.ac.in/pizsa/. Contact: [email protected]
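
The idea of turning a raw statistical-potential score into a Z score and then into a binder/non-binder call can be sketched as follows. This is a generic illustration, not PIZSA's procedure: the abstract does not say how the background distribution is built, so the background list, the "higher is more favourable" orientation, and the threshold of 1.0 are all assumptions introduced here.

```python
import statistics
from typing import Sequence


def z_score(raw_score: float, background: Sequence[float]) -> float:
    """Standardise a raw score against a background score distribution."""
    mean = statistics.fmean(background)
    std = statistics.pstdev(background)  # population standard deviation
    if std == 0:
        raise ValueError("background scores have zero variance")
    return (raw_score - mean) / std


def is_binder(raw_score: float, background: Sequence[float],
              threshold: float = 1.0) -> bool:
    """Binary call: binder if the Z score reaches the (hypothetical) threshold.

    Assumption: higher raw scores mean more favourable interfaces.
    """
    return z_score(raw_score, background) >= threshold


# Toy background, e.g. scores of decoy interfaces (hypothetical values).
background_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
print(is_binder(5.0, background_scores))
print(is_binder(3.0, background_scores))
```

Standardising against a background makes scores comparable across complexes of different interface sizes, which a raw summed potential is not.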

Atomic-level evolutionary information improves protein-protein interface scoring

Bioinformatics ◽

10.1093/bioinformatics/btab254 ◽

2021 ◽

Author(s):

Chloé Quignot ◽

Pierre Granger ◽

Pablo Chacón ◽

Raphael Guerois ◽

Jessica Andreani

Keyword(s):

Success Rate ◽

Protein Interactions ◽

Protein Docking ◽

Atomic Level ◽

Supplementary Information ◽

Evolutionary Information ◽

General Strategy ◽

Scoring Functions ◽

Success Rates ◽

Novel Strategy

Abstract. Motivation: The crucial role of protein interactions and the difficulty in characterising them experimentally strongly motivate the development of computational approaches for structural prediction. Even when protein-protein docking samples correct models, current scoring functions struggle to discriminate them from incorrect decoys. The previous incorporation of conservation and coevolution information has shown promise for improving protein-protein scoring. Here, we present a novel strategy to integrate atomic-level evolutionary information into different types of scoring functions to improve their docking discrimination. Results: We applied this general strategy to our residue-level statistical potential from InterEvScore and to two atomic-level scores, SOAP-PP and Rosetta interface score (ISC). Including evolutionary information from as few as ten homologous sequences improves the top 10 success rates of the individual atomic-level scores SOAP-PP and Rosetta ISC by 6 and 13.5 percentage points, respectively, on a large benchmark of 752 docking cases. The best individual homology-enriched score reaches a top 10 success rate of 34.4%. A consensus approach based on the complementarity between different homology-enriched scores further increases the top 10 success rate to 40%. Availability: All data used for benchmarking and scoring results, as well as a Singularity container of the pipeline, are available at http://biodev.cea.fr/interevol/interevdata/. Supplementary information: Supplementary data are available at Bioinformatics online.
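
A consensus over several scores exploits the fact that different scores tend to rank different correct models highly. The sketch below shows one simple way to form a consensus top 10, by cycling through each score's ranked list. It is a stand-in chosen for illustration, not the consensus scheme used in the paper, which the abstract does not describe; the function name and model identifiers are hypothetical.

```python
from typing import Dict, List


def round_robin_consensus(rankings: Dict[str, List[str]], n: int = 10) -> List[str]:
    """Pick up to n distinct models by cycling through each score's ranking.

    `rankings` maps a score name to model ids sorted from best to worst.
    """
    selected: List[str] = []
    seen = set()
    longest = max((len(r) for r in rankings.values()), default=0)
    depth = 0
    while len(selected) < n and depth < longest:
        for ranked in rankings.values():
            if depth < len(ranked) and ranked[depth] not in seen:
                seen.add(ranked[depth])
                selected.append(ranked[depth])
                if len(selected) == n:
                    break
        depth += 1
    return selected


# Three hypothetical scores ranking five hypothetical models.
rankings = {
    "score_1": ["m1", "m2", "m3"],
    "score_2": ["m2", "m4", "m1"],
    "score_3": ["m5", "m1", "m2"],
}
print(round_robin_consensus(rankings, n=4))  # ['m1', 'm2', 'm5', 'm4']
```

The selection keeps each score's best model before any score's second-best, so a correct model ranked first by only one score still reaches the consensus list.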

Text mining for modeling of protein complexes enhanced by machine learning

Bioinformatics ◽

10.1093/bioinformatics/btaa823 ◽

2020 ◽

Author(s):

Varsha D Badal ◽

Petras J Kundrotas ◽

Ilya A Vakser

Keyword(s):

Machine Learning ◽

Text Mining ◽

Protein Interactions ◽

Full Text ◽

Protein Complexes ◽

Protein Docking ◽

Supplementary Information ◽

Support Vector ◽

Learning Approaches ◽

Protein Protein Interactions

Abstract. Motivation: Procedures for structural modeling of protein-protein complexes (protein docking) produce a number of models which need to be further analyzed and scored. Scoring can be based on independently determined constraints on the structure of the complex, such as knowledge of amino acids essential for the protein interaction. Previously, we showed that text mining of residues in freely available PubMed abstracts of papers on studies of protein-protein interactions may generate such constraints. However, the absence of post-processing of the spotted residues reduced the usability of the constraints, as a significant number of the residues were not relevant for the binding of the specific proteins. Results: We explored filtering of the irrelevant residues by two machine learning approaches, Deep Recursive Neural Network (DRNN) and Support Vector Machine (SVM) models with different training/testing schemes. The results showed that the DRNN model is superior to the SVM model when training is performed on the PMC-OA full-text articles and applied to classification (interface or non-interface) of the residues spotted in the PubMed abstracts. When both training and testing are performed on full-text articles or on abstracts, the performance of these models is similar. Thus, in such cases, there is no need to use the DRNN approach, which is computationally expensive, especially at the training stage. The reason is that SVM success is often determined by the similarity in data/text patterns in the training and the testing sets, whereas the sentence structures in the abstracts are, in general, different from those in the full-text articles. Availability: The code and the datasets generated in this study are available at https://gitlab.ku.edu/vakser-lab-public/text-mining/-/tree/2020-09-04. Supplementary information: Supplementary data are available at Bioinformatics online.

pyDockEneRes: per-residue decomposition of protein–protein docking energy

Bioinformatics ◽

10.1093/bioinformatics/btz884 ◽

2019 ◽

Vol 36 (7) ◽

pp. 2284-2285 ◽

Cited By ~ 1

Author(s):

Miguel Romero-Durana ◽

Brian Jiménez-García ◽

Juan Fernández-Recio

Keyword(s):

Binding Affinity ◽

Protein Interactions ◽

Structural Model ◽

Protein Complexes ◽

Complex Structure ◽

Protein Docking ◽

Supplementary Information ◽

Scoring Functions ◽

Residue Decomposition ◽

Docking Energy

Abstract. Motivation: Protein–protein interactions are key to understanding biological processes at the molecular level. As a complement to experimental characterization of protein interactions, computational docking methods have become useful tools for the structural and energetic modeling of protein–protein complexes. A key aspect of such algorithms is the use of scoring functions to evaluate the generated docking poses and try to identify the best models. When the scoring functions are based on energetic considerations, they can help not only to provide a reliable structural model for the complex, but also to describe energetic aspects of the interaction. This is the case for the scoring function used in pyDock, a combination of electrostatics, desolvation and van der Waals energy terms. Its correlation with experimental binding affinity values of protein–protein complexes was explored in the past, but the per-residue decomposition of the docking energy was never systematically analyzed. Results: Here, we present pyDockEneRes (pyDock Energy per-Residue), a web server that provides pyDock docking energy partitioned at the residue level, giving a much more detailed description of the docking energy landscape. Additionally, pyDockEneRes computes the contribution to the docking energy of the side-chain atoms. This fast approach can be applied to characterize a complex structure in order to identify energetically relevant residues (hot-spots) and estimate binding affinity changes upon mutation to alanine. Availability and implementation: The server does not require registration by the user and is freely accessible for academics at https://life.bsc.es/pid/pydockeneres. Supplementary information: Supplementary data are available at Bioinformatics online.
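
Partitioning a pairwise interaction energy at the residue level can be illustrated with a short sketch. The equal split of each pair energy between its two residues is an assumption made here for illustration; the abstract does not state how pyDockEneRes assigns energy to residues, and the residue labels and energy values below are invented toy data.

```python
from collections import defaultdict
from typing import Dict, Tuple


def per_residue_energy(pair_energy: Dict[Tuple[str, str], float]) -> Dict[str, float]:
    """Split each receptor-ligand pair energy equally between both residues.

    With this convention the per-residue values sum to the total energy.
    """
    totals: Dict[str, float] = defaultdict(float)
    for (receptor_res, ligand_res), energy in pair_energy.items():
        totals[receptor_res] += energy / 2.0
        totals[ligand_res] += energy / 2.0
    return dict(totals)


# Hypothetical pair energies (arbitrary units); negative means favourable.
pairs = {
    ("A:ARG10", "B:GLU5"): -4.0,
    ("A:ARG10", "B:ASP7"): -2.0,
    ("A:LEU12", "B:GLU5"): 1.0,
}
energies = per_residue_energy(pairs)
hot_spot = min(energies, key=energies.get)  # most favourable residue
print(hot_spot, energies[hot_spot])
```

Ranking residues by their share of the total is what allows candidate hot-spots to be read off directly, as the abstract describes.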

Atomic-level evolutionary information improves protein-protein interface scoring

10.1101/2020.10.26.355073 ◽

2020 ◽

Author(s):

Chloé Quignot ◽

Pierre Granger ◽

Pablo Chacón ◽

Raphael Guerois ◽

Jessica Andreani

Keyword(s):

Success Rate ◽

Protein Interactions ◽

Protein Docking ◽

Atomic Level ◽

Evolutionary Information ◽

General Strategy ◽

Scoring Functions ◽

Success Rates ◽

Novel Strategy ◽

Individual Scores

Abstract. The crucial role of protein interactions and the difficulty in characterising them experimentally strongly motivate the development of computational approaches for structural prediction. Even when protein-protein docking samples correct models, current scoring functions struggle to discriminate them from incorrect decoys. The previous incorporation of conservation and coevolution information has shown promise for improving protein-protein scoring. Here, we present a novel strategy to integrate atomic-level evolutionary information into different types of scoring functions to improve their docking discrimination. We applied this general strategy to our residue-level statistical potential from InterEvScore and to two atomic-level scores, SOAP-PP and Rosetta interface score (ISC). Including evolutionary information from as few as ten homologous sequences improves the top 10 success rates of these individual scores by 6.5, 6 and 13.5 percentage points, respectively, on a large benchmark of 752 docking cases. The best individual homology-enriched score reaches a top 10 success rate of 34.4%. A consensus approach based on the complementarity between different homology-enriched scores further increases the top 10 success rate to 40%. All data used for benchmarking and scoring results, as well as pipelining scripts, are available at http://biodev.cea.fr/interevol/interevdata/

A Random Forest Classifier for Protein-Protein Docking Models

10.1101/2021.06.23.449420 ◽

2021 ◽

Author(s):

Didier Barradas-Bautista ◽

Zhen Cao ◽

Anna Vangone ◽

Romina Oliva ◽

Luigi Cavallo

Keyword(s):

Machine Learning ◽

Random Forest ◽

Protein Complexes ◽

Protein Docking ◽

Random Forest Classifier ◽

Features Selection ◽

Learning Approaches ◽

Scoring Functions ◽

Comparative Performance ◽

Protein Protein Interaction

Herein, we present the results of a machine learning approach we developed to single out correct 3D docking models of protein-protein complexes obtained by popular docking software. To this aim, we generated a set of ~7 × 10^6 docking models with three different docking programs (HADDOCK, FTDock and ZDOCK) for the 230 complexes in the protein-protein interaction benchmark, version 5 (BM5). Three different machine-learning approaches (Random Forest, Support Vector Machine and Perceptron) were used to train classifiers with 158 different scoring functions (features). The Random Forest algorithm outperformed the other two algorithms and was selected for further optimization. Using a feature selection algorithm and optimizing the random forest hyperparameters allowed us to train and validate a random forest classifier, named CoDES (COnservation Driven Expert System). Testing of CoDES on independent datasets, as well as a comparison of its performance with machine-learning methods recently developed in the field for the scoring of docking decoys, confirms its state-of-the-art ability to discriminate correct from incorrect decoys, both in terms of global parameters and in terms of decoys ranked at the top positions.
