PROBselect: accurate prediction of protein-binding residues from proteins sequences via dynamic predictor selection

Fuhao Zhang; Wenbo Shi; Jian Zhang; Min Zeng; Min Li; Lukasz Kurgan

doi:10.1093/bioinformatics/btaa806

Bioinformatics ◽

10.1093/bioinformatics/btaa806 ◽

2020 ◽

Vol 36 (Supplement_2) ◽

pp. i735-i744

Author(s):

Fuhao Zhang ◽

Wenbo Shi ◽

Jian Zhang ◽

Min Zeng ◽

Min Li ◽

...

Keyword(s):

Protein Binding ◽

Protein Interactions ◽

Predictive Performance ◽

Protein Docking ◽

Supplementary Information ◽

Protein Protein Interactions ◽

Cross Prediction ◽

Predictive Quality ◽

Protein Functions ◽

Binding Residues

Abstract

Motivation: Knowledge of protein-binding residues (PBRs) improves our understanding of protein–protein interactions, contributes to the prediction of protein functions and facilitates protein–protein docking calculations. Although many sequence-based predictors of PBRs have been published, they offer modest levels of predictive performance and most of them cross-predict residues that interact with other partners. One unexplored option to improve the predictive quality is to design consensus predictors that combine results produced by multiple methods.

Results: We empirically investigate the predictive performance of a representative set of nine predictors of PBRs. We report substantial differences in predictive quality when these methods are used to predict individual proteins, which contrast with the dataset-level benchmarks that are currently used to assess and compare these methods. Our analysis provides new insights into the cross-prediction concern, dissects complementarity between predictors and demonstrates that the predictive performance of the top methods depends on unique characteristics of the input protein sequence. Using these insights, we developed PROBselect, a first-of-its-kind consensus predictor of PBRs. Our design is based on dynamic predictor selection at the protein level, where the selection relies on regression-based models that accurately estimate the predictive performance of selected predictors directly from the sequence. Empirical assessment using a low-similarity test dataset shows that PROBselect provides significantly improved predictive quality when compared with the current predictors and conventional consensuses that combine residue-level predictions. Moreover, PROBselect informs the users about the expected predictive quality for the prediction generated from a given input protein.

Availability and implementation: PROBselect is available at http://bioinformatics.csu.edu.cn/PROBselect/home/index.

Supplementary information: Supplementary data are available at Bioinformatics online.
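
The protein-level selection described in this abstract can be illustrated with a minimal Python sketch. It is not the PROBselect implementation: the two toy predictors, the two toy quality estimators and the function name `select_and_predict` are all hypothetical placeholders, and the real method uses regression-based models whose details are not given here. The sketch only shows the control flow: estimate each predictor's quality for the input sequence, run the predictor with the highest estimate, and report that estimate alongside the prediction.

```python
from typing import Callable, Dict, List, Tuple

Predictor = Callable[[str], List[float]]  # sequence -> per-residue binding scores
Estimator = Callable[[str], float]        # sequence -> expected predictive quality


def select_and_predict(
    sequence: str,
    predictors: Dict[str, Predictor],
    estimators: Dict[str, Estimator],
) -> Tuple[str, float, List[float]]:
    """Run the predictor whose estimated quality is highest for this protein."""
    estimated = {name: estimators[name](sequence) for name in predictors}
    best = max(estimated, key=estimated.get)
    return best, estimated[best], predictors[best](sequence)


# Toy stand-ins (hypothetical, for illustration only).
def charge_predictor(seq: str) -> List[float]:
    # Scores lysine and arginine as binding, everything else as non-binding.
    return [1.0 if aa in "KR" else 0.0 for aa in seq]


def flat_predictor(seq: str) -> List[float]:
    # Assigns the same score to every residue.
    return [0.5] * len(seq)


predictors = {"charge": charge_predictor, "flat": flat_predictor}
estimators = {
    # Placeholder estimators; the paper uses regression models instead.
    "charge": lambda seq: sum(aa in "KR" for aa in seq) / len(seq),
    "flat": lambda seq: 0.4,
}

name, expected_quality, scores = select_and_predict("KKRA", predictors, estimators)
print(name, expected_quality, scores)
```

Returning the estimate together with the scores mirrors the abstract's point that the user is told what predictive quality to expect for the given protein.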

SCRIBER: accurate and partner type-specific prediction of protein-binding residues from proteins sequences

Bioinformatics ◽

10.1093/bioinformatics/btz324 ◽

2019 ◽

Vol 35 (14) ◽

pp. i343-i353 ◽

Cited By ~ 10

Author(s):

Jian Zhang ◽

Lukasz Kurgan

Keyword(s):

Protein Binding ◽

Protein Interactions ◽

Rna Binding ◽

Protein Complexes ◽

Predictive Performance ◽

Protein Docking ◽

Supplementary Information ◽

Binding Residue ◽

Binding Residues ◽

The Cross

Abstract

Motivation: Accurate prediction of protein-binding residues (PBRs) enhances understanding of molecular-level rules governing protein–protein interactions, helps protein–protein docking and facilitates annotation of protein functions. Recent studies show that current sequence-based predictors of PBRs severely cross-predict residues that interact with other types of partners (e.g. RNA and DNA) as PBRs. Moreover, these methods are relatively slow, prohibiting genome-scale use.

Results: We propose a novel, accurate and fast sequence-based predictor of PBRs that minimizes the cross-predictions. Our SCRIBER (SeleCtive pRoteIn-Binding rEsidue pRedictor) method takes advantage of three innovations: a comprehensive dataset that covers multiple types of binding residues, novel types of inputs that are relevant to the prediction of PBRs, and an architecture that is tailored to reduce the cross-predictions. The dataset includes complete protein chains and offers improved coverage of binding annotations that are transferred from multiple protein–protein complexes. We utilize an innovative two-layer architecture where the first layer generates a prediction of protein-binding, RNA-binding, DNA-binding and small ligand-binding residues. The second layer re-predicts PBRs by reducing overlap between PBRs and the other types of binding residues produced in the first layer. Empirical tests on an independent test dataset reveal that SCRIBER significantly outperforms current predictors and that all three innovations contribute to its high predictive performance. SCRIBER reduces cross-predictions by between 41% and 69% and our conservative estimates show that it is at least 3 times faster. We provide putative PBRs produced by SCRIBER for the entire human proteome and use these results to hypothesize that about 14% of currently known human protein domains bind proteins.

Availability and implementation: The SCRIBER webserver is available at http://biomine.cs.vcu.edu/servers/SCRIBER/.

Supplementary information: Supplementary data are available at Bioinformatics online.
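
The two-layer idea in this abstract can be sketched in a few lines of Python. This is only an illustration of the principle, not the SCRIBER model: the function name `second_layer`, the logistic form and the weights `w_protein`, `w_other` and `bias` are assumptions chosen for the example. The first layer is represented by ready-made per-residue scores for the four partner types, and the second layer lowers the protein-binding score of a residue when another partner type also scores highly for it.

```python
import math
from typing import Dict, List


def second_layer(
    first_layer_scores: List[Dict[str, float]],
    w_protein: float = 4.0,   # hypothetical weight on the protein-binding score
    w_other: float = 3.0,     # hypothetical penalty for other binding types
    bias: float = -2.0,       # hypothetical intercept
) -> List[float]:
    """Re-score protein binding, penalizing overlap with other partner types."""
    rescored = []
    for s in first_layer_scores:
        strongest_other = max(s["rna"], s["dna"], s["ligand"])
        z = bias + w_protein * s["protein"] - w_other * strongest_other
        rescored.append(1.0 / (1.0 + math.exp(-z)))  # logistic squashing to (0, 1)
    return rescored


# Made-up first-layer outputs for two residues with the same protein-binding score.
residues = [
    {"protein": 0.8, "rna": 0.1, "dna": 0.1, "ligand": 0.1},  # protein-specific
    {"protein": 0.8, "rna": 0.9, "dna": 0.2, "ligand": 0.1},  # likely RNA-binding
]
rescored = second_layer(residues)
print([round(v, 3) for v in rescored])
```

With these made-up numbers the first residue keeps a score above 0.5, while the second, which the first layer also flags as RNA-binding, drops below it; this is the kind of cross-prediction the second layer is meant to suppress.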

InterPep2: global peptide–protein docking using interaction surface templates

Bioinformatics ◽

10.1093/bioinformatics/btaa005 ◽

2020 ◽

Vol 36 (8) ◽

pp. 2458-2465 ◽

Cited By ~ 2

Author(s):

Isak Johansson-Åkhe ◽

Claudio Mirabello ◽

Björn Wallner

Keyword(s):

Protein Interactions ◽

Protein Complexes ◽

Structural Features ◽

Protein Docking ◽

Supplementary Information ◽

Peptide Ligand ◽

Protein Protein Interactions ◽

Intrinsically Disordered ◽

Intrinsically Disordered Regions ◽

Improved Performance

Abstract

Motivation: Interactions between proteins and peptides or peptide-like intrinsically disordered regions are involved in many important biological processes, such as gene expression and cell life-cycle regulation. Experimentally determining the structure of such interactions is time-consuming and difficult because of the inherent flexibility of the peptide ligand. Although several prediction methods exist, most are limited in performance or availability.

Results: InterPep2 is a freely available method for predicting the structure of peptide–protein interactions. Improved performance is obtained by using templates from both peptide–protein and regular protein–protein interactions, and by a random forest trained to predict the DockQ-score for a given template using sequence and structural features. When tested on 252 bound peptide–protein complexes from structures deposited after the complexes used in the construction of the training and template sets of InterPep2, InterPep2-Refined correctly positioned 67 peptides within 4.0 Å LRMSD among the top 10, similar to another state-of-the-art template-based method, which positioned 54 peptides correctly. However, InterPep2 displays a superior ability to evaluate the quality of its own predictions. On a previously established set of 27 non-redundant unbound-to-bound peptide–protein complexes, InterPep2 performs on par with leading methods. The extended InterPep2-Refined protocol managed to correctly model 15 of these complexes within 4.0 Å LRMSD among the top 10, without using templates from homologs. In addition, combining the template-based predictions from InterPep2 with ab initio predictions from PIPER-FlexPepDock resulted in 22% more near-native predictions compared to the best single method (22 versus 18).

Availability and implementation: The program is available from: http://wallnerlab.org/InterPep2.

Supplementary information: Supplementary data are available at Bioinformatics online.
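
Two of the numbers in this abstract follow from simple rules that a short Python sketch can make explicit. The helper names `top_n_hit` and `relative_gain` and the LRMSD values below are hypothetical; only the 4.0 Å cutoff, the top-10 window and the counts 22 and 18 come from the abstract. The sketch assumes a target counts as correctly modeled when at least one of its ten best-ranked models is within the cutoff.

```python
from typing import List


def top_n_hit(lrmsd_ranked: List[float], n: int = 10, cutoff: float = 4.0) -> bool:
    """True if any of the n best-ranked models is within the LRMSD cutoff (in Å)."""
    return any(value <= cutoff for value in lrmsd_ranked[:n])


def relative_gain(new_count: int, old_count: int) -> float:
    """Fractional increase of new_count over old_count."""
    return (new_count - old_count) / old_count


# Made-up LRMSD values, ordered by the method's own ranking of its models.
good_target = [7.2, 5.1, 3.6, 9.0]            # third-ranked model is within 4.0 Å
missed_target = [6.5] * 10 + [2.0]            # the only good model is ranked 11th

print(top_n_hit(good_target), top_n_hit(missed_target))
print(round(relative_gain(22, 18) * 100))     # combined method versus best single method
```

The second target shows why ranking quality matters: a near-native model exists, but it falls outside the top 10 and so does not count.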

SAAMBE-SEQ: a sequence-based method for predicting mutation effect on protein–protein binding affinity

Bioinformatics ◽

10.1093/bioinformatics/btaa761 ◽

2020 ◽

Author(s):

Gen Li ◽

Swagata Pahari ◽

Adithya Krishna Murthy ◽

Siqi Liang ◽

Robert Fragoza ◽

...

Keyword(s):

Free Energy ◽

Protein Binding ◽

Binding Affinity ◽

Protein Interactions ◽

Structural Information ◽

Binding Free Energy ◽

Supplementary Information ◽

Sequence Information ◽

Protein Protein Interactions ◽

Genome Scale

Abstract

Motivation: The vast majority of human genetic disorders are associated with mutations that affect protein–protein interactions by altering wild-type binding affinity. Therefore, it is extremely important to assess the effect of mutations on protein–protein binding free energy to assist the development of therapeutic solutions. Currently, the most popular approaches use structural information to deliver the predictions, which precludes their application to genome-scale investigations. Indeed, with the progress of genomic sequencing, researchers frequently need to assess the effect of mutations for which no structure is available.

Results: Here, we report a Gradient Boosting Decision Tree machine learning algorithm, SAAMBE-SEQ, which is completely sequence-based and does not require structural information at all. SAAMBE-SEQ utilizes 80 features representing evolutionary information, sequence-based features and change of physical properties upon mutation at the mutation site. The approach is shown to achieve a Pearson correlation coefficient (PCC) of 0.83 in 5-fold cross validation in a benchmarking test against experimentally determined binding free energy change (ΔΔG). Further, a blind test (no-STRUC) is compiled by collecting experimental ΔΔG upon mutation for protein complexes for which structure is not available, and is used to benchmark SAAMBE-SEQ, resulting in PCC in the range of 0.37–0.46. The accuracy of the SAAMBE-SEQ method is found to be either better than or comparable to the most advanced structure-based methods. SAAMBE-SEQ is very fast, available as a webserver and stand-alone code, and utilizes only sequence information; it is thus applicable to genome-scale investigations of the effect of mutations on protein–protein interactions.

Availability and implementation: SAAMBE-SEQ is available at http://compbio.clemson.edu/saambe_webserver/indexSEQ.php#started.

Supplementary information: Supplementary data are available at Bioinformatics online.
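
The Pearson correlation coefficient used above to compare predicted and experimental ΔΔG is defined as

$$\mathrm{PCC} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}$$

A minimal Python sketch of this formula follows. The function name `pearson` and the ΔΔG values are made up for illustration and are not taken from the SAAMBE-SEQ benchmark.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two samples of equal length with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


# Made-up experimental and predicted ΔΔG values (kcal/mol) for five mutations.
experimental = [0.2, 1.5, -0.4, 2.1, 0.9]
predicted = [0.5, 1.1, 0.0, 1.8, 1.2]
print(round(pearson(experimental, predicted), 2))
```

A PCC of 1 means the predictions rise and fall exactly in step with the measurements, 0 means no linear relationship, and the value is insensitive to a constant offset or rescaling of the predictions.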

Text mining for modeling of protein complexes enhanced by machine learning

Bioinformatics ◽

10.1093/bioinformatics/btaa823 ◽

2020 ◽

Author(s):

Varsha D Badal ◽

Petras J Kundrotas ◽

Ilya A Vakser

Keyword(s):

Machine Learning ◽

Text Mining ◽

Protein Interactions ◽

Full Text ◽

Protein Complexes ◽

Protein Docking ◽

Supplementary Information ◽

Support Vector ◽

Learning Approaches ◽

Protein Protein Interactions

Abstract

Motivation: Procedures for structural modeling of protein-protein complexes (protein docking) produce a number of models which need to be further analyzed and scored. Scoring can be based on independently determined constraints on the structure of the complex, such as knowledge of amino acids essential for the protein interaction. Previously, we showed that text mining of residues in freely available PubMed abstracts of papers on studies of protein-protein interactions may generate such constraints. However, the absence of post-processing of the spotted residues reduced the usability of the constraints, as a significant number of the residues were not relevant for the binding of the specific proteins.

Results: We explored filtering of the irrelevant residues by two machine learning approaches, Deep Recursive Neural Network (DRNN) and Support Vector Machine (SVM) models, with different training/testing schemes. The results showed that the DRNN model is superior to the SVM model when training is performed on the PMC-OA full-text articles and applied to classification (interface or non-interface) of the residues spotted in the PubMed abstracts. When both training and testing are performed on full-text articles or on abstracts, the performance of these models is similar. Thus, in such cases, there is no need to use the DRNN approach, which is computationally expensive, especially at the training stage. The reason is that SVM success is often determined by the similarity in data/text patterns in the training and the testing sets, whereas the sentence structures in the abstracts are, in general, different from those in the full-text articles.

Availability: The code and the datasets generated in this study are available at https://gitlab.ku.edu/vakser-lab-public/text-mining/-/tree/2020-09-04.

Supplementary information: Supplementary data are available at Bioinformatics online.

Assessing the performance of the MM/PBSA and MM/GBSA methods. 6. Capability to predict protein–protein binding free energies and re-rank binding poses generated by protein–protein docking

Physical Chemistry Chemical Physics ◽

10.1039/c6cp03670h ◽

2016 ◽

Vol 18 (32) ◽

pp. 22129-22139 ◽

Cited By ~ 160

Author(s):

Fu Chen ◽

Hui Liu ◽

Huiyong Sun ◽

Peichen Pan ◽

Youyong Li ◽

...

Keyword(s):

Protein Binding ◽

Protein Interactions ◽

Protein Docking ◽

Biological Processes ◽

Free Energies ◽

Protein Protein Interactions ◽

Binding Free Energies

Understanding protein–protein interactions (PPIs) is important for elucidating crucial biological processes and even for designing compounds that interfere with PPIs with pharmaceutical significance.

Machine learning empowers phosphoproteome prediction in cancers

Bioinformatics ◽

10.1093/bioinformatics/btz639 ◽

2019 ◽

Vol 36 (3) ◽

pp. 859-864 ◽

Cited By ~ 2

Author(s):

Hongyang Li ◽

Yuanfang Guan

Keyword(s):

Signaling Pathways ◽

Cancer Patients ◽

Protein Interactions ◽

Supplementary Information ◽

Protein Protein Interactions ◽

Post Translational Modification ◽

Cellular Processes ◽

Testing Dataset ◽

Protein Functions ◽

Cancer Tissues

Abstract

Motivation: Reversible protein phosphorylation is an essential post-translational modification regulating protein functions and signaling pathways in many cellular processes. Aberrant activation of signaling pathways often contributes to cancer development and progression. The mass spectrometry-based phosphoproteomics technique is a powerful tool to investigate the site-level phosphorylation of the proteome in a global fashion, paving the way for understanding the regulatory mechanisms underlying cancers. However, this approach is time-consuming and requires expensive instruments, specialized expertise and a large amount of starting material. An alternative in silico approach is predicting the phosphoproteomic profiles of cancer patients from the available proteomic, transcriptomic and genomic data.

Results: Here, we present a winning algorithm in the 2017 NCI-CPTAC DREAM Proteogenomics Challenge for predicting phosphorylation levels of the proteome across cancer patients. We integrate four components into our algorithm, including (i) baseline correlations between protein and phosphoprotein abundances, (ii) universal protein–protein interactions, (iii) shareable regulatory information across cancer tissues and (iv) associations among multi-phosphorylation sites of the same protein. When tested on a large held-out testing dataset of 108 breast and 62 ovarian cancer samples, our method ranked first in both cancer tissues, demonstrating its robustness and generalization ability.

Availability and implementation: Our code and reproducible results are freely available on GitHub: https://github.com/GuanLab/phosphoproteome_prediction.

Supplementary information: Supplementary data are available at Bioinformatics online.

A knowledge-based scoring function to assess quaternary associations of proteins

Bioinformatics ◽

10.1093/bioinformatics/btaa207 ◽

2020 ◽

Vol 36 (12) ◽

pp. 3739-3748

Author(s):

Abhilesh S Dhawanjewar ◽

Ankit A Roy ◽

Mallur S Madhusudhan

Keyword(s):

Protein Interactions ◽

Statistical Physics ◽

Binary Classification ◽

Scoring Function ◽

Protein Docking ◽

Supplementary Information ◽

Scoring Functions ◽

Biological Interactions ◽

Protein Protein Interactions ◽

Knowledge Based

Abstract

Motivation: The elucidation of all inter-protein interactions would significantly enhance our knowledge of cellular processes at a molecular level. Given the enormity of the problem and the expense and limitations of experimental methods, it is imperative that this problem is tackled computationally. In silico predictions of protein interactions entail sampling different conformations of the purported complex and then scoring these to assess interaction viability. In this study, we have devised a new scheme for scoring protein–protein interactions.

Results: Our method, PIZSA (Protein Interaction Z-Score Assessment), is a binary classification scheme for identification of native protein quaternary assemblies (binders/nonbinders) based on statistical potentials. The scoring scheme incorporates residue–residue contact preference on the interface with per residue-pair atomic contributions and accounts for clashes. PIZSA can accurately discriminate between native and non-native structural conformations from protein docking experiments and outperforms other contact-based potential scoring functions. The method has been extensively benchmarked and is among the top 6 methods, outperforming 31 other statistical, physics-based and machine learning scoring schemes. The PIZSA potentials can also distinguish crystallization artifacts from biological interactions.

Availability and implementation: PIZSA is implemented as a web server at http://cospi.iiserpune.ac.in/pizsa and can be downloaded as a standalone package from http://cospi.iiserpune.ac.in/pizsa/Download/Download.html.

Supplementary information: Supplementary data are available at Bioinformatics online.
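
As the name of the method suggests, a Z-score expresses how far a raw score lies from a reference distribution, in units of that distribution's standard deviation. The Python sketch below shows a generic Z-score followed by a threshold-based binder/nonbinder call. It is an illustration under stated assumptions, not the PIZSA scoring function: the abstract does not specify the background distribution or the decision threshold, so the names `z_score` and `is_binder`, the background scores and the threshold of 1.5 are all hypothetical.

```python
import statistics
from typing import Sequence


def z_score(raw_score: float, background: Sequence[float]) -> float:
    """Standardize a raw score against a background sample of scores."""
    mean = statistics.mean(background)
    spread = statistics.pstdev(background)  # population standard deviation
    if spread == 0:
        raise ValueError("background scores must not all be identical")
    return (raw_score - mean) / spread


def is_binder(raw_score: float, background: Sequence[float], threshold: float = 1.5) -> bool:
    """Binary call: binder if the Z-score exceeds a (hypothetical) threshold."""
    return z_score(raw_score, background) > threshold


# Made-up interface scores for a set of non-native reference complexes.
background_scores = [2.0, 4.0, 6.0, 8.0]
print(round(z_score(10.0, background_scores), 3), is_binder(10.0, background_scores))
```

Standardizing in this way makes scores from interfaces of different size or composition comparable, which is the usual motivation for reporting a Z-score instead of a raw statistical-potential value.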

Rtpca: an R package for differential thermal proximity coaggregation analysis

Bioinformatics ◽

10.1093/bioinformatics/btaa682 ◽

2020 ◽

Cited By ~ 1

Author(s):

Nils Kurzawa ◽

André Mateus ◽

Mikhail M Savitski

Keyword(s):

Protein Interactions ◽

Predictive Performance ◽

R Package ◽

Supplementary Information ◽

Supplementary Data ◽

Protein Protein Interactions ◽

Proteome Profiling ◽

R Packages ◽

User Friendly

Abstract

Summary: Rtpca is an R package implementing methods for inferring protein–protein interactions (PPIs) based on thermal proteome profiling experiments of a single condition or in a differential setting via an approach called thermal proximity coaggregation. It offers user-friendly tools to explore datasets for their PPI predictive performance and easily integrates with available R packages.

Availability and implementation: Rtpca is available from Bioconductor (https://bioconductor.org/packages/Rtpca).

Supplementary information: Supplementary data are available at Bioinformatics online.
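
The intuition behind thermal proximity coaggregation is that interacting proteins tend to show similar melting curves. The Python sketch below measures that similarity as the Euclidean distance between two curves sampled at the same temperatures. It does not use or reproduce the Rtpca API, which is an R package; the function name `curve_distance`, the protein names and the soluble-fraction values are hypothetical and serve only to illustrate the principle.

```python
import math
from typing import Dict, List, Sequence


def curve_distance(curve_a: Sequence[float], curve_b: Sequence[float]) -> float:
    """Euclidean distance between two melting curves on the same temperature grid."""
    if len(curve_a) != len(curve_b):
        raise ValueError("curves must be sampled at the same temperatures")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(curve_a, curve_b)))


# Made-up soluble fractions at four increasing temperatures.
melting_curves: Dict[str, List[float]] = {
    "protA": [1.0, 0.8, 0.3, 0.1],
    "protB": [1.0, 0.7, 0.3, 0.1],   # melts almost like protA
    "protC": [1.0, 1.0, 0.9, 0.6],   # much more thermostable
}

dist_ab = curve_distance(melting_curves["protA"], melting_curves["protB"])
dist_ac = curve_distance(melting_curves["protA"], melting_curves["protC"])
print(round(dist_ab, 3), round(dist_ac, 3))
```

A small distance, as for the first pair, is the kind of signal that would be read as evidence for an interaction; in a differential setting one would compare such distances between conditions.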

An assessment of machine and statistical learning approaches to inferring networks of protein-protein interactions

Journal of Integrative Bioinformatics ◽

10.1515/jib-2006-41 ◽

2006 ◽

Vol 3 (2) ◽

pp. 230-246 ◽

Cited By ~ 5

Author(s):

Fiona Browne ◽

Haiying Wang ◽

Huiru Zheng ◽

Francisco Azuaje

Keyword(s):

Statistical Learning ◽

Protein Interactions ◽

Predictive Power ◽

Predictive Performance ◽

Biological Data ◽

Classification Methods ◽

Protein Protein Interactions ◽

Genome Database ◽

Predictive Quality ◽

Auc Value

Abstract

Protein-protein interactions (PPI) play a key role in many biological systems. Over the past few years, an explosion in availability of functional biological data obtained from high-throughput technologies to infer PPI has been observed. However, results obtained from such experiments show high rates of false-positive and false-negative predictions as well as systematic predictive bias. Recent research has revealed that several machine and statistical learning methods applied to integrate relatively weak, diverse sources of large-scale functional data may provide improved predictive accuracy and coverage of PPI. In this paper we describe the effects of applying different computational, integrative methods to predict PPI in Saccharomyces cerevisiae. We investigated the predictive ability of combining different sets of relatively strong and weak predictive datasets. We analysed several genomic datasets ranging from mRNA co-expression to marginal essentiality. Moreover, we expanded an existing multi-source dataset from S. cerevisiae by constructing a new set of putative interactions extracted from Gene Ontology (GO)-driven annotations in the Saccharomyces Genome Database. Three classification techniques were evaluated: Simple Naive Bayesian (SNB), Multilayer Perceptron (MLP) and K-Nearest Neighbors (KNN). Relatively simple classification methods (i.e. less computing intensive and mathematically complex), such as SNB, have been proven to be proficient at predicting PPI. SNB produced the “highest” predictive quality, obtaining an area under the Receiver Operating Characteristic (ROC) curve (AUC) value of 0.99. The lowest AUC value of 0.90 was obtained by the KNN classifier. This assessment also demonstrates the strong predictive power of GO-driven models, which offered predictive performance above 0.90 using the different machine learning and statistical techniques. As the predictive power of single-source datasets became weaker, MLP and SNB performed better than KNN. Moreover, predictive performance saturation may be reached independently of the classification models applied, which may be explained by predictive bias and incompleteness of existing “Gold Standards”. More comprehensive and accurate PPI maps will be produced for S. cerevisiae and beyond with the emergence of large-scale datasets of better predictive quality and the integration of intelligent classification methods.
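
The AUC values quoted in this abstract have a simple interpretation: the AUC equals the probability that a randomly chosen true interaction receives a higher score than a randomly chosen non-interaction, with ties counted as one half. The Python sketch below computes it directly from that definition. The function name `auc` and the example scores are made up and are not data from the study.

```python
from typing import Sequence


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Made-up classifier scores for four candidate protein pairs (1 = interacting).
labels = [1, 1, 0, 0]
perfect_scores = [0.9, 0.8, 0.3, 0.1]   # every positive outranks every negative
mixed_scores = [0.9, 0.1, 0.8, 0.3]     # one positive is ranked below both negatives
print(auc(perfect_scores, labels), auc(mixed_scores, labels))
```

An AUC of 0.5 corresponds to random ranking, so the reported range of 0.90 to 0.99 indicates that all three classifiers ranked true interactions above non-interactions in the large majority of pairwise comparisons.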

Decoding Protein-protein Interactions: An Overview

Current Topics in Medicinal Chemistry ◽

10.2174/1568026620666200226105312 ◽

2020 ◽

Vol 20 (10) ◽

pp. 855-882

Author(s):

Olivia Slater ◽

Bethany Miller ◽

Maria Kontoyianni

Keyword(s):

Drug Discovery ◽

Protein Interactions ◽

Drug Repurposing ◽

Protein Docking ◽

Target Space ◽

Protein Protein Interactions ◽

X Ray Crystallography ◽

Protein Protein Interaction ◽

Interaction Sites ◽

Long Time

Drug discovery has focused on the paradigm “one drug, one target” for a long time. However, small molecules can act at multiple macromolecular targets, which serves as the basis for drug repurposing. In an effort to expand the target space, and given advances in X-ray crystallography, protein-protein interactions have become an emerging focus area of drug discovery enterprises. Proteins interact with other biomolecules and it is this intricate network of interactions that determines the behavior of the system and its biological processes. In this review, we briefly discuss networks in disease, followed by computational methods for protein-protein complex prediction. Computational methodologies and techniques employed towards objectives such as protein-protein docking, protein-protein interactions, and interface predictions are described extensively. Docking aims at producing a complex between proteins, while interface predictions identify a subset of residues on one protein that could interact with a partner, and protein-protein interaction sites address whether two proteins interact. In addition, approaches to predict hot spots and binding sites are presented along with a representative example of our internal project on the chemokine CXC receptor 3 B-isoform and predictive modeling with IP10 and PF4.
