Developing Computational Model to Predict Protein-Protein Interaction Sites Based on the XGBoost Algorithm

The study of protein-protein interactions is of great biological significance, and predicting protein-protein interaction sites can improve the understanding of cellular activity and aid drug development. However, interaction and non-interaction sites are usually unevenly distributed because only a small number of protein interactions have been confirmed experimentally, and this imbalance greatly reduces the predictive capability of computational methods. In this work, two imbalanced-data processing strategies based on the XGBoost algorithm are proposed; they re-balance the original dataset using the inherent relationship between positive and negative samples. A feature extraction method based on the evolutionary conservation of proteins is applied to represent the interaction sites, and the influence of overlapping regions between positive and negative samples on prediction performance is considered. The method achieved a prediction accuracy of 0.807 and an MCC of 0.614 on an original dataset with 10,455 surface residues but only 2297 interface residues. These experimental results demonstrate the effectiveness of the XGBoost-based method.
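To make the imbalance and the reported metrics concrete, the following is a minimal sketch, not the authors' re-balancing strategy. It assumes the 2297 interface residues are a subset of the 10,455 surface residues (the abstract does not state this explicitly), and the all-negative confusion matrix is a hypothetical baseline chosen only to show why MCC is reported alongside accuracy.

```python
import math

def class_balance(n_total, n_positive):
    """Return (number of negatives, negative-to-positive ratio)."""
    n_negative = n_total - n_positive
    return n_negative, n_negative / n_positive

def accuracy(tp, fp, tn, fn):
    """Fraction of all samples that are classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)

def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; defined as 0.0 when the denominator is 0."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom

# Assumption: interface residues are counted among the surface residues.
n_negative, ratio = class_balance(10455, 2297)
print(n_negative, round(ratio, 2))

# Hypothetical baseline: a classifier that labels every residue as non-interface.
baseline = dict(tp=0, fp=0, tn=n_negative, fn=2297)
print(round(accuracy(**baseline), 3), mcc(**baseline))
```

Under that assumption, the trivial baseline already reaches an accuracy of about 0.78 while its MCC is 0, which is why accuracy alone says little on imbalanced data. A ratio of this kind (negatives divided by positives) is also what is commonly passed to XGBoost's `scale_pos_weight` parameter, although the abstract does not say whether the authors used it.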

Prediction of Protein–Protein Interaction Sites Based on Stratified Attentional Mechanisms

Frontiers in Genetics, 2021, Vol 12. DOI: 10.3389/fgene.2021.784863

Author(s): Minli Tang, Longxin Wu, Xinyu Yu, Zhaoqi Chu, Shuting Jin, ...

Keyword(s): Protein Interaction, Protein Interactions, Human Life, Amino Acid Position, Biological Macromolecules, Protein Protein Interaction, Interaction Sites, Scoring Matrix, Protein Interaction Sites, Better Than

Proteins are the basic substances that carry out human life activities, and they often perform their biological functions, such as cell transmission and signal transduction, through interactions with other biological macromolecules. Predicting the interaction sites between proteins can deepen the understanding of how proteins interact, but traditional experimental methods are time-consuming and labor-intensive. In this study, a new hierarchical attention network, named HANPPIS, is proposed to predict protein–protein interaction (PPI) sites. It combines six features: protein sequence, position-specific scoring matrix (PSSM), secondary structure, pre-training vector, hydrophilicity, and amino acid position. Experiments showed that the model outperformed existing advanced computational methods. More importantly, the double-layer attention mechanism improves the interpretability of the model and, to a certain extent, addresses the “black box” problem of deep neural networks, so the model can serve as a reference for locating interaction sites at the biological level.
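The interpretability claim rests on attention weights being readable as importance scores. The following is a minimal sketch of softmax attention pooling applied at two levels; it is not the HANPPIS architecture. The abstract does not say what the two attention layers attend over, so the choice of feature groups first and window positions second, the toy vectors, and the fixed context vectors (which would be learned in a real network) are all assumptions made for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(vectors, context):
    """Score each vector by its dot product with a context vector,
    normalise the scores with softmax, and return the weighted sum
    together with the weights."""
    scores = [sum(v_i * c_i for v_i, c_i in zip(v, context)) for v in vectors]
    weights = softmax(scores)
    dim = len(vectors[0])
    pooled = [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]
    return pooled, weights

# Hypothetical toy input: a window of 3 residues, each described by
# 2 feature-group vectors that were already embedded to dimension 2.
window = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[2.0, 1.0], [0.5, 0.5]],
    [[0.0, 0.0], [1.0, 1.0]],
]
feature_context = [1.0, 0.0]    # hypothetical; learned in a real model
position_context = [0.5, 0.5]   # hypothetical; learned in a real model

# Level 1: attend over the feature groups of each residue.
residue_vectors, feature_weights = [], []
for groups in window:
    pooled, weights = attention_pool(groups, feature_context)
    residue_vectors.append(pooled)
    feature_weights.append(weights)

# Level 2: attend over the residues in the window.
window_vector, position_weights = attention_pool(residue_vectors, position_context)
print(feature_weights)
print(position_weights)
```

Because each set of weights sums to 1, the first level can be read as "which feature group mattered for this residue" and the second as "which position in the window mattered for the prediction".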

Semi-supervised prediction of protein interaction sites from unlabeled sample information

BMC Bioinformatics, 2019, Vol 20 (S25). DOI: 10.1186/s12859-019-3274-7. Cited by ~1

Author(s): Ye Wang, Changqing Mei, Yuming Zhou, Yan Wang, Chunhou Zheng, ...

Keyword(s): Support Vector Machine, Protein Interaction, Protein Interactions, Prediction Performance, Support Vector, Sequence Information, Biological Processes, Interaction Sites, Unlabeled Sample, Protein Interaction Sites

Background: The recognition of protein interaction sites is of great significance for many biological processes, signaling pathways, and drug design. However, most sites on protein sequences cannot be labeled as interface or non-interface sites because only a small fraction of protein interactions have been identified, which limits the prediction accuracy and generalization ability of predictors. It is therefore necessary to improve the prediction of protein interaction sites by using large amounts of unlabeled data together with small amounts of labeled data and background knowledge.

Results: In this work, three semi-supervised support vector machine-based methods are proposed that incorporate information from unlabeled protein sites to improve protein interaction site prediction. Five features related to the evolutionary conservation of amino acids are extracted from the HSSP database and the ConSurf Server to represent the residues within a protein sequence: residue spatial sequence spectrum, residue sequence information entropy, relative entropy, residue sequence conserved weight, and residue evolution rate. Three predictors are then built to identify interface residues on the protein surface using three types of semi-supervised support vector machine algorithms.

Conclusion: The experimental results demonstrate that the semi-supervised approaches effectively improve the prediction of protein interaction sites when unlabeled information is incorporated into the predictors. The best of the three achieves an accuracy of 70.7%, a sensitivity of 62.67%, and a specificity of 78.72%. Compared with existing studies, the semi-supervised models show improved prediction performance.
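Two of the conservation features named above, sequence information entropy and relative entropy, have standard textbook definitions that can be sketched as follows. The abstract does not give the exact formulas or background distribution used with HSSP, so the gap-free alignment columns and the uniform background over the 20 standard amino acids are assumptions for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def column_frequencies(column):
    """Relative frequency of each amino acid in one alignment column
    (assumed to contain no gap characters)."""
    counts = Counter(column)
    n = len(column)
    return {aa: c / n for aa, c in counts.items()}

def shannon_entropy(column):
    """Shannon entropy in bits; 0 for a perfectly conserved column."""
    return -sum(p * math.log2(p) for p in column_frequencies(column).values())

def relative_entropy(column, background):
    """Kullback-Leibler divergence (bits) of the column distribution
    from a background amino acid distribution."""
    freqs = column_frequencies(column)
    return sum(p * math.log2(p / background[aa]) for aa, p in freqs.items())

# Assumption: uniform background; real pipelines typically use observed frequencies.
uniform = {aa: 1 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}

conserved = "LLLLLLLL"   # hypothetical column from 8 aligned sequences
variable = "LIVAGSTK"    # hypothetical column with 8 different residues

print(shannon_entropy(variable), relative_entropy(conserved, uniform))
```

A conserved column has low entropy and high relative entropy against the background, which is the signal such features use to flag functionally constrained residues.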

A Deep Learning and XGBoost-Based Method for Predicting Protein-Protein Interaction Sites

Frontiers in Genetics, 2021, Vol 12. DOI: 10.3389/fgene.2021.752732

Author(s): Pan Wang, Guiyang Zhang, Zu-Guo Yu, Guohua Huang

Keyword(s): Deep Learning, Protein Interaction, Protein Interactions, Gradient Boosting, Protein Protein Interactions, Global Features, Protein Protein Interaction, Interaction Sites, Extreme Gradient Boosting, Protein Interaction Sites

Knowledge about protein-protein interactions is beneficial in understanding cellular mechanisms. Protein-protein interactions are usually determined according to their protein-protein interaction sites. Due to the limitations of current techniques, it is still a challenging task to detect protein-protein interaction sites. In this article, we presented a method based on deep learning and XGBoost (called DeepPPISP-XGB) for predicting protein-protein interaction sites. The deep learning model served as a feature extractor to remove redundant information from protein sequences. The Extreme Gradient Boosting algorithm was used to construct a classifier for predicting protein-protein interaction sites. The DeepPPISP-XGB achieved the following results: area under the receiver operating characteristic curve of 0.681, a recall of 0.624, and area under the precision-recall curve of 0.339, being competitive with the state-of-the-art methods. We also validated the positive role of global features in predicting protein-protein interaction sites.
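The two area metrics reported above can be computed from ranked prediction scores. The following is a minimal sketch using pairwise comparison for the ROC area and average precision as one common estimator of the precision-recall area; the labels and scores are hypothetical toy values, and the abstract does not state which estimator the authors used.

```python
def auroc(labels, scores):
    """Probability that a random positive is scored above a random
    negative, counting ties as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def average_precision(labels, scores):
    """Mean of the precision values at the rank of each positive sample."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive sample")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / n_pos

# Hypothetical toy predictions: 1 = interaction site, 0 = non-interaction site.
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
print(auroc(labels, scores), average_precision(labels, scores))
```

When reading such numbers, note that a random ranking gives an ROC area of about 0.5 regardless of class balance, whereas its precision-recall area is roughly the fraction of positives in the data, so the two values are not on the same scale.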
