Multimodal deep representation learning for protein interaction identification and protein family classification

2019 ◽  
Vol 20 (S16) ◽  
Author(s):  
Da Zhang ◽  
Mansur Kabuka

Abstract Background: Protein-protein interactions (PPIs) are involved in dynamic pathological and biological processes throughout life. A thorough understanding of PPIs is therefore crucial for explaining disease occurrence, achieving optimal drug-target therapeutic effects, and describing protein complex structures. However, compared with the protein sequences available from various species and organisms, the number of known protein-protein interactions is relatively limited. To address this gap, considerable research effort has gone into facilitating the discovery of novel PPIs. Among these methods, PPI prediction techniques that rely only on protein sequence data are more widespread than methods that require extensive biological domain knowledge. Results: In this paper, we propose a multi-modal deep representation learning structure that combines protein physicochemical features with graph topological features from the PPI networks. Specifically, our method takes the protein sequence information into account and also learns a topological representation for each protein node in the PPI networks. We construct a stacked auto-encoder architecture together with a continuous bag-of-words (CBOW) model based on generated metapaths to study PPI prediction. We then use supervised deep neural networks to identify PPIs and classify protein families. The PPI prediction accuracy for eight species ranged from 96.76% to 99.77%, which indicates that our multi-modal deep representation learning framework achieves superior performance compared to other computational methods. Conclusion: To the best of our knowledge, this is the first multi-modal deep representation learning framework for examining PPI networks.

2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Yang Li ◽  
Zheng Wang ◽  
Li-Ping Li ◽  
Zhu-Hong You ◽  
Wen-Zhun Huang ◽  
...  

Abstract Various biochemical functions of organisms are performed by protein–protein interactions (PPIs). Recognizing protein–protein interactions is therefore very important for understanding most life activities, such as DNA replication and transcription, protein synthesis and secretion, signal transduction and metabolism. Although high-throughput technology makes it possible to generate large-scale PPI data, it is costly in both time and labor and carries a risk of a high false positive rate. The biology community is therefore looking for computational methods that can discover large volumes of protein interaction data quickly and efficiently. In this paper, we propose a computational method for predicting PPIs from protein sequence information that combines orthogonal locality preserving projections (OLPP) with a rotation forest (RoF) model. Specifically, the protein sequence is first converted into position-specific scoring matrices (PSSMs), which contain protein evolutionary information, using the Position-Specific Iterated Basic Local Alignment Search Tool (PSI-BLAST). We then characterize each protein as a fixed-length feature vector by applying OLPP to the PSSMs. Finally, we train an RoF classifier to distinguish interacting from non-interacting protein pairs. The proposed method yielded significantly better results than existing methods, with 90.07% and 96.09% prediction accuracy on the Yeast and Human datasets, respectively. Our experiments show that the proposed method can serve as a useful tool to accelerate the process of solving key problems in proteomics.


2020 ◽  
Vol 2 (2) ◽  
Author(s):  
Guangyu Zhou ◽  
Muhao Chen ◽  
Chelsea J T Ju ◽  
Zheng Wang ◽  
Jyun-Yu Jiang ◽  
...  

Abstract The functional impact of protein mutations is reflected in altered conformation and thermodynamics of protein–protein interactions (PPIs). Quantifying the changes in two interacting proteins upon mutation is commonly carried out by computational approaches. Hence, extensive research effort has been put into extracting energetic or structural features of proteins, followed by statistical learning methods to estimate the effects of mutations on PPI properties. Nonetheless, such features require extensive human labor and expert knowledge to obtain, and have limited ability to reflect point mutations. We present an end-to-end deep learning framework, MuPIPR (Mutation Effects in Protein–protein Interaction PRediction Using Contextualized Representations), to estimate the effects of mutations on PPIs. MuPIPR incorporates a contextualized representation mechanism of amino acids to propagate the effects of a point mutation to surrounding amino acid representations, thereby amplifying the subtle change in a long protein sequence. On top of that, MuPIPR leverages a Siamese residual recurrent convolutional neural encoder to encode a wild-type protein pair and its mutation pair. Multi-layer perceptron regressors are applied to the protein pair representations to predict the quantifiable changes of PPI properties upon mutation. Experimental evaluations show that, with only sequence information, MuPIPR outperforms various state-of-the-art systems in estimating the changes of binding affinity for SKEMPI v1, and offers comparable performance on SKEMPI v2. MuPIPR also demonstrates state-of-the-art performance in estimating the changes of buried surface areas. The software implementation is available at https://github.com/guangyu-zhou/MuPIPR.


Molecules ◽  
2020 ◽  
Vol 25 (8) ◽  
pp. 1841 ◽  
Author(s):  
Da Xu ◽  
Hanxiao Xu ◽  
Yusen Zhang ◽  
Wei Chen ◽  
Rui Gao

Identification of protein-protein interactions (PPIs) plays an essential role in the understanding of protein functions and cellular biological activities. However, traditional experiment-based methods are time-consuming and laborious. Developing new, reliable computational approaches therefore has great practical significance for the identification of PPIs. In this paper, a novel method named PPI-GE is proposed for predicting PPIs using graph energy. In particular, for feature extraction we designed two new methods: the physicochemical graph energy, based on the ionization equilibrium constant and isoelectric point, and the contact graph energy, based on the contact information of amino acids. The dipeptide composition method was used to capture the order information of amino acids. After multi-information fusion, principal component analysis (PCA) was applied to eliminate noise, and a robust weighted sparse representation-based classification (WSRC) classifier was applied for sample classification. The prediction accuracies based on five-fold cross-validation on the human, Helicobacter pylori (H. pylori), and yeast data sets were 99.49%, 97.15%, and 99.56%, respectively. In addition, on five independent data sets and two significant PPI networks, the comparative experimental results also demonstrate that PPI-GE obtained better performance than the compared methods.
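To make the dipeptide composition step concrete, here is a minimal sketch in Python. It assumes the common definition of dipeptide composition (normalized counts of the 400 ordered pairs of adjacent standard amino acids); the abstract does not spell out the exact variant PPI-GE uses, and the name `dipeptide_composition` is hypothetical.

```python
from itertools import product

# The 20 standard amino acids; residues outside this set are skipped (assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered residue pairs, in a fixed order that defines the vector layout.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(sequence):
    """Return a 400-dimensional vector of adjacent-residue pair frequencies."""
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(sequence) - 1):
        pair = sequence[i:i + 2]
        if pair in counts:  # ignore pairs containing non-standard residues
            counts[pair] += 1
            total += 1
    if total == 0:
        return [0.0] * len(DIPEPTIDES)
    return [counts[d] / total for d in DIPEPTIDES]


vector = dipeptide_composition("ACAC")  # pairs: AC, CA, AC
print(len(vector), vector[DIPEPTIDES.index("AC")])
```

Because the pairs are ordered, "AC" and "CA" occupy different positions in the vector, which is how the descriptor retains local order information.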


Cells ◽  
2019 ◽  
Vol 8 (2) ◽  
pp. 122 ◽  
Author(s):  
Yanbin Wang ◽  
Zhu-Hong You ◽  
Shan Yang ◽  
Xiao Li ◽  
Tong-Hai Jiang ◽  
...  

Many life activities and key functions in organisms are maintained by different types of protein–protein interactions (PPIs). To accelerate the discovery of PPIs for different species, many computational methods have been developed. Unfortunately, even though computational methods are constantly evolving, efficient methods for predicting PPIs from protein sequence information have not been found for many years because of limiting factors in both methodology and technology. Inspired by the similarity between biological sequences and languages, developing a biological language processing technology may provide a brand new theoretical perspective and a feasible method for the study of biological sequences. In this paper, a pure biological language processing model is proposed for predicting protein–protein interactions using only protein sequences. The model is built on a feature representation method for biological sequences called bio-to-vector (Bio2Vec) and a convolutional neural network (CNN). Bio2Vec obtains protein sequence features by using a “bio-word” segmentation system and a word representation model that learns a distributed representation for each “bio-word”. Bio2Vec provides a framework that allows researchers to consider the context information and implicit semantic information of a biological sequence. A remarkable improvement in PPI prediction performance has been observed with the proposed model compared with state-of-the-art methods. The presentation of this approach marks the start of “bio language processing technology,” which could cause a technological revolution and could be applied to improve the quality of predictions in other problems.
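The idea of treating a protein sequence as a sentence of “bio-words” can be illustrated with a small Python sketch. The abstract does not describe how Bio2Vec segments sequences, so the code below swaps in fixed-length overlapping k-mers as a simple stand-in; the function names and the choice `k=3` are assumptions made purely for illustration.

```python
def segment_bio_words(sequence, k=3):
    """Split a sequence into overlapping k-mers (a stand-in for bio-word segmentation)."""
    if k < 1:
        raise ValueError("k must be positive")
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def build_vocabulary(sentences):
    """Assign each distinct word an integer id in order of first appearance."""
    vocab = {}
    for sentence in sentences:
        for word in sentence:
            vocab.setdefault(word, len(vocab))
    return vocab


sentences = [segment_bio_words(s) for s in ("MKVLA", "KVLAG")]
vocab = build_vocabulary(sentences)
encoded = [[vocab[word] for word in sentence] for sentence in sentences]
print(sentences[0], encoded)
```

The resulting token lists are the kind of input a word representation model expects; the learned vector for each token would then be fed to a downstream classifier such as the CNN the abstract mentions.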


2019 ◽  
Vol 27 (01) ◽  
pp. 1-18
Author(s):  
YUANMIAO GUI ◽  
RUJING WANG ◽  
YUANYUAN WEI ◽  
XUE WANG

Protein–protein interaction (PPI) is very important for various biological processes and has given rise to a series of computational prediction methods. Despite the variety of computational methods for PPI prediction, PPI network projects fail to perform on a large scale. To ensure that PPIs can be predicted effectively, we used a deep neural network (DNN) to predict PPIs from amino acid sequences. We present a novel DNN-PPI model with an auto covariance (AC) descriptor and a conjoint triad (CT) descriptor for PPI prediction based only on protein sequence information. Ten-fold cross-validation indicated that the best DNN-PPI model with CT achieved 97.65% accuracy, 98.96% recall and a 98.51% area under the curve (AUC). The model achieves a prediction accuracy of 94.20–97.10% on other external datasets. These results suggest the high validity of the proposed algorithm across various species.
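As an illustration of the conjoint triad (CT) descriptor, the Python sketch below maps each residue to one of seven classes and counts every window of three consecutive classes, giving a 7 × 7 × 7 = 343-dimensional vector. The seven-class grouping is the one commonly used for CT in the literature; the abstract does not state the grouping or the normalization used by DNN-PPI, so both, along with the function name, should be read as assumptions.

```python
# Seven residue classes commonly used for the conjoint triad descriptor (assumption).
CLASSES = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
RESIDUE_TO_CLASS = {aa: idx for idx, group in enumerate(CLASSES) for aa in group}
N_CLASSES = len(CLASSES)


def conjoint_triad(sequence):
    """Return 343 frequencies of class triads over consecutive residues."""
    counts = [0] * (N_CLASSES ** 3)
    classes = [RESIDUE_TO_CLASS.get(aa) for aa in sequence]
    for a, b, c in zip(classes, classes[1:], classes[2:]):
        if a is None or b is None or c is None:
            continue  # skip windows containing a non-standard residue
        counts[a * N_CLASSES ** 2 + b * N_CLASSES + c] += 1
    total = sum(counts)
    # Plain frequency normalization; other schemes (e.g. min-max) are also used.
    return [n / total for n in counts] if total else [0.0] * len(counts)


features = conjoint_triad("AGVC")  # class windows: (0, 0, 0) and (0, 0, 6)
print(len(features), features[0], features[6])
```

Grouping residues before counting keeps the vector length fixed at 343 regardless of sequence length, which is what lets a fixed-input model such as a DNN consume proteins of different sizes.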


2020 ◽  
Author(s):  
Dhananjay Kimothi ◽  
Pravesh Biyani ◽  
James M. Hogan ◽  
Melissa J. Davis

Abstract Background: Protein-Protein Interactions (PPIs) are a crucial mechanism underpinning the function of the cell. Predicting the likely relationship between a pair of proteins is thus an important problem in bioinformatics, and a wide range of machine-learning based methods have been proposed for this task. Their success is heavily dependent on the construction of the feature vectors, with most using a set of physicochemical properties derived from the sequence. Few work directly with the sequence itself. Recent work on embedding sequences in a low dimensional vector space has shown the utility of this approach for tasks such as protein classification and sequence search. In this paper, we extend these ideas to the PPI prediction task, making inferences from the pair instead of the individual sequences. Methods: We propose a generic PPI prediction framework that comprises a representation learning module for feature construction and a binary classifier. To construct the feature vector for a protein pair, we concatenate the distributed representations (embeddings) learned for the sequences of the constituent proteins. Each protein pair is represented as a 200-dimensional feature vector. To learn the embedding of a sequence, we use two established methods, Seq2Vec and BioVec, and we also introduce a novel feature construction method that we call SuperVecNW. The embeddings generated through SuperVecNW capture network information to some extent, along with the contextual information present in the sequences. Finally, we feed these feature vectors into a random forest classifier to predict protein pair interactions. Results: To show the efficacy of our proposed approach, we evaluate its performance on human and yeast PPI datasets, benchmarking against the established methods. Furthermore, we test our approach on three well known networks: the one-core network (CD9), the multiple-core network (Ras-Raf-Mek-Erk-Elk-Srf pathway), and the cross-connection network (Wnt-related network), and demonstrate the improvement in predicting PPIs compared to the other methods. Conclusions: Naive low dimensional sequence embeddings provide better results on the protein-protein interaction prediction task than most of the alternative representations based on other physicochemical properties. These methods require computationally modest effort due to their lower dimensionality. Advanced representation learning methods that enrich the sequence embeddings with meta information are expected to improve the results further.
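The pair-level feature construction described in the Methods can be sketched in a few lines of Python. The abstract gives the 200-dimensional pair vector and says it is formed by concatenation; the per-protein dimension of 100 is inferred from that, and the random toy embeddings and function names below are hypothetical placeholders for embeddings a method such as Seq2Vec would learn.

```python
import random

EMBEDDING_DIM = 100  # inferred: two concatenated embeddings give 200 dimensions


def toy_embedding(seed, dim=EMBEDDING_DIM):
    """Placeholder for a learned sequence embedding (random values, for illustration only)."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def pair_feature(emb_a, emb_b):
    """Concatenate two per-protein embeddings into one pair-level feature vector."""
    if len(emb_a) != len(emb_b):
        raise ValueError("both embeddings must have the same dimension")
    return list(emb_a) + list(emb_b)


protein_a = toy_embedding(seed=1)
protein_b = toy_embedding(seed=2)
features = pair_feature(protein_a, protein_b)
print(len(features))
```

Note that concatenation is order-dependent: the vector for (A, B) differs from the vector for (B, A). How the authors handle pair ordering is not stated in the abstract. In practice the pair vectors and their interaction labels would be passed to a random forest implementation such as scikit-learn's `RandomForestClassifier`.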


2019 ◽  
Author(s):  
Hassan Kané ◽  
Mohamed Coulibali ◽  
Ali Abdalla ◽  
Pelkins Ajanoh

Abstract Computational methods that infer the function of proteins are key to understanding life at the molecular level. In recent years, representation learning has emerged as a powerful paradigm to discover new patterns among entities as varied as images, words, speech, and molecules. In typical representation learning, there is only one source of data or one level of abstraction at which the learned representation occurs. However, proteins can be described by their primary, secondary, tertiary, and quaternary structure or even as nodes in protein-protein interaction networks. Given that protein function is an emergent property of all these levels of interactions, in this work we learn joint representations from both amino acid sequence and multilayer networks representing tissue-specific protein-protein interactions. We show that simple machine learning models trained on these hybrid representations outperform existing network-based methods on the task of tissue-specific protein function prediction on 13 out of 13 tissues. Furthermore, these representations outperform existing ones by 14% on average.


2019 ◽  
Author(s):  
Yi Guo ◽  
Xiang Chen

Abstract Motivation: Almost all critical functions and processes in cells are sustained by the cellular networks of protein-protein interactions (PPIs); understanding these networks is therefore crucial in the investigation of biological systems. Despite all past efforts, we still lack high-quality PPI data for constructing the networks, which makes it challenging to study the functions of protein associations. High-throughput experimental techniques have produced abundant data for systematically studying the cellular networks of a biological system and for developing computational methods for PPI identification. Results: We have developed a deep learning-based framework, named iPPI, for accurately predicting PPIs on a proteome-wide scale based only on sequence information. iPPI integrates the amino acid properties and compositions of protein sequences into a unified prediction framework using a hybrid deep neural network. Extensive tests demonstrated that iPPI can greatly outperform state-of-the-art prediction methods in identifying PPIs. In addition, the iPPI prediction score can be related to the strength of protein-protein binding affinity, which further shows the biological relevance of our deep learning framework for identifying PPIs. Availability and Implementation: iPPI is available as open-source software and can be downloaded from https://github.com/model-lab/[email protected]

