Beyond Homology Transfer: Deep Learning for Automated Annotation of Proteins

Abstract Accurate annotation of protein functions is important for a profound understanding of molecular biology. A large number of proteins remain uncharacterized because of the sparsity of available supporting information. For a large set of uncharacterized proteins, the only type of information available is their amino acid sequence. In this paper, we propose DeepSeq – a deep learning architecture – that utilizes only the protein sequence information to predict its associated functions. The prediction process does not require handcrafted features; rather, the architecture automatically extracts representations from the input sequence data. Results of our experiments with DeepSeq indicate significant improvements in terms of prediction accuracy when compared with other sequence-based methods. Our deep learning model achieves an overall validation accuracy of 86.72%, with an F1 score of 71.13%. Moreover, using the automatically learned features and without any changes to DeepSeq, we successfully solved a different problem, i.e., protein function localization, with no human intervention. Finally, we discuss how this same architecture can be used to solve even more complicated problems such as prediction of 2D and 3D structure as well as protein-protein interactions.


Amalgamation of 3D structure and sequence information for protein–protein interaction prediction

Scientific Reports ◽

10.1038/s41598-020-75467-x ◽

2020 ◽

Vol 10 (1) ◽

Author(s):

Kanchan Jha ◽

Sriparna Saha

Keyword(s):

Amino Acids ◽

Deep Learning ◽

Protein Interactions ◽

3D Structure ◽

Protein Sequences ◽

Building Blocks ◽

New Drugs ◽

Sequence Information ◽

Protein Protein Interactions ◽

Structure Information

Abstract Protein is the primary building block of living organisms. It interacts with other proteins and is thereby involved in various biological processes. Protein–protein interactions (PPIs) help in predicting, and hence understanding, the functionality of proteins, the causes and growth of diseases, and the design of new drugs. However, there is a vast gap between the available protein sequences and the identified protein–protein interactions. To bridge this gap, researchers have proposed several computational methods to reveal the interactions between proteins. These methods depend only on sequence-based information of proteins. With the advancement of technology, other types of protein-related information have become available, such as 3D structure information. Deep learning techniques are now adopted successfully in various domains, including bioinformatics. The current work therefore focuses on using different modalities, such as 3D structures and sequence-based information of proteins, together with deep learning algorithms to predict PPIs. The proposed approach is divided into several phases. We first obtain several representations of proteins using their 3D coordinate information and three attributes of amino acids, the building blocks of proteins: hydropathy index, isoelectric point, and charge. A pre-trained ResNet50 model, a type of convolutional neural network, is used to extract features from these representations. Autocovariance and conjoint triad, two widely used sequence-based methods for encoding proteins, are used here as another modality. A stacked autoencoder is used to obtain a compact form of the sequence-based information. Finally, the features obtained from the different modalities are concatenated in pairs and fed into the classifier to predict labels for protein pairs. We have experimented on the human PPI dataset and the Saccharomyces cerevisiae PPI dataset and compared our results with state-of-the-art deep-learning-based classifiers. The results achieved by the proposed method are superior to those obtained by the existing methods. Extensive experiments on different datasets indicate that our approach to learning and combining features from two different modalities is useful in PPI prediction.
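The conjoint triad encoding named in this abstract can be illustrated with a minimal sketch. It is not the authors' code. It assumes the commonly used grouping of the 20 standard amino acids into seven classes and a simple min-max normalisation; the function names `conjoint_triad` and `encode_pair` are hypothetical.

```python
# Assumption: the widely used seven-class grouping of amino acids by
# dipole and side-chain volume; the abstract does not spell it out.
CLASSES = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
AA_TO_CLASS = {aa: idx for idx, group in enumerate(CLASSES) for aa in group}


def conjoint_triad(sequence):
    """Return a 343-dimensional (7 * 7 * 7) triad frequency vector.

    Each run of three consecutive residues is mapped to its triple of
    class indices and counted; counts are then min-max normalised
    (the normalisation scheme is an assumption of this sketch).
    """
    counts = [0] * (7 ** 3)
    classes = [AA_TO_CLASS.get(aa) for aa in sequence.upper()]
    for a, b, c in zip(classes, classes[1:], classes[2:]):
        if a is None or b is None or c is None:
            continue  # skip triads containing a non-standard residue
        counts[a * 49 + b * 7 + c] += 1
    lo, hi = min(counts), max(counts)
    if hi == lo:
        return [0.0] * len(counts)
    return [(x - lo) / (hi - lo) for x in counts]


def encode_pair(seq_a, seq_b):
    """Concatenate the two encodings so a classifier sees the protein pair."""
    return conjoint_triad(seq_a) + conjoint_triad(seq_b)
```

A call such as `encode_pair("MKTAYIAK", "GAVLDE")` yields a 686-dimensional vector; in the described pipeline such sequence features would be compressed by the stacked autoencoder before being combined with the structure-derived features.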


DeepGOZero: Improving protein function prediction from sequence and zero-shot learning based on ontology axioms

10.1101/2022.01.14.476325 ◽

2022 ◽

Author(s):

Maxat Kulmanov ◽

Robert Hoehndorf

Keyword(s):

Machine Learning ◽

Protein Function ◽

Protein Function Prediction ◽

Prediction Method ◽

Function Prediction ◽

Training Data ◽

Large Set ◽

Theoretic Approach ◽

Machine Learning Model ◽

Protein Functions

Motivation: Protein functions are often described using the Gene Ontology (GO), an ontology consisting of over 50,000 classes and a large set of formal axioms. Predicting the functions of proteins is one of the key challenges in computational biology, and a variety of machine learning methods have been developed for this purpose. However, these methods usually require a significant amount of training data and cannot make predictions for GO classes that have only a few or no experimental annotations. Results: We developed DeepGOZero, a machine learning model which improves predictions for functions with no or only a small number of annotations. To achieve this goal, we rely on a model-theoretic approach for learning ontology embeddings and combine it with neural networks for protein function prediction. DeepGOZero can exploit formal axioms in the GO to make zero-shot predictions, i.e., predict protein functions even if not a single protein in the training phase was associated with that function. Furthermore, the zero-shot prediction method employed by DeepGOZero is generic and can be applied whenever associations with ontology classes need to be predicted. Availability: http://github.com/bio-ontology-research-group/deepgozero


BGFE: A Deep Learning Model for ncRNA-Protein Interaction Predictions Based on Improved Sequence Information

International Journal of Molecular Sciences ◽

10.3390/ijms20040978 ◽

2019 ◽

Vol 20 (4) ◽

pp. 978 ◽

Cited By ~ 5

Author(s):

Zhao-Hui Zhan ◽

Li-Na Jia ◽

Yong Zhou ◽

Li-Ping Li ◽

Hai-Cheng Yi

Keyword(s):

Deep Learning ◽

Protein Interactions ◽

Prediction Accuracy ◽

Sparse Matrices ◽

Protein Sequences ◽

Biological Research ◽

Sequence Information ◽

Feature Extraction Method ◽

Cellular Processes ◽

High Level

The interactions between ncRNAs and proteins are critical for regulating various cellular processes in organisms, such as gene expression. However, because current experimental methods for identifying ncRNA and protein interactions are costly in both money and materials, a practical computational approach with convincing prediction accuracy is needed. In this study, based on protein sequences viewed from a biological perspective, we put forward an effective deep learning method, named BGFE, to predict ncRNA and protein interactions. Protein sequences are represented by a bi-gram probability feature extraction method applied to the Position Specific Scoring Matrix (PSSM), and ncRNA sequences are represented by k-mers sparse matrices. Furthermore, to extract hidden high-level feature information, a stacked auto-encoder network is employed with the stacked ensemble integration strategy. We evaluate the performance of the proposed method on three datasets using five-fold cross-validation, with the features classified by a random forest classifier. The experimental results clearly demonstrate the effectiveness and the prediction accuracy of our approach. In general, the proposed method is helpful for predicting ncRNA and protein interactions and provides useful guidance for future biological research.
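As a rough illustration of a k-mer representation for ncRNA sequences, the sketch below builds a normalised k-mer frequency vector. This is a simplification and not the BGFE implementation: the abstract refers to k-mers sparse matrices without giving their exact layout, and the function name `kmer_frequencies`, the default `k=3`, and the `ACGU` alphabet are assumptions.

```python
from itertools import product


def kmer_frequencies(sequence, k=3, alphabet="ACGU"):
    """Return the relative frequency of every possible k-mer.

    The vector has len(alphabet) ** k entries in lexicographic order of
    the alphabet. Windows containing symbols outside the alphabet are
    skipped. DNA-style input is accepted by mapping T to U.
    """
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = [0] * len(kmers)
    seq = sequence.upper().replace("T", "U")
    total = 0
    for i in range(len(seq) - k + 1):
        pos = index.get(seq[i:i + k])
        if pos is not None:
            counts[pos] += 1
            total += 1
    if total == 0:
        return [0.0] * len(kmers)
    return [c / total for c in counts]
```

Most entries are zero for short sequences, which is why such representations are usually described as sparse; a stacked auto-encoder like the one mentioned in the abstract would then learn a denser high-level representation from them.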


New advances in extracting and learning from protein–protein interactions within unstructured biomedical text data

Emerging Topics in Life Sciences ◽

10.1042/etls20190003 ◽

2019 ◽

Vol 3 (4) ◽

pp. 357-369

Author(s):

J. Harry Caufield ◽

Peipei Ping

Keyword(s):

Protein Interactions ◽

Protein Function ◽

Historical Context ◽

Specific Protein ◽

Basic Unit ◽

Biomedical Text ◽

Protein Protein Interactions ◽

Text Data ◽

PPI Networks ◽

Protein Functions

Abstract Protein–protein interactions, or PPIs, constitute a basic unit of our understanding of protein function. Though substantial effort has been made to organize PPI knowledge into structured databases, maintenance of these resources requires careful manual curation. Even then, many PPIs remain uncurated within unstructured text data. Extracting PPIs from experimental research supports assembly of PPI networks and highlights relationships crucial to elucidating protein functions. Isolating specific protein–protein relationships from numerous documents is technically demanding by both manual and automated means. Recent advances in the design of these methods have leveraged emerging computational developments and have demonstrated impressive results on test datasets. In this review, we discuss recent developments in PPI extraction from unstructured biomedical text. We explore the historical context of these developments, recent strategies for integrating and comparing PPI data, and their application to advancing the understanding of protein function. Finally, we describe the challenges facing the application of PPI mining to the text concerning protein families, using the multifunctional 14-3-3 protein family as an example.


Struct2Graph: A graph attention network for structure based predictions of protein-protein interactions

10.1101/2020.09.17.301200 ◽

2020 ◽

Author(s):

Mayank Baranwal ◽

Abram Magner ◽

Jacob Saldinger ◽

Emine S. Turali-Emre ◽

Shivani Kozarekar ◽

...

Keyword(s):

Protein Interactions ◽

Intracellular Signaling ◽

Metabolic Regulation ◽

Selection Process ◽

3D Structure ◽

Structural Data ◽

Sequence Information ◽

Protein Protein Interactions ◽

Attention Network ◽

Interaction Sites

Abstract Development of new methods for analysis of protein-protein interactions (PPIs) at molecular and nanometer scales gives insights into intracellular signaling pathways and will improve understanding of protein functions, as well as other nanoscale structures of biological and abiological origins. Recent advances in computational tools, particularly the ones involving modern deep learning algorithms, have been shown to complement experimental approaches for describing and rationalizing PPIs. However, most of the existing works on PPI predictions use protein-sequence information, and thus have difficulties in accounting for the three-dimensional organization of the protein chains. In this study, we address this problem and describe a PPI analysis method based on a graph attention network, named Struct2Graph, for identifying PPIs directly from the structural data of folded protein globules. Our method is capable of predicting the PPI with an accuracy of 98.89% on the balanced set consisting of an equal number of positive and negative pairs. On the unbalanced set with the ratio of 1:10 between positive and negative pairs, Struct2Graph achieves a five-fold cross validation average accuracy of 99.42%. Moreover, unsupervised prediction of the interaction sites by Struct2Graph for phenol-soluble modulins is found to be in concordance with the previously reported binding sites for this family.

Author summary PPIs are the central part of signal transduction, metabolic regulation, environmental sensing, and cellular organization. Despite their success, most strategies to decode PPIs use sequence-based approaches, which do not generalize to broader classes of chemical compounds of similar scale as proteins that are equally capable of forming complexes with proteins but are not based on amino acids, and thus lack an equivalent sequence-based representation. Here, we address the problem of prediction of PPIs using a first of its kind, 3D structure based graph attention network (available at https://github.com/baranwa2/Struct2Graph). Beyond its excellent prediction performance, the novel mutual attention mechanism provides insights into likely interaction sites through its knowledge selection process in a completely unsupervised manner.
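To make the idea of a structure-based graph concrete, the following sketch turns per-residue 3D coordinates into a contact graph of the kind a graph attention network could consume. It is an illustration only: the abstract does not state how Struct2Graph builds its graphs, so the distance criterion, the 8.0 Å `cutoff`, and the function name `contact_graph` are all assumptions.

```python
import math


def contact_graph(coords, cutoff=8.0):
    """Build an undirected residue contact graph as an adjacency list.

    coords: list of (x, y, z) tuples, one representative atom per residue.
    Two residues are connected when their Euclidean distance is at most
    `cutoff` (the value and the criterion are assumptions of this sketch).
    """
    n = len(coords)
    adjacency = {i: [] for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) <= cutoff:
                adjacency[i].append(j)
                adjacency[j].append(i)
    return adjacency
```

Because the graph depends only on atomic positions, the same construction would apply to molecules that have no amino acid sequence, which is the generalisation argument made in the author summary.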


Protein SUMOylation is crucial for phagocytosis in Entamoeba histolytica trophozoites

10.1101/2021.02.01.429131 ◽

2021 ◽

Author(s):

Mitzi Díaz-Hernández ◽

Rosario Javier Reyna ◽

Izaid Sotto-Ortega ◽

Guillermina García-Rivera ◽

Maricela Sarita Montaño ◽

...

Keyword(s):

Entamoeba Histolytica ◽

Protein Interactions ◽

Posttranslational Modifications ◽

Covalent Binding ◽

3D Structure ◽

Fine Tuning ◽

Binding Motif ◽

C Terminus ◽

Protein Functions ◽

Target Molecules

Abstract During phagocytosis, a key event in the virulence of the protozoan Entamoeba histolytica, several molecules in concert contact the target, generate pseudopodia, and internalize and digest the ingested prey. Posttranslational modifications provide proteins the timing and signaling to intervene in these processes. SUMOylation is a posttranslational modification that in several systems grants a fine tuning for protein functions, protein interactions and cellular location, but it has not been studied in E. histolytica. In this paper, we characterized the E. histolytica SUMO gene and its product (EhSUMO) and elucidated the EhSUMO 3D-structure. Furthermore, we studied the relevance of SUMOylation in phagocytosis, particularly in its association with EhADH (an ALIX family protein) and EhVps32 (a protein of the ESCRT-III complex), both involved in phagocytosis. Our results indicated that EhSUMO has an extended N-terminus that differentiates other SUMO from ubiquitin. It also presents the GG residues at the C-terminus and the ΨKXE/D binding motif, both involved in target protein contact. Additionally, the E. histolytica genome possesses the enzymes belonging to the SUMOylation-deSUMOylation machineries. Confocal microscopy assays using α-EhSUMO antibodies disclosed a remarkable membrane activity with convoluted and changing structures in trophozoites during erythrophagocytosis. SUMOylated proteins appeared in pseudopodia, phagocytic channels, and around the adhered and ingested erythrocytes. Docking analysis predicted interaction of EhSUMO with EhADH, and immunoprecipitation and immunofluorescence assays revealed that the EhADH-EhSUMO association increased during phagocytosis, whereas the EhVps32-EhSUMO interaction appeared stronger from basal conditions. In EhSUMO knocked down trophozoites, the bizarre membranous structures disappeared, and EhSUMO interaction with EhADH and EhVps32 diminished. Our results provide evidence for the presence of a SUMO gene in E. histolytica and for the relevance of SUMOylation during phagocytosis.

Author's Abstract Phagocytosis is one of the main functions that Entamoeba histolytica trophozoites carry out during the invasion of the host. Many proteins are involved in this fascinating event, in which the plasma membrane undergoes multiple and rapid changes. Posttranslational modifications activate proteins at the precise time that they must get involved. SUMOylation, which consists of the non-covalent binding of SUMO protein with target molecules, is one of the main changes undergone by proteins in order to enable them to participate in cellular functions. SUMOylation had not been studied in E. histolytica nor in phagocytosis, and our working hypothesis is that this event is deeply engaged in the ingestion of target molecules and cells. The results of this paper prove the presence of an intronless bona fide EhSUMO gene encoding a predicted 12.6 kDa protein that is actively involved in phagocytosis. Silencing of the EhSUMO gene affected the rate of phagocytosis and interfered with the function of EhADH and EhVps32, two proteins involved in phagocytosis, strongly supporting the importance of SUMOylation in this event.


Signaling interaction link prediction using deep graph neural networks integrating protein-protein interactions and omics data

10.1101/2020.12.23.424230 ◽

2020 ◽

Author(s):

Jiarui Feng ◽

Amanda Zeng ◽

Yixin Chen ◽

Philip Payne ◽

Fuhai Li

Keyword(s):

Deep Learning ◽

Signaling Pathways ◽

Protein Interactions ◽

Computational Models ◽

Tumor Development ◽

Learning Model ◽

Signaling Cascades ◽

Large Set ◽

Protein Protein Interactions ◽

Deep Learning Model

Abstract Uncovering signaling links or cascades among proteins that potentially regulate tumor development and drug response is one of the most critical and challenging tasks in cancer molecular biology. Inhibition of the targets on the core signaling cascades can be effective as novel cancer treatment regimens. However, signaling cascade inference remains an open problem, and there is a lack of effective computational models. The widely used gene co-expression network analysis (which yields no direct signaling cascades) and shortest-path based protein-protein interaction (PPI) network analysis (which yields too many interactions and does not consider the sparsity of signaling cascades) were not specifically designed to predict direct and sparse signaling cascades. To resolve these challenges, we proposed a novel deep learning model, deepSignalingLinkNet, to predict signaling cascades by integrating transcriptomics data and copy number data of a large set of cancer samples with the protein-protein interactions (PPIs) via a novel deep graph neural network model. Different from the existing models, the proposed deep learning model was trained using the curated KEGG signaling pathways to identify the informative omics and PPI topology features in a data-driven manner to predict the potential signaling cascades. The validation results indicated the feasibility of signaling cascade prediction using the proposed deep learning models. Moreover, the trained model can potentially predict the signaling cascades among new proteins by transferring the patterns learned on the curated signaling pathways. The code is available at: https://github.com/fuhaililab/deepSignalingPathwayPrediction.


An integration of deep learning with feature embedding for protein–protein interaction prediction

PeerJ ◽

10.7717/peerj.7126 ◽

2019 ◽

Vol 7 ◽

pp. e7126 ◽

Cited By ~ 7

Author(s):

Yu Yao ◽

Xiuquan Du ◽

Yanyu Diao ◽

Huaixu Zhu

Keyword(s):

Deep Learning ◽

Drug Discovery ◽

Protein Interactions ◽

Protein Function ◽

Molecular Mechanisms ◽

Protein Protein Interactions ◽

Model Combining ◽

Representation Method ◽

Structure Knowledge ◽

Matthews Correlation Coefficient

Protein–protein interactions are closely related to protein function and drug discovery. Hence, accurately identifying protein–protein interactions will help us to understand the underlying molecular mechanisms and significantly facilitate drug discovery. However, the majority of existing computational methods for protein–protein interaction prediction focus on feature extraction and feature combination, and there have been limited gains from the state-of-the-art models. In this work, a new residue representation method named Res2vec is designed for protein sequence representation. Residue representations obtained by Res2vec describe residue-residue interactions from raw sequence more precisely and supply more effective inputs for the downstream deep learning model. Combining effective feature embedding with powerful deep learning techniques, our method provides a general computational pipeline to infer protein–protein interactions, even when protein structure knowledge is entirely unavailable. The proposed method DeepFE-PPI is evaluated on the S. cerevisiae and human datasets. The experimental results show that DeepFE-PPI achieves 94.78% (accuracy), 92.99% (recall), 96.45% (precision), 89.62% (Matthews correlation coefficient, MCC) and 98.71% (accuracy), 98.54% (recall), 98.77% (precision), 97.43% (MCC), respectively. In addition, we also evaluate the performance of DeepFE-PPI on five independent species datasets and all the results are superior to the existing methods. The comparisons show that DeepFE-PPI is capable of predicting protein–protein interactions by a novel residue representation method and a deep learning classification framework at an acceptable level of accuracy. The codes along with instructions to reproduce this work are available from https://github.com/xal2019/DeepFE-PPI.
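The four metrics reported in this abstract follow from the confusion matrix of a binary classifier. The sketch below uses the standard textbook definitions; the function name `binary_metrics` and the example counts are hypothetical and are not taken from the paper.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Compute accuracy, recall, precision and MCC from confusion counts.

    tp/fp/tn/fn are the numbers of true positives, false positives,
    true negatives and false negatives. Undefined ratios (zero
    denominators) are reported as 0.0 by convention in this sketch.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "recall": recall,
        "precision": precision,
        "mcc": mcc,
    }
```

MCC uses all four cells of the confusion matrix, so it stays informative when positive and negative pairs are unevenly represented, which is one reason it is commonly reported alongside accuracy in PPI prediction.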


Exploiting protein structure data to explore the evolution of protein function and biological complexity

Philosophical Transactions of the Royal Society B Biological Sciences ◽

10.1098/rstb.2005.1801 ◽

2006 ◽

Vol 361 (1467) ◽

pp. 425-440 ◽

Cited By ~ 17

Author(s):

Russell L Marsden ◽

Juan A.G Ranea ◽

Antonio Sillero ◽

Oliver Redfern ◽

Corin Yeats ◽

...

Keyword(s):

Protein Interactions ◽

Protein Function ◽

Active Sites ◽

Sequence Data ◽

Protein Biosynthesis ◽

X Ray Crystallography ◽

Metabolism Regulation ◽

Domain Structures ◽

Universal Domain ◽

Complete Sequencing

New directions in biology are being driven by the complete sequencing of genomes, which has given us the protein repertoires of diverse organisms from all kingdoms of life. In tandem with this accumulation of sequence data, worldwide structural genomics initiatives, advanced by the development of improved technologies in X-ray crystallography and NMR, are expanding our knowledge of structural families and increasing our fold libraries. Methods for detecting remote sequence similarities have also been made more sensitive and this means that we can map domains from these structural families onto genome sequences to understand how these families are distributed throughout the genomes and reveal how they might influence the functional repertoires and biological complexities of the organisms. We have used robust protocols to assign sequences from completed genomes to domain structures in the CATH database, allowing up to 60% of domain sequences in these genomes, depending on the organism, to be assigned to a domain family of known structure. Analysis of the distribution of these families throughout bacterial genomes identified more than 300 universal families, some of which had expanded significantly in proportion to genome size. These highly expanded families are primarily involved in metabolism and regulation and appear to make major contributions to the functional repertoire and complexity of bacterial organisms. When comparisons are made across all kingdoms of life, we find a smaller set of universal domain families (approx. 140), of which families involved in protein biosynthesis are the largest conserved component. Analysis of the behaviour of other families reveals that some (e.g. those involved in metabolism, regulation) have remained highly innovative during evolution, making it harder to trace their evolutionary ancestry. 
Structural analyses of metabolic families provide some insights into the mechanisms of functional innovation, which include changes in domain partnerships and significant structural embellishments leading to modulation of active sites and protein interactions.


DeepTrio: Variant Calling in Families Using Deep Learning

10.1101/2021.04.05.438434 ◽

2021 ◽

Author(s):

Alexey Kolesnikov ◽

Sidharth Goel ◽

Maria Nattestad ◽

Taedong Yun ◽

Gunjan Baid ◽

...

Keyword(s):

Deep Learning ◽

De Novo ◽

Sequence Data ◽

Genetic Diseases ◽

Variant Calling ◽

Training Data ◽

Sequencing Error ◽

Sequence Information ◽

Genome Context ◽

Parental Inheritance

Every human inherits one copy of the genome from their mother and another from their father. Parental inheritance helps us understand the transmission of traits and genetic diseases, which often involve de novo variants and rare recessive alleles. Here we present DeepTrio, which learns to analyze child-mother-father trios from the joint sequence information, without explicit encoding of inheritance priors. DeepTrio learns how to weigh sequencing error, mapping error, and de novo rates and genome context directly from the sequence data. DeepTrio has higher accuracy on both Illumina and PacBio HiFi data when compared to DeepVariant. Improvements are especially pronounced at lower coverages (with 20x DeepTrio roughly equivalent to 30x DeepVariant). As DeepTrio learns directly from data, we also demonstrate extensions to exome calling solely by changing the training data. DeepTrio includes pre-trained models for Illumina WGS, Illumina exome, and PacBio HiFi.
