Unsupervised protein embeddings outperform hand-crafted sequence and structure features at predicting molecular function

Author(s):  
Amelia Villegas-Morcillo ◽  
Stavros Makrodimitris ◽  
Roeland C H J van Ham ◽  
Angel M Gomez ◽  
Victoria Sanchez ◽  
...  

Abstract. Motivation: Protein function prediction is a difficult bioinformatics problem. Many recent methods use deep neural networks to learn complex sequence representations and predict function from these. Deep supervised models require a lot of labeled training data which are not available for this task. However, a very large amount of protein sequences without functional labels is available. Results: We applied an existing deep sequence model that had been pretrained in an unsupervised setting on the supervised task of protein molecular function prediction. We found that this complex feature representation is effective for this task, outperforming hand-crafted features such as one-hot encoding of amino acids, k-mer counts, secondary structure and backbone angles. Also, it partly negates the need for complex prediction models, as a two-layer perceptron was enough to achieve competitive performance in the third Critical Assessment of Functional Annotation benchmark. We also show that combining this sequence representation with protein 3D structure information does not lead to performance improvement, hinting that 3D structure is also potentially learned during the unsupervised pretraining. Availability and implementation: Implementations of all used models can be found at https://github.com/stamakro/GCN-for-Structure-and-Function. Supplementary information: Supplementary data are available at Bioinformatics online.
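
The hand-crafted baselines named in the abstract above (one-hot encoding of amino acids and k-mer counts) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `one_hot` and `kmer_counts`, the 20-letter alphabet ordering, and the default `k=2` are assumptions made for illustration only.

```python
from collections import Counter
from itertools import product

# Assumed alphabet: the 20 standard amino acids in alphabetical one-letter order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(sequence):
    """Return an L x 20 matrix (list of lists) with at most one 1 per row."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    rows = []
    for aa in sequence:
        row = [0] * len(AMINO_ACIDS)
        if aa in index:  # non-standard residues (e.g. X) stay all-zero
            row[index[aa]] = 1
        rows.append(row)
    return rows


def kmer_counts(sequence, k=2):
    """Return a fixed-length vector (20**k entries) of overlapping k-mer counts."""
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    vocabulary = ("".join(p) for p in product(AMINO_ACIDS, repeat=k))
    return [counts.get(kmer, 0) for kmer in vocabulary]
```

Note the difference in shape: the one-hot matrix grows with sequence length, whereas the k-mer vector has a fixed size regardless of length, which is one reason k-mer counts are convenient inputs for simple classifiers.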

2020 ◽  
Author(s):  
Amelia Villegas-Morcillo ◽  
Stavros Makrodimitris ◽  
Roeland C.H.J. van Ham ◽  
Angel M. Gomez ◽  
Victoria Sanchez ◽  
...  

Abstract. Motivation: Protein function prediction is a difficult bioinformatics problem. Many recent methods use deep neural networks to learn complex sequence representations and predict function from these. Deep supervised models require a lot of labeled training data which are not available for this task. However, a very large amount of protein sequences without functional labels is available. Results: We applied an existing deep sequence model that had been pre-trained in an unsupervised setting on the supervised task of protein function prediction. We found that this complex feature representation is effective for this task, outperforming hand-crafted features such as one-hot encoding of amino acids, k-mer counts, secondary structure and backbone angles. Also, it partly negates the need for deep prediction models, as a two-layer perceptron was enough to achieve state-of-the-art performance in the third Critical Assessment of Functional Annotation benchmark. We also show that combining this sequence representation with protein 3D structure information does not lead to performance improvement, hinting that three-dimensional structure is also potentially learned during the unsupervised pre-training. Availability: Implementations of all used models can be found at https://github.com/stamakro/GCN-for-Structure-and-Function. Contact: ameliavm@ugr.es. Supplementary information: Supplementary data are available online.


2021 ◽  
Author(s):  
Irene van den Bent ◽  
Stavros Makrodimitris ◽  
Marcel Reinders

Abstract. Computationally annotating proteins with a molecular function is a difficult problem that is made even harder due to the limited amount of available labelled protein training data. A recently published supervised molecular function predicting model partly circumvents this limitation by making its predictions based on the universal (i.e. task-agnostic) contextualised protein embeddings from the deep pre-trained unsupervised protein language model SeqVec. SeqVec embeddings incorporate contextual information of amino acids, thereby modelling the underlying principles of protein sequences insensitive to the context of species. We applied the existing SeqVec-based molecular function prediction model in a transfer learning task by training the model on annotated protein sequences of one training species and making predictions on the proteins of several test species with varying evolutionary distance. We show that this approach successfully generalises knowledge about protein function from one eukaryotic species to various other species, proving itself an effective method for molecular function prediction in inadequately annotated species from understudied taxonomic kingdoms. Furthermore, we submitted the performance of our SeqVec-based prediction models to detailed characterisation, first to advance the understanding of protein language models and second to determine areas of improvement.

Author summary: Proteins are diverse molecules that regulate all processes in biology. The field of synthetic biology aims to understand these protein functions to solve problems in medicine, manufacturing, and agriculture. Unfortunately, for many proteins only their amino acid sequence is known whereas their function remains unknown. Only a few species have been well-studied such as mouse, human and yeast. Hence, we need to increase knowledge on protein functions. Doing so is, however, complicated as determining protein functions experimentally is time-consuming, expensive, and technically limited. Computationally predicting protein functions offers a faster and more scalable approach but is hampered as it requires much data to design accurate function prediction algorithms. Here, we show that it is possible to computationally generalize knowledge on protein function from one well-studied training species to another test species. Additionally, we show that the quality of these protein function predictions depends on how structurally similar the proteins are between the species. Advantageously, the predictors require only the annotations of proteins from the training species and mere amino acid sequences of test species which may particularly benefit the function prediction of species from understudied taxonomic kingdoms such as the Plantae, Protozoa and Chromista.


2021 ◽  
Vol 17 ◽  
pp. 117693432110626
Author(s):  
Irene van den Bent ◽  
Stavros Makrodimitris ◽  
Marcel Reinders

Computationally annotating proteins with a molecular function is a difficult problem that is made even harder due to the limited amount of available labeled protein training data. Unsupervised protein embeddings partly circumvent this limitation by learning a universal protein representation from many unlabeled sequences. Such embeddings incorporate contextual information of amino acids, thereby modeling the underlying principles of protein sequences insensitive to the context of species. We used an existing pre-trained protein embedding method and subjected its molecular function prediction performance to detailed characterization, first to advance the understanding of protein language models, and second to determine areas of improvement. Then, we applied the model in a transfer learning task by training a function predictor based on the embeddings of annotated protein sequences of one training species and making predictions on the proteins of several test species with varying evolutionary distance. We show that this approach successfully generalizes knowledge about protein function from one eukaryotic species to various other species, outperforming both an alignment-based and a supervised-learning-based baseline. This implies that such a method could be effective for molecular function prediction in inadequately annotated species from understudied taxonomic kingdoms.


2022 ◽  
Author(s):  
Maxat Kulmanov ◽  
Robert Hoehndorf

Motivation: Protein functions are often described using the Gene Ontology (GO), which is an ontology consisting of over 50,000 classes and a large set of formal axioms. Predicting the functions of proteins is one of the key challenges in computational biology, and a variety of machine learning methods have been developed for this purpose. However, these methods usually require a significant amount of training data and cannot make predictions for GO classes that have only a few or no experimental annotations. Results: We developed DeepGOZero, a machine learning model which improves predictions for functions with no or only a small number of annotations. To achieve this goal, we rely on a model-theoretic approach for learning ontology embeddings and combine it with neural networks for protein function prediction. DeepGOZero can exploit formal axioms in the GO to make zero-shot predictions, i.e., predict protein functions even if not a single protein in the training phase was associated with that function. Furthermore, the zero-shot prediction method employed by DeepGOZero is generic and can be applied whenever associations with ontology classes need to be predicted. Availability: http://github.com/bio-ontology-research-group/deepgozero


2018 ◽  
Vol 35 (5) ◽  
pp. 753-759 ◽  
Author(s):  
Aashish Jain ◽  
Daisuke Kihara

Abstract. Motivation: Function annotation of proteins is fundamental in contemporary biology across fields including genomics, molecular biology, biochemistry, systems biology and bioinformatics. Function prediction is indispensable in providing clues for interpreting omics-scale data as well as in assisting biologists to build hypotheses for designing experiments. As sequencing genomes is now routine due to the rapid advancement of sequencing technologies, computational protein function prediction methods have become increasingly important. A conventional method of annotating a protein sequence is to transfer functions from top hits of a homology search; however, this approach has substantial shortcomings, including low coverage in genome annotation. Results: Here we have developed Phylo-PFP, a new sequence-based protein function prediction method, which mines functional information from a broad range of similar sequences, including those with a low sequence similarity identified by a PSI-BLAST search. To evaluate functional similarity between identified sequences and the query protein more accurately, Phylo-PFP reranks retrieved sequences by considering their phylogenetic distance. Compared to its predecessor, PFP, which was among the top-ranked methods in the second round of the Critical Assessment of Functional Annotation (CAFA2), Phylo-PFP demonstrated substantial improvement in prediction accuracy. Phylo-PFP was further shown to outperform the prediction programs that were ranked top in CAFA2. Availability and implementation: The Phylo-PFP web server is available at http://kiharalab.org/phylo_pfp.php. Supplementary information: Supplementary data are available at Bioinformatics online.


Author(s):  
Maxat Kulmanov ◽  
Robert Hoehndorf

Abstract. Motivation: Protein function prediction is one of the major tasks of bioinformatics that can help with a wide range of biological problems such as understanding disease mechanisms or finding drug targets. Many methods are available for predicting protein functions from sequence-based features, protein–protein interaction networks, protein structure or literature. However, other than sequence, most of the features are difficult to obtain or not available for many proteins, thereby limiting their scope. Furthermore, the performance of sequence-based function prediction methods is often lower than that of methods that incorporate multiple features, and predicting protein functions may require a lot of time. Results: We developed DeepGOPlus, a novel method for predicting protein functions from sequence alone which combines a deep convolutional neural network (CNN) model with sequence-similarity-based predictions. Our CNN model scans the sequence for motifs which are predictive for protein functions and combines this with functions of similar proteins (if available). We evaluate the performance of DeepGOPlus using the CAFA3 evaluation measures and achieve an Fmax of 0.390, 0.557 and 0.614 for BPO, MFO and CCO evaluations, respectively. These results would have made DeepGOPlus one of the three best predictors in CCO and the second best performing method in the BPO and MFO evaluations. We also compare DeepGOPlus with state-of-the-art methods such as DeepText2GO and GOLabeler on another dataset. DeepGOPlus can annotate around 40 protein sequences per second on common hardware, thereby making fast and accurate function predictions available for a wide range of proteins. Availability and implementation: http://deepgoplus.bio2vec.net/. Supplementary information: Supplementary data are available at Bioinformatics online.
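
The Fmax measure reported above is the protein-centric maximum F-score used in the CAFA evaluations: for each score threshold, precision is averaged over proteins with at least one prediction at that threshold, recall is averaged over all proteins, and the best resulting F-score is kept. The Python sketch below illustrates that definition under simplifying assumptions. The function name `fmax`, the input layout, and the 0.01 threshold grid are hypothetical choices, and the propagation of terms to their ontology ancestors that real evaluations perform is omitted.

```python
def fmax(predictions, truths, thresholds=None):
    """Protein-centric maximum F-score.

    predictions: list of dicts mapping a term ID to a score in [0, 1].
    truths: list of non-empty sets of true term IDs, aligned with predictions.
    """
    if thresholds is None:
        thresholds = [i / 100 for i in range(1, 101)]  # assumed grid: 0.01 .. 1.00
    best = 0.0
    for t in thresholds:
        precisions = []   # only proteins with at least one prediction at t
        recall_sum = 0.0  # accumulated over all proteins
        for scores, true_terms in zip(predictions, truths):
            predicted = {term for term, s in scores.items() if s >= t}
            hits = len(predicted & true_terms)
            if predicted:
                precisions.append(hits / len(predicted))
            recall_sum += hits / len(true_terms)
        if not precisions:
            continue  # no protein has a prediction at this threshold
        precision = sum(precisions) / len(precisions)
        recall = recall_sum / len(truths)
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best
```

Because precision and recall are averaged over different sets of proteins, a predictor that abstains on hard proteins at high thresholds is penalised through recall rather than precision.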


2013 ◽  
pp. 386-399 ◽  
Author(s):  
Alper Küçükural ◽  
Andras Szilagyi ◽  
O. Ugur Sezerman ◽  
Yang Zhang

To annotate the biological function of a protein molecule, it is essential to have information on its 3D structure. Many successful methods for function prediction are based on determining structurally conserved regions, because functional residues have been shown to be more conserved than others in protein evolution. Since the 3D conformation of a protein can be represented by a contact map graph, graph matching algorithms are often employed to identify the conserved residues in weakly homologous protein pairs. However, the general graph matching algorithm is computationally expensive because graph similarity searching is essentially an NP-hard problem. Parallel implementations of graph matching are often exploited to speed up the process. In this chapter, the authors review theoretical and computational approaches of graph theory and the recently developed graph matching algorithms for protein function prediction.
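
The contact map graph mentioned above can be illustrated with a short Python sketch that treats residues as nodes and connects two residues when their alpha-carbon atoms lie within a distance cutoff. This is only a sketch of the general representation, not the chapter's method: the function name `contact_map`, the 8 Å default cutoff, and the `min_separation` parameter are assumptions chosen for illustration.

```python
import math


def contact_map(ca_coords, cutoff=8.0, min_separation=1):
    """Return the contact graph as a set of edges (i, j) with i < j.

    ca_coords: list of (x, y, z) alpha-carbon coordinates, one per residue.
    cutoff: assumed distance threshold in angstroms for calling a contact.
    min_separation: minimum index difference |i - j| for a pair to be considered.
    """
    n = len(ca_coords)
    edges = set()
    for i in range(n):
        for j in range(i + min_separation, n):
            if math.dist(ca_coords[i], ca_coords[j]) <= cutoff:
                edges.add((i, j))
    return edges
```

Comparing two such edge sets for common substructure is where the graph matching step, and its computational cost, enters.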


2021 ◽  
Author(s):  
Boqiao Lai ◽  
Jinbo Xu

Experimental protein function annotation does not scale with the fast-growing sequence databases. Only a tiny fraction (<0.1%) of protein sequences in UniProtKB has experimentally determined functional annotations. Computational methods may predict protein function in a high-throughput way, but their accuracy is not very satisfactory. Based upon recent breakthroughs in protein structure prediction and protein language models, we develop GAT-GO, a graph attention network (GAT) method that may substantially improve protein function prediction by leveraging predicted inter-residue contact graphs and protein sequence embedding. Our experimental results show that GAT-GO greatly outperforms the latest sequence- and structure-based deep learning methods. On the PDB-mmseqs test set, where the training and test proteins share <15% sequence identity, GAT-GO yields Fmax (maximum F-score) of 0.508, 0.416, 0.501 and AUPRC (area under the precision-recall curve) of 0.427, 0.253, 0.411 for the MFO, BPO, CCO ontology domains, respectively, much better than the homology-based method BLAST (Fmax 0.117, 0.121, 0.207 and AUPRC 0.120, 0.120, 0.163). On the PDB-cdhit test set, where the training and test proteins share higher sequence identity, GAT-GO obtains Fmax 0.637, 0.501, 0.542 for the MFO, BPO, CCO ontology domains, respectively, and AUPRC 0.662, 0.384, 0.481, significantly exceeding the just-published graph convolution method DeepFRI, which has Fmax 0.542, 0.425, 0.424 and AUPRC 0.313, 0.159, 0.193.

