scholarly journals DeepGOZero: Improving protein function prediction from sequence and zero-shot learning based on ontology axioms

2022 ◽  
Author(s):  
Maxat Kulmanov ◽  
Robert Hoehndorf

Motivation: Protein functions are often described using the Gene Ontology (GO) which is an ontology consisting of over 50,000 classes and a large set of formal axioms. Predicting the functions of proteins is one of the key challenges in computational biology and a variety of machine learning methods have been developed for this purpose. However, these methods usually require significant amount of training data and cannot make predictions for GO classes which have only few or no experimental annotations. Results: We developed DeepGOZero, a machine learning model which improves predictions for functions with no or only a small number of annotations. To achieve this goal, we rely on a model-theoretic approach for learning ontology embeddings and combine it with neural networks for protein function prediction. DeepGOZero can exploit formal axioms in the GO to make zero-shot predictions, i.e., predict protein functions even if not a single protein in the training phase was associated with that function. Furthermore, the zero-shot prediction method employed by DeepGOZero is generic and can be applied whenever associations with ontology classes need to be predicted. Availability: http://github.com/bio-ontology-research-group/deepgozero

2019 ◽  
Author(s):  
Kai Hakala ◽  
Suwisa Kaewphan ◽  
Jari Björne ◽  
Farrokh Mehryary ◽  
Hans Moen ◽  
...  

AbstractOver the past decade, the demand for automated protein function prediction has increased due to the volume of newly sequenced proteins. In this paper, we address the function prediction task by developing an ensemble system automatically assigning Gene Ontology (GO) terms to the given input protein sequence.We develop an ensemble system which combines the GO predictions made by random forest (RF) and neural network (NN) classifiers. Both RF and NN models rely on features derived from BLAST sequence alignments, taxonomy and protein signature analysis tools. In addition, we report on experiments with a NN model that directly analyzes the amino acid sequence as its sole input, using a convolutional layer. The Swiss-Prot database is used as the training and evaluation data.In the CAFA3 evaluation, which relies on experimental verification of the functional predictions, our submitted ensemble model demonstrates competitive performance ranking among top-10 best-performing systems out of over 100 submitted systems. In this paper, we evaluate and further improve the CAFA3-submitted system. Our machine learning models together with the data pre-processing and feature generation tools are publicly available as an open source software athttps://github.com/TurkuNLP/CAFA3Author summaryUnderstanding the role and function of proteins in biological processes is fundamental for new biological discoveries. Whereas modern sequencing methods have led to a rapid growth of protein databases, the function of these sequences is often unknown and expensive to determine experimentally. This has spurred a lot of interest in predictive modelling of protein functions.We develop a machine learning system for annotating protein sequences with functional definitions selected from a vast set of predefined functions. The approach is based on a combination of neural network and random forest classifiers with features covering structural and taxonomic properties and sequence similarity. The system is thoroughly evaluated on a large set of manually curated functional annotations and shows competitive performance in comparison to other suggested approaches. We also analyze the predictions for different functional annotation and taxonomy categories and measure the importance of different features for the task. This analysis reveals that the system is particularly efficient for bacterial protein sequences.


Author(s):  
Amelia Villegas-Morcillo ◽  
Stavros Makrodimitris ◽  
Roeland C H J van Ham ◽  
Angel M Gomez ◽  
Victoria Sanchez ◽  
...  

Abstract Motivation Protein function prediction is a difficult bioinformatics problem. Many recent methods use deep neural networks to learn complex sequence representations and predict function from these. Deep supervised models require a lot of labeled training data which are not available for this task. However, a very large amount of protein sequences without functional labels is available. Results We applied an existing deep sequence model that had been pretrained in an unsupervised setting on the supervised task of protein molecular function prediction. We found that this complex feature representation is effective for this task, outperforming hand-crafted features such as one-hot encoding of amino acids, k-mer counts, secondary structure and backbone angles. Also, it partly negates the need for complex prediction models, as a two-layer perceptron was enough to achieve competitive performance in the third Critical Assessment of Functional Annotation benchmark. We also show that combining this sequence representation with protein 3D structure information does not lead to performance improvement, hinting that 3D structure is also potentially learned during the unsupervised pretraining. Availability and implementation Implementations of all used models can be found at https://github.com/stamakro/GCN-for-Structure-and-Function. Supplementary information Supplementary data are available at Bioinformatics online.


2018 ◽  
Vol 35 (5) ◽  
pp. 753-759 ◽  
Author(s):  
Aashish Jain ◽  
Daisuke Kihara

Abstract Motivation Function annotation of proteins is fundamental in contemporary biology across fields including genomics, molecular biology, biochemistry, systems biology and bioinformatics. Function prediction is indispensable in providing clues for interpreting omics-scale data as well as in assisting biologists to build hypotheses for designing experiments. As sequencing genomes is now routine due to the rapid advancement of sequencing technologies, computational protein function prediction methods have become increasingly important. A conventional method of annotating a protein sequence is to transfer functions from top hits of a homology search; however, this approach has substantial short comings including a low coverage in genome annotation. Results Here we have developed Phylo-PFP, a new sequence-based protein function prediction method, which mines functional information from a broad range of similar sequences, including those with a low sequence similarity identified by a PSI-BLAST search. To evaluate functional similarity between identified sequences and the query protein more accurately, Phylo-PFP reranks retrieved sequences by considering their phylogenetic distance. Compared to the Phylo-PFP’s predecessor, PFP, which was among the top ranked methods in the second round of the Critical Assessment of Functional Annotation (CAFA2), Phylo-PFP demonstrated substantial improvement in prediction accuracy. Phylo-PFP was further shown to outperform prediction programs to date that were ranked top in CAFA2. Availability and implementation Phylo-PFP web server is available for at http://kiharalab.org/phylo_pfp.php. Supplementary information Supplementary data are available at Bioinformatics online.


2005 ◽  
Vol 45 (supplement) ◽  
pp. S195
Author(s):  
A. Suzuki ◽  
T. Ando ◽  
A. Matsumura ◽  
H. Sakao ◽  
I. Yamato ◽  
...  

2015 ◽  
Vol 2015 ◽  
pp. 1-9
Author(s):  
Jian-Sheng Wu ◽  
Hai-Feng Hu ◽  
Shan-Cheng Yan ◽  
Li-Hua Tang

Nature often brings several domains together to form multidomain and multifunctional proteins with a vast number of possibilities. In our previous study, we disclosed that the protein function prediction problem is naturally and inherently Multi-Instance Multilabel (MIML) learning tasks. Automated protein function prediction is typically implemented under the assumption that the functions of labeled proteins are complete; that is, there are no missing labels. In contrast, in practice just a subset of the functions of a protein are known, and whether this protein has other functions is unknown. It is evident that protein function prediction tasks suffer fromweak-labelproblem; thus protein function prediction with incomplete annotation matches well with the MIML with weak-label learning framework. In this paper, we have applied the state-of-the-art MIML with weak-label learning algorithm MIMLwel for predicting protein functions in two typical real-world electricigens organisms which have been widely used in microbial fuel cells (MFCs) researches. Our experimental results validate the effectiveness of MIMLwel algorithm in predicting protein functions with incomplete annotation.


Sign in / Sign up

Export Citation Format

Share Document