DeepGOZero: Improving protein function prediction from sequence and zero-shot learning based on ontology axioms

10.1101/2022.01.14.476325 ◽

2022 ◽

Author(s):

Maxat Kulmanov ◽

Robert Hoehndorf

Keyword(s):

Machine Learning ◽

Protein Function ◽

Protein Function Prediction ◽

Prediction Method ◽

Function Prediction ◽

Training Data ◽

Large Set ◽

Theoretic Approach ◽

Machine Learning Model ◽

Protein Functions

Motivation: Protein functions are often described using the Gene Ontology (GO), an ontology consisting of over 50,000 classes and a large set of formal axioms. Predicting the functions of proteins is one of the key challenges in computational biology, and a variety of machine learning methods have been developed for this purpose. However, these methods usually require a significant amount of training data and cannot make predictions for GO classes that have only a few or no experimental annotations. Results: We developed DeepGOZero, a machine learning model that improves predictions for functions with no or only a small number of annotations. To achieve this goal, we rely on a model-theoretic approach for learning ontology embeddings and combine it with neural networks for protein function prediction. DeepGOZero can exploit formal axioms in the GO to make zero-shot predictions, i.e., predict protein functions even if not a single protein in the training phase was associated with that function. Furthermore, the zero-shot prediction method employed by DeepGOZero is generic and can be applied whenever associations with ontology classes need to be predicted. Availability: http://github.com/bio-ontology-research-group/deepgozero
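
As a toy illustration of the zero-shot idea only: suppose an ontology axiom defines a class with no training annotations as the intersection of classes that a trained model can already score. One simple way to derive a score for the unannotated class is to take the minimum of the scores of its defining classes. This is not the model-theoretic embedding approach the abstract describes; the class names, the scores, and the min rule are all assumptions made for the example.

```python
# Toy sketch: score an unannotated class from an axiom of the form
#   target EquivalentTo (part_1 and part_2 and ...)
# The min rule is an assumption for illustration, not the DeepGOZero model.

def zero_shot_score(scores, axiom_parts):
    """Return a score for a class defined as the intersection of axiom_parts."""
    missing = [c for c in axiom_parts if c not in scores]
    if missing:
        raise KeyError(f"no score available for: {missing}")
    # An intersection can be no more likely than its least likely part.
    return min(scores[c] for c in axiom_parts)


# Hypothetical scores predicted for one protein by a trained model.
predicted = {"CLASS_A": 0.9, "CLASS_B": 0.6, "CLASS_C": 0.2}

# Hypothetical axiom: CLASS_NEW is equivalent to (CLASS_A and CLASS_B).
new_score = zero_shot_score(predicted, ["CLASS_A", "CLASS_B"])
print(new_score)
```

The point of the sketch is that CLASS_NEW receives a score although no protein was ever labeled with it; only the axiom and the scores of already-trained classes are used.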

Neural network and random forest models in protein function prediction

10.1101/690271 ◽

2019 ◽

Cited By ~ 1

Author(s):

Kai Hakala ◽

Suwisa Kaewphan ◽

Jari Björne ◽

Farrokh Mehryary ◽

Hans Moen ◽

...

Keyword(s):

Neural Network ◽

Machine Learning ◽

Random Forest ◽

Protein Function ◽

Protein Function Prediction ◽

Protein Sequences ◽

Function Prediction ◽

Learning System ◽

Large Set ◽

Competitive Performance

Abstract: Over the past decade, the demand for automated protein function prediction has increased due to the volume of newly sequenced proteins. In this paper, we address the function prediction task by developing an ensemble system that automatically assigns Gene Ontology (GO) terms to a given input protein sequence. The system combines the GO predictions made by random forest (RF) and neural network (NN) classifiers. Both RF and NN models rely on features derived from BLAST sequence alignments, taxonomy and protein signature analysis tools. In addition, we report on experiments with an NN model that directly analyzes the amino acid sequence as its sole input, using a convolutional layer. The Swiss-Prot database is used as the training and evaluation data. In the CAFA3 evaluation, which relies on experimental verification of the functional predictions, our submitted ensemble model demonstrates competitive performance, ranking among the top 10 best-performing systems out of over 100 submitted systems. In this paper, we evaluate and further improve the CAFA3-submitted system. Our machine learning models, together with the data pre-processing and feature generation tools, are publicly available as open source software at https://github.com/TurkuNLP/CAFA3.

Author summary: Understanding the role and function of proteins in biological processes is fundamental for new biological discoveries. Whereas modern sequencing methods have led to a rapid growth of protein databases, the function of these sequences is often unknown and expensive to determine experimentally. This has spurred a lot of interest in predictive modelling of protein functions. We develop a machine learning system for annotating protein sequences with functional definitions selected from a vast set of predefined functions. The approach is based on a combination of neural network and random forest classifiers with features covering structural and taxonomic properties and sequence similarity. The system is thoroughly evaluated on a large set of manually curated functional annotations and shows competitive performance in comparison to other suggested approaches. We also analyze the predictions for different functional annotation and taxonomy categories and measure the importance of different features for the task. This analysis reveals that the system is particularly efficient for bacterial protein sequences.
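
The abstract does not say how the RF and NN predictions are combined. As a minimal sketch of one common option, the snippet below takes a weighted average of per-term scores from two classifiers and keeps the terms above a threshold. The function name, the weighting, the threshold, and the GO identifiers (used here purely as placeholders) are assumptions for illustration, not the authors' method.

```python
def ensemble_scores(rf_scores, nn_scores, rf_weight=0.5):
    """Weighted average of per-term scores from two classifiers.

    A term that one classifier did not predict is treated as score 0.0
    for that classifier (an assumption of this sketch).
    """
    terms = set(rf_scores) | set(nn_scores)
    return {
        term: rf_weight * rf_scores.get(term, 0.0)
        + (1.0 - rf_weight) * nn_scores.get(term, 0.0)
        for term in terms
    }


# Hypothetical per-term scores for a single protein.
rf = {"GO:0000001": 0.5, "GO:0000002": 1.0}
nn = {"GO:0000001": 1.0, "GO:0000003": 0.5}

combined = ensemble_scores(rf, nn)

# Keep terms whose combined score reaches a hypothetical threshold of 0.5.
selected = sorted(t for t, s in combined.items() if s >= 0.5)
print(selected)
```

Changing `rf_weight` shifts trust toward one classifier; in practice such a weight would be tuned on held-out data.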

Development of a structure based protein function prediction method: Calcium binding protein

Chem-Bio Informatics Journal ◽

10.1273/cbij.3.96 ◽

2003 ◽

Vol 3 ◽

pp. 96-113 ◽

Cited By ~ 3

Author(s):

Takeo Asaoka ◽

Tadashi Ando ◽

Toshiyuki Meguro ◽

Ichiro Yamato

Keyword(s):

Protein Function ◽

Calcium Binding ◽

Binding Protein ◽

Protein Function Prediction ◽

Prediction Method ◽

Function Prediction ◽

Calcium Binding Protein

Unsupervised protein embeddings outperform hand-crafted sequence and structure features at predicting molecular function

Bioinformatics ◽

10.1093/bioinformatics/btaa701 ◽

2020 ◽

Cited By ~ 1

Author(s):

Amelia Villegas-Morcillo ◽

Stavros Makrodimitris ◽

Roeland C H J van Ham ◽

Angel M Gomez ◽

Victoria Sanchez ◽

...

Keyword(s):

Protein Function ◽

Prediction Models ◽

Protein Function Prediction ◽

3D Structure ◽

Function Prediction ◽

Feature Representation ◽

Training Data ◽

Supplementary Information ◽

Molecular Function ◽

Structure Information

Motivation: Protein function prediction is a difficult bioinformatics problem. Many recent methods use deep neural networks to learn complex sequence representations and predict function from these. Deep supervised models require a lot of labeled training data, which are not available for this task. However, a very large number of protein sequences without functional labels are available. Results: We applied an existing deep sequence model that had been pretrained in an unsupervised setting to the supervised task of protein molecular function prediction. We found that this complex feature representation is effective for this task, outperforming hand-crafted features such as one-hot encoding of amino acids, k-mer counts, secondary structure and backbone angles. Also, it partly negates the need for complex prediction models, as a two-layer perceptron was enough to achieve competitive performance in the third Critical Assessment of Functional Annotation benchmark. We also show that combining this sequence representation with protein 3D structure information does not lead to performance improvement, hinting that 3D structure is also potentially learned during the unsupervised pretraining. Availability and implementation: Implementations of all used models can be found at https://github.com/stamakro/GCN-for-Structure-and-Function. Supplementary information: Supplementary data are available at Bioinformatics online.
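
For readers unfamiliar with the hand-crafted baselines named in the abstract, here is a minimal sketch of two of them: one-hot encoding of amino acids and k-mer counts. The function names, the 20-letter alphabet ordering, the handling of unknown residues, and the example sequence are assumptions of this sketch and are not taken from the paper's code.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def one_hot(sequence):
    """Encode each residue as a 20-dimensional indicator vector."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    rows = []
    for residue in sequence:
        row = [0] * len(AMINO_ACIDS)
        if residue in index:  # unknown residues stay all-zero
            row[index[residue]] = 1
        rows.append(row)
    return rows


def kmer_counts(sequence, k=3):
    """Count every overlapping substring of length k."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


seq = "MKVMKV"  # hypothetical toy sequence
encoded = one_hot(seq)
counts = kmer_counts(seq, k=3)
print(len(encoded), len(encoded[0]))
print(counts.most_common(1))
```

Both representations are fixed by hand and carry no learned context, which is the contrast the abstract draws with pretrained sequence embeddings.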

Phylo-PFP: improved automated protein function prediction using phylogenetic distance of distantly related sequences

Bioinformatics ◽

10.1093/bioinformatics/bty704 ◽

2018 ◽

Vol 35 (5) ◽

pp. 753-759 ◽

Cited By ~ 8

Author(s):

Aashish Jain ◽

Daisuke Kihara

Keyword(s):

Protein Function ◽

Transfer Functions ◽

Sequence Similarity ◽

Protein Function Prediction ◽

Prediction Method ◽

Query Protein ◽

Function Prediction ◽

Homology Search ◽

Supplementary Information ◽

Phylogenetic Distance

Motivation: Function annotation of proteins is fundamental in contemporary biology across fields including genomics, molecular biology, biochemistry, systems biology and bioinformatics. Function prediction is indispensable in providing clues for interpreting omics-scale data as well as in assisting biologists to build hypotheses for designing experiments. As sequencing genomes is now routine due to the rapid advancement of sequencing technologies, computational protein function prediction methods have become increasingly important. A conventional method of annotating a protein sequence is to transfer functions from top hits of a homology search; however, this approach has substantial shortcomings, including low coverage in genome annotation. Results: Here we have developed Phylo-PFP, a new sequence-based protein function prediction method, which mines functional information from a broad range of similar sequences, including those with a low sequence similarity identified by a PSI-BLAST search. To evaluate functional similarity between identified sequences and the query protein more accurately, Phylo-PFP reranks retrieved sequences by considering their phylogenetic distance. Compared to its predecessor, PFP, which was among the top ranked methods in the second round of the Critical Assessment of Functional Annotation (CAFA2), Phylo-PFP demonstrated substantial improvement in prediction accuracy. Phylo-PFP was further shown to outperform the prediction programs that were ranked top in CAFA2. Availability and implementation: The Phylo-PFP web server is available at http://kiharalab.org/phylo_pfp.php. Supplementary information: Supplementary data are available at Bioinformatics online.

3P282 FCANAL : a structure based protein function prediction method. Application to enzyme active sites and metal binding sites

Seibutsu Butsuri ◽

10.2142/biophys.44.s260_2 ◽

2004 ◽

Vol 44 (supplement) ◽

pp. S260

Author(s):

A. Suzuki ◽

T. Ando ◽

I. Yamato ◽

S. Miyazaki

Keyword(s):

Metal Binding ◽

Binding Sites ◽

Protein Function ◽

Active Sites ◽

Protein Function Prediction ◽

Prediction Method ◽

Function Prediction ◽

Metal Binding Sites ◽

Enzyme Active Sites

2P303 FCANAL : A structure based protein function prediction method. Application to enzymes and binding proteins

Seibutsu Butsuri ◽

10.2142/biophys.45.s195_3 ◽

2005 ◽

Vol 45 (supplement) ◽

pp. S195

Author(s):

A. Suzuki ◽

T. Ando ◽

A. Matsumura ◽

H. Sakao ◽

I. Yamato ◽

...

Keyword(s):

Protein Function ◽

Binding Proteins ◽

Protein Function Prediction ◽

Prediction Method ◽

Function Prediction

2P-242 FCANAL, structure-based protein function prediction method, applied to various types of proteins (Bioinformatics: Functional genomics, The 47th Annual Meeting of the Biophysical Society of Japan)

Seibutsu Butsuri ◽

10.2142/biophys.49.s145_1 ◽

2009 ◽

Vol 49 (supplement) ◽

pp. S145

Author(s):

Yuuichi Watanabe ◽

Kousuke Kaido ◽

Takashi Ando ◽

Ichiro Yamato ◽

Satoru Miyazaki

Keyword(s):

Annual Meeting ◽

Protein Function ◽

Protein Function Prediction ◽

Prediction Method ◽

Function Prediction ◽

Biophysical Society

Hands-on on Protein Function Prediction with Machine Learning and Interactive Analytics

10.6019/tol.unip_machine-w.2018.00001.1 ◽

2018 ◽

Keyword(s):

Machine Learning ◽

Protein Function ◽

Protein Function Prediction ◽

Function Prediction ◽

Hands On

Human Protein Function Prediction Enhancement Using Decision Tree Based Machine Learning Approach

Communications in Computer and Information Science - Information, Communication and Computing Technology ◽

10.1007/978-981-15-1384-8_23 ◽

2019 ◽

pp. 279-293

Author(s):

Sunny Sharma ◽

Gurvinder Singh ◽

Rajinder Singh

Keyword(s):

Machine Learning ◽

Decision Tree ◽

Protein Function ◽

Protein Function Prediction ◽

Function Prediction ◽

Human Protein ◽

Learning Approach ◽

Machine Learning Approach

Multi-Instance Multilabel Learning with Weak-Label for Predicting Protein Function in Electricigens

BioMed Research International ◽

10.1155/2015/619438 ◽

2015 ◽

Vol 2015 ◽

pp. 1-9

Author(s):

Jian-Sheng Wu ◽

Hai-Feng Hu ◽

Shan-Cheng Yan ◽

Li-Hua Tang

Keyword(s):

Protein Function ◽

Microbial Fuel Cells ◽

Learning Algorithm ◽

Protein Function Prediction ◽

Function Prediction ◽

Vast Number ◽

Learning Framework ◽

Learning Tasks ◽

Multilabel Learning ◽

Protein Functions

Nature often brings several domains together to form multidomain and multifunctional proteins with a vast number of possibilities. In our previous study, we showed that the protein function prediction problem is naturally and inherently a Multi-Instance Multilabel (MIML) learning task. Automated protein function prediction is typically implemented under the assumption that the functions of labeled proteins are complete; that is, there are no missing labels. In practice, however, only a subset of the functions of a protein is known, and whether the protein has other functions is unknown. Protein function prediction tasks therefore suffer from the weak-label problem, and protein function prediction with incomplete annotation matches well with the MIML with weak-label learning framework. In this paper, we have applied the state-of-the-art MIML with weak-label learning algorithm MIMLwel to predict protein functions in two typical real-world electricigen organisms that have been widely used in microbial fuel cell (MFC) research. Our experimental results validate the effectiveness of the MIMLwel algorithm in predicting protein functions with incomplete annotation.
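
To make the MIML weak-label setting concrete, the following minimal sketch shows one way to represent the data: each protein is a bag of instances with a set of known functions, and a function that is not listed counts as unknown rather than negative. Treating each instance as one domain's feature vector is an assumption suggested by the abstract; the class, the field names, and the values are hypothetical, and this is not the MIMLwel algorithm itself.

```python
from dataclasses import dataclass, field


@dataclass
class ProteinBag:
    """One protein in a multi-instance multilabel dataset."""
    name: str
    instances: list  # one feature vector per instance (assumed: per domain)
    known_functions: set = field(default_factory=set)


def label_status(bag, function):
    """Weak-label reading: an absent label is unknown, not negative."""
    return "positive" if function in bag.known_functions else "unknown"


# Hypothetical protein with two instances and one annotated function.
bag = ProteinBag(
    name="protein_1",
    instances=[[0.1, 0.2], [0.3, 0.4]],
    known_functions={"function_a"},
)

print(label_status(bag, "function_a"))
print(label_status(bag, "function_b"))
```

A fully supervised learner would treat `function_b` as a negative example for this protein; under weak labels it remains a candidate that the learner may still predict.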
