Neural network and random forest models in protein function prediction

Abstract

Over the past decade, the demand for automated protein function prediction has increased due to the volume of newly sequenced proteins. In this paper, we address the function prediction task by developing an ensemble system that automatically assigns Gene Ontology (GO) terms to a given input protein sequence.

The ensemble combines the GO predictions made by random forest (RF) and neural network (NN) classifiers. Both the RF and NN models rely on features derived from BLAST sequence alignments, taxonomy and protein signature analysis tools. In addition, we report on experiments with an NN model that analyzes the amino acid sequence directly as its sole input, using a convolutional layer. The Swiss-Prot database is used as the training and evaluation data.

In the CAFA3 evaluation, which relies on experimental verification of the functional predictions, our submitted ensemble model demonstrates competitive performance, ranking among the top 10 best-performing systems out of over 100 submitted systems. In this paper, we evaluate and further improve the CAFA3-submitted system. Our machine learning models, together with the data pre-processing and feature generation tools, are publicly available as open source software at https://github.com/TurkuNLP/CAFA3

Author summary

Understanding the role and function of proteins in biological processes is fundamental for new biological discoveries. Whereas modern sequencing methods have led to a rapid growth of protein databases, the function of these sequences is often unknown and expensive to determine experimentally. This has spurred considerable interest in predictive modelling of protein functions.

We develop a machine learning system for annotating protein sequences with functional definitions selected from a vast set of predefined functions. The approach is based on a combination of neural network and random forest classifiers with features covering structural and taxonomic properties and sequence similarity.

The system is thoroughly evaluated on a large set of manually curated functional annotations and shows competitive performance in comparison to other suggested approaches. We also analyze the predictions for different functional annotation and taxonomy categories and measure the importance of different features for the task. This analysis reveals that the system is particularly effective for bacterial protein sequences.
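
The abstract above says that the ensemble combines the GO predictions of the RF and NN classifiers but does not state how. As a minimal sketch only, assuming each model outputs a confidence score per GO term and that the scores are merged by a weighted average, the idea might look like the following. The function names, the averaging rule, the weight and the threshold are all hypothetical and are not taken from the paper or its repository; the GO identifiers are illustrative inputs.

```python
from typing import Dict, List


def combine_go_scores(
    rf_scores: Dict[str, float],
    nn_scores: Dict[str, float],
    rf_weight: float = 0.5,
) -> Dict[str, float]:
    """Weighted average of per-GO-term scores from two models.

    Assumption: a term that one model does not score counts as 0.0
    for that model. The paper may combine predictions differently.
    """
    if not 0.0 <= rf_weight <= 1.0:
        raise ValueError("rf_weight must lie in [0, 1]")
    nn_weight = 1.0 - rf_weight
    terms = set(rf_scores) | set(nn_scores)
    return {
        term: rf_weight * rf_scores.get(term, 0.0)
        + nn_weight * nn_scores.get(term, 0.0)
        for term in terms
    }


def predict_terms(scores: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Return the GO terms whose combined score reaches the threshold."""
    return sorted(term for term, score in scores.items() if score >= threshold)


if __name__ == "__main__":
    # Hypothetical scores for one protein from each classifier.
    rf = {"GO:0003677": 0.9, "GO:0005524": 0.2}
    nn = {"GO:0003677": 0.7, "GO:0016301": 0.8}
    combined = combine_go_scores(rf, nn)
    print(predict_terms(combined))
```

With equal weights, a term scored highly by only one model is halved, so the threshold has to be chosen with that in mind; a real system would tune both the weight and the threshold on held-out data.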


DeepGOZero: Improving protein function prediction from sequence and zero-shot learning based on ontology axioms

10.1101/2022.01.14.476325 ◽

2022 ◽

Author(s):

Maxat Kulmanov ◽

Robert Hoehndorf

Keyword(s):

Machine Learning ◽

Protein Function ◽

Protein Function Prediction ◽

Prediction Method ◽

Function Prediction ◽

Training Data ◽

Large Set ◽

Theoretic Approach ◽

Machine Learning Model ◽

Protein Functions

Motivation: Protein functions are often described using the Gene Ontology (GO), an ontology consisting of over 50,000 classes and a large set of formal axioms. Predicting the functions of proteins is one of the key challenges in computational biology, and a variety of machine learning methods have been developed for this purpose. However, these methods usually require a significant amount of training data and cannot make predictions for GO classes that have only a few or no experimental annotations.

Results: We developed DeepGOZero, a machine learning model which improves predictions for functions with no or only a small number of annotations. To achieve this goal, we rely on a model-theoretic approach for learning ontology embeddings and combine it with neural networks for protein function prediction. DeepGOZero can exploit formal axioms in the GO to make zero-shot predictions, i.e., predict protein functions even if not a single protein in the training phase was associated with that function. Furthermore, the zero-shot prediction method employed by DeepGOZero is generic and can be applied whenever associations with ontology classes need to be predicted.

Availability: http://github.com/bio-ontology-research-group/deepgozero


Neural Network and Random Forest Models in Protein Function Prediction

IEEE/ACM Transactions on Computational Biology and Bioinformatics ◽

10.1109/tcbb.2020.3044230 ◽

2020 ◽

pp. 1-1

Author(s):

Kai Hakala ◽

Suwisa Kaewphan ◽

Jari Bjorne ◽

Farrokh Mehryary ◽

Hans Moen ◽

...

Keyword(s):

Neural Network ◽

Random Forest ◽

Protein Function ◽

Protein Function Prediction ◽

Function Prediction ◽

Forest Models ◽

Random Forest Models


ProLanGO: Protein Function Prediction Using Neural Machine Translation Based on a Recurrent Neural Network

Molecules ◽

10.3390/molecules22101732 ◽

2017 ◽

Vol 22 (10) ◽

pp. 1732 ◽

Cited By ~ 84

Author(s):

Renzhi Cao ◽

Colton Freitas ◽

Leong Chan ◽

Miao Sun ◽

Haiqing Jiang ◽

...

Keyword(s):

Neural Network ◽

Machine Translation ◽

Recurrent Neural Network ◽

Protein Function ◽

Protein Function Prediction ◽

Function Prediction ◽

Neural Machine Translation


A Deep Neural Network Based Hierarchical Multi-Label Classifier for Protein Function Prediction

2019 International Conference on Computer, Information and Telecommunication Systems (CITS) ◽

10.1109/cits.2019.8862034 ◽

2019 ◽

Author(s):

Xin Yuan ◽

Weite Li ◽

Kui Lin ◽

Jinglu Hu

Keyword(s):

Neural Network ◽

Protein Function ◽

Deep Neural Network ◽

Protein Function Prediction ◽

Function Prediction


Hands-on on Protein Function Prediction with Machine Learning and Interactive Analytics

10.6019/tol.unip_machine-w.2018.00001.1 ◽

2018 ◽

Keyword(s):

Machine Learning ◽

Protein Function ◽

Protein Function Prediction ◽

Function Prediction ◽

Hands On


Human Protein Function Prediction Enhancement Using Decision Tree Based Machine Learning Approach

Communications in Computer and Information Science - Information, Communication and Computing Technology ◽

10.1007/978-981-15-1384-8_23 ◽

2019 ◽

pp. 279-293

Author(s):

Sunny Sharma ◽

Gurvinder Singh ◽

Rajinder Singh

Keyword(s):

Machine Learning ◽

Decision Tree ◽

Protein Function ◽

Protein Function Prediction ◽

Function Prediction ◽

Human Protein ◽

Learning Approach ◽

Machine Learning Approach


Assessing the Performances of Protein Function Prediction Algorithms from the Perspectives of Identification Accuracy and False Discovery Rate

10.20944/preprints201711.0160.v1 ◽

2017 ◽

Author(s):

Chunyan Yu ◽

Xiaoxu Li ◽

Hong Yang ◽

Yinghong Li ◽

Weiwei Xue ◽

...

Keyword(s):

Machine Learning ◽

False Discovery Rate ◽

Protein Function ◽

Protein Function Prediction ◽

Function Prediction ◽

Machine Learning Algorithms ◽

Identification Accuracy ◽

Homologous Proteins ◽

Prediction Algorithms ◽

False Discovery

Knowledge of protein function is essential for the study of biological processes, the understanding of disease mechanisms and the exploration of novel therapeutic targets. Apart from experimental methods, a number of in-silico approaches have been developed and extensively used for protein function prediction. Among these approaches, BLAST predicts functions based on protein sequence similarity, whereas machine learning predicts functional families from protein sequences irrespective of their similarity, which complements BLAST and other methods in predicting diverse classes of proteins, including distantly related proteins and homologous proteins with different functions. However, the identification accuracies and false discovery rates of these approaches have not yet been assessed, which greatly limits the use of these prediction algorithms.

Herein, a comprehensive comparison of the performance of four popular functional prediction algorithms (BLAST, SVM, PNN and KNN) was conducted. In particular, the performance of these algorithms was systematically assessed by four metrics (sensitivity, specificity, accuracy and Matthews correlation coefficient) based on independent test datasets generated from 93 protein families defined by UniProtKB Keywords. Moreover, the false discovery rates of these algorithms were evaluated by scanning the genomes of four representative model species (Homo sapiens, Arabidopsis thaliana, Saccharomyces cerevisiae and Mycobacterium tuberculosis). BLAST and SVM showed substantially higher sensitivity and stability than PNN and KNN. However, the machine learning algorithms (PNN, KNN and SVM) were found capable of significantly reducing the false discovery rate (SVM < PNN ≈ KNN).

In summary, this study comprehensively assessed the performance of four popular algorithms applied to protein function prediction, which could facilitate the selection of the most appropriate method in related biomedical research.
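
The four metrics named in the abstract above have standard definitions in terms of a binary confusion matrix. The sketch below computes them from counts of true/false positives and negatives; the function name and the example counts are hypothetical and are not data from the study. The false discovery rate is included using its usual definition, FP / (FP + TP), which is an assumption here: the study evaluates false discovery by scanning whole genomes, and its exact procedure may differ.

```python
import math
from typing import Dict


def _ratio(numerator: float, denominator: float) -> float:
    """Divide, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def confusion_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Standard binary classification metrics from confusion-matrix counts."""
    mcc_denominator = math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        # Fraction of true family members that were recovered.
        "sensitivity": _ratio(tp, tp + fn),
        # Fraction of non-members that were correctly rejected.
        "specificity": _ratio(tn, tn + fp),
        # Fraction of all decisions that were correct.
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),
        # Matthews correlation coefficient, in [-1, 1].
        "mcc": _ratio(tp * tn - fp * fn, mcc_denominator),
        # Usual definition of FDR; assumed, not taken from the study.
        "fdr": _ratio(fp, fp + tp),
    }


if __name__ == "__main__":
    # Hypothetical counts for one protein family.
    for name, value in confusion_metrics(tp=40, fp=10, tn=45, fn=5).items():
        print(f"{name}: {value:.3f}")
```

Unlike accuracy, the Matthews correlation coefficient stays informative when positives and negatives are heavily imbalanced, which is typical when a single protein family is tested against a large negative set.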


The Power of Universal Contextualized Protein Embeddings in Cross-species Protein Function Prediction

Evolutionary Bioinformatics ◽

10.1177/11769343211062608 ◽

2021 ◽

Vol 17 ◽

pp. 117693432110626

Author(s):

Irene van den Bent ◽

Stavros Makrodimitris ◽

Marcel Reinders

Keyword(s):

Protein Function ◽

Contextual Information ◽

Protein Function Prediction ◽

Protein Sequences ◽

Learning Task ◽

Function Prediction ◽

Training Data ◽

Language Models ◽

Molecular Function ◽

Test Species

Computationally annotating proteins with a molecular function is a difficult problem that is made even harder by the limited amount of labeled protein training data available. Unsupervised protein embeddings partly circumvent this limitation by learning a universal protein representation from many unlabeled sequences. Such embeddings incorporate contextual information about amino acids, thereby modeling the underlying principles of protein sequences independently of species context. We used an existing pre-trained protein embedding method and characterized its molecular function prediction performance in detail, first to advance the understanding of protein language models, and second to determine areas for improvement. Then, we applied the model in a transfer learning task by training a function predictor on the embeddings of annotated protein sequences from one training species and making predictions on the proteins of several test species at varying evolutionary distances. We show that this approach successfully generalizes knowledge about protein function from one eukaryotic species to various other species, outperforming both an alignment-based and a supervised-learning-based baseline. This implies that such a method could be effective for molecular function prediction in inadequately annotated species from understudied taxonomic kingdoms.
