Machine learning for discovering missing or wrong protein function annotations

Background: A massive amount of proteomic data is generated daily, yet annotating all sequences is costly and often infeasible. As a countermeasure, machine learning methods have been used to automatically annotate new protein functions. More specifically, many studies have investigated hierarchical multi-label classification (HMC) methods to predict annotations, using the Functional Catalogue (FunCat) or Gene Ontology (GO) label hierarchies. Most of these studies employed benchmark datasets created more than a decade ago, and thus trained their models on outdated information. In this work, we provide an updated version of these datasets. By querying recent versions of FunCat and GO yeast annotations, we provide 24 new datasets in total. We compare four HMC methods, providing baseline results for the new datasets. Furthermore, we evaluate whether the predictive models are able to discover new or wrong annotations, by training them on the old data and evaluating their results against the most recent information. Results: The method based on predictive clustering trees, Clus-Ensemble, proposed in 2008, achieved superior results compared to more recent methods on the standard evaluation task. For the discovery of new knowledge, Clus-Ensemble performed better when discovering new annotations in the FunCat taxonomy, whereas hierarchical multi-label classification with genetic algorithm (HMC-GA) was overall superior when detecting annotations that were removed. In the GO datasets, Clus-Ensemble once again had the upper hand when discovering new annotations, while HMC-GA performed better for detecting removed annotations. However, in this evaluation, there were fewer significant differences among the methods. Conclusions: The experiments have shown that protein function prediction is a very challenging task which should be further investigated. We believe that the baseline results associated with the updated datasets provided in this work should be considered as guidelines for future studies. Nonetheless, the old versions of the datasets should not be disregarded, since other tasks in machine learning could benefit from them.
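
The discovery evaluation described in this abstract (train on old annotations, then score predictions against the most recent ones) can be pictured as simple set arithmetic. The following is a minimal sketch under stated assumptions, not the authors' evaluation code: the function name `discovery_counts`, the protein identifier, and the FunCat-style labels are all hypothetical.

```python
def discovery_counts(old, new, predicted):
    """Compare predictions from a model trained on old annotations
    against a newer annotation release.

    old, new, predicted: dict mapping protein ID -> set of labels.
    Illustrative only; the paper's actual metrics are not reproduced here.
    """
    added_total = added_found = 0
    removed_total = removed_flagged = 0
    for protein, old_labels in old.items():
        new_labels = new.get(protein, set())
        pred_labels = predicted.get(protein, set())
        added = new_labels - old_labels      # annotations that appeared later
        removed = old_labels - new_labels    # annotations that were withdrawn
        added_total += len(added)
        added_found += len(added & pred_labels)        # new labels the model anticipated
        removed_total += len(removed)
        removed_flagged += len(removed - pred_labels)  # withdrawn labels the model did not predict
    return {
        "added_total": added_total,
        "added_found": added_found,
        "removed_total": removed_total,
        "removed_flagged": removed_flagged,
    }


# Toy data (hypothetical): label "02" was removed and label "10" was added.
old = {"protein_1": {"01", "01.01", "02"}}
new = {"protein_1": {"01", "01.01", "10"}}
predicted = {"protein_1": {"01", "01.01", "10"}}
print(discovery_counts(old, new, predicted))
```

A real evaluation would also account for the label hierarchy and prediction thresholds, which this sketch leaves out.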

DeepGOZero: Improving protein function prediction from sequence and zero-shot learning based on ontology axioms

10.1101/2022.01.14.476325 ◽

2022 ◽

Author(s):

Maxat Kulmanov ◽

Robert Hoehndorf

Keyword(s):

Machine Learning ◽

Protein Function ◽

Protein Function Prediction ◽

Prediction Method ◽

Function Prediction ◽

Training Data ◽

Large Set ◽

Theoretic Approach ◽

Machine Learning Model ◽

Protein Functions

Motivation: Protein functions are often described using the Gene Ontology (GO), which is an ontology consisting of over 50,000 classes and a large set of formal axioms. Predicting the functions of proteins is one of the key challenges in computational biology, and a variety of machine learning methods have been developed for this purpose. However, these methods usually require a significant amount of training data and cannot make predictions for GO classes which have only a few or no experimental annotations. Results: We developed DeepGOZero, a machine learning model which improves predictions for functions with no or only a small number of annotations. To achieve this goal, we rely on a model-theoretic approach for learning ontology embeddings and combine it with neural networks for protein function prediction. DeepGOZero can exploit formal axioms in the GO to make zero-shot predictions, i.e., predict protein functions even if not a single protein in the training phase was associated with that function. Furthermore, the zero-shot prediction method employed by DeepGOZero is generic and can be applied whenever associations with ontology classes need to be predicted. Availability: http://github.com/bio-ontology-research-group/deepgozero

A Survey for Predicting ATP Binding Residues of Proteins Using Machine Learning Methods

Current Medicinal Chemistry ◽

10.2174/0929867328666210910125802 ◽

2021 ◽

Vol 28 ◽

Author(s):

Yu-He Yang ◽

Jia-Shu Wang ◽

Shi-Shi Yuan ◽

Meng-Lu Liu ◽

Wei Su ◽

...

Keyword(s):

Machine Learning ◽

Protein Function ◽

Vital Role ◽

Atp Binding ◽

Learning Methods ◽

Machine Learning Methods ◽

Protein Ligand Interactions ◽

Protein Functions ◽

Ligand Interactions ◽

Binding Residues

Protein-ligand interactions are necessary for the majority of protein functions. Adenosine-5’-triphosphate (ATP) is one such ligand that plays a vital role as a coenzyme in providing energy for cellular activities, catalyzing biological reactions, and signaling. Knowing the ATP binding residues of proteins is helpful for protein function annotation and drug design. However, due to the huge influx of protein sequences into databases in the post-genome era, experimentally identifying ATP binding residues is costly and time-consuming. To address this problem, computational methods have been developed to predict ATP binding residues. In this review, we briefly summarize the application of machine learning methods in detecting ATP binding residues of proteins. We expect this review will be helpful for further research.

A Novel Method for Functional Annotation Prediction Based on Combination of Classification Methods

The Scientific World JOURNAL ◽

10.1155/2014/542824 ◽

2014 ◽

Vol 2014 ◽

pp. 1-9

Author(s):

Jaehee Jung ◽

Heung Ki Lee ◽

Gangman Yi

Keyword(s):

Protein Function ◽

Protein Function Prediction ◽

Controlled Vocabulary ◽

Functional Annotations ◽

Functional Homology ◽

Large Sets ◽

Unknown Protein ◽

Protein Functions ◽

Novel Method ◽

The Relationship

Automated protein function prediction is the assignment of functions to proteins of unknown function by computational methods. This technique is useful for automatically assigning gene functional annotations to undefined sequences in next-generation genome analysis (NGS). NGS is a popular research method since high-throughput technologies such as DNA sequencing and microarrays have created large sets of genes. These large sequence sets have greatly increased the need for analysis. Previous research has been based on sequence similarity, as this is strongly related to functional homology. However, this study aimed to assign protein functions automatically by utilizing InterPro (IPR), which can represent the properties of protein families and groups of protein functions. Moreover, we used the Gene Ontology (GO), the controlled vocabulary used to comprehensively describe protein function. To define the relationship between IPR and GO terms, three pattern recognition techniques were employed under different conditions, such as feature selection and weighted values instead of binary ones.

Hands-on on Protein Function Prediction with Machine Learning and Interactive Analytics

10.6019/tol.unip_machine-w.2018.00001.1 ◽

2018 ◽

Keyword(s):

Machine Learning ◽

Protein Function ◽

Protein Function Prediction ◽

Function Prediction ◽

Hands On

Post-Translation Regulation of Influenza Virus Replication

Annual Review of Virology ◽

10.1146/annurev-virology-010320-070410 ◽

2020 ◽

Vol 7 (1) ◽

pp. 167-187

Author(s):

Anthony R. Dawson ◽

Gary M. Wilson ◽

Joshua J. Coon ◽

Andrew Mehle

Keyword(s):

Influenza Virus ◽

Protein Function ◽

Translation Regulation ◽

Host Use ◽

Host Proteins ◽

Post Translational Modifications ◽

Cellular Factors ◽

Protein Functions ◽

Influenza Virus Replication ◽

New Protein

Influenza virus exploits cellular factors to complete each step of viral replication. Yet, multiple host proteins actively block replication. Consequently, infection success depends on the relative speed and efficacy at which both the virus and host use their respective effectors. Post-translational modifications (PTMs) afford both the virus and the host means to readily adapt protein function without the need for new protein production. Here we use influenza virus to address concepts common to all viruses, reviewing how PTMs facilitate and thwart each step of the replication cycle. We also discuss advancements in proteomic methods that better characterize PTMs. Although some effectors and PTMs have clear pro- or antiviral functions, PTMs generally play regulatory roles to tune protein functions, levels, and localization. Synthesis of our current understanding reveals complex regulatory schemes where the effects of PTMs are time and context dependent as the virus and host battle to control infection.

Human Protein Function Prediction Enhancement Using Decision Tree Based Machine Learning Approach

Communications in Computer and Information Science - Information, Communication and Computing Technology ◽

10.1007/978-981-15-1384-8_23 ◽

2019 ◽

pp. 279-293

Author(s):

Sunny Sharma ◽

Gurvinder Singh ◽

Rajinder Singh

Keyword(s):

Machine Learning ◽

Decision Tree ◽

Protein Function ◽

Protein Function Prediction ◽

Function Prediction ◽

Human Protein ◽

Learning Approach ◽

Machine Learning Approach

Multi-Instance Multilabel Learning with Weak-Label for Predicting Protein Function in Electricigens

BioMed Research International ◽

10.1155/2015/619438 ◽

2015 ◽

Vol 2015 ◽

pp. 1-9

Author(s):

Jian-Sheng Wu ◽

Hai-Feng Hu ◽

Shan-Cheng Yan ◽

Li-Hua Tang

Keyword(s):

Protein Function ◽

Microbial Fuel Cells ◽

Learning Algorithm ◽

Protein Function Prediction ◽

Function Prediction ◽

Vast Number ◽

Learning Framework ◽

Learning Tasks ◽

Multilabel Learning ◽

Protein Functions

Nature often brings several domains together to form multidomain and multifunctional proteins with a vast number of possibilities. In our previous study, we showed that protein function prediction is naturally and inherently a Multi-Instance Multilabel (MIML) learning task. Automated protein function prediction is typically implemented under the assumption that the functions of labeled proteins are complete; that is, there are no missing labels. In practice, however, only a subset of the functions of a protein is known, and whether the protein has other functions is unknown. Protein function prediction tasks therefore suffer from the weak-label problem, and protein function prediction with incomplete annotation matches well with the MIML with weak-label learning framework. In this paper, we apply the state-of-the-art MIML with weak-label learning algorithm MIMLwel to predict protein functions in two typical real-world electricigen organisms which have been widely used in microbial fuel cell (MFC) research. Our experimental results validate the effectiveness of the MIMLwel algorithm in predicting protein functions with incomplete annotation.

Computational design of structured loops for new protein functions

Biological Chemistry ◽

10.1515/hsz-2018-0348 ◽

2019 ◽

Vol 400 (3) ◽

pp. 275-288 ◽

Cited By ~ 10

Author(s):

Kale Kundert ◽

Tanja Kortemme

Keyword(s):

Protein Design ◽

Protein Function ◽

Structure Prediction ◽

Computational Design ◽

Loop Structure ◽

Functional Sites ◽

Loop Design ◽

Routine Design ◽

Protein Functions ◽

New Protein

The ability to engineer the precise geometries, fine-tuned energetics and subtle dynamics that are characteristic of functional proteins is a major unsolved challenge in the field of computational protein design. In natural proteins, functional sites exhibiting these properties often feature structured loops. However, unlike the elements of secondary structures that comprise idealized protein folds, structured loops have been difficult to design computationally. Addressing this shortcoming in a general way is a necessary first step towards the routine design of protein function. In this perspective, we will describe the progress that has been made on this problem and discuss how recent advances in the field of loop structure prediction can be harnessed and applied to the inverse problem of computational loop design.

DYNAMICALLY SEARCHING FOR A DOMAIN FOR PROTEIN FUNCTION PREDICTION

Journal of Bioinformatics and Computational Biology ◽

10.1142/s021972001350008x ◽

2013 ◽

Vol 11 (04) ◽

pp. 1350008 ◽

Cited By ~ 1

Author(s):

JINGYU HOU ◽

YONGQING JIANG

Keyword(s):

Protein Function ◽

Structural Information ◽

Protein Function Prediction ◽

Function Prediction ◽

Computational Approaches ◽

Protein Protein Interaction ◽

Protein Functions ◽

Novel Method ◽

Function Information ◽

The Relationship

The availability of large amounts of protein–protein interaction (PPI) data makes it feasible to use computational approaches to predict protein functions. The basis of existing computational approaches is to exploit the known function information of annotated proteins in the PPI data to predict functions of un-annotated proteins. However, these approaches treat the prediction domain (i.e. the set of proteins from which the functions are predicted) as fixed during the prediction procedure. As a result, valuable information may be overwhelmed by the unavoidable noise in the PPI data, which in turn distorts the prediction results. In this paper, we propose a novel method to dynamically predict protein functions from the PPI data. Our method regards function prediction as a dynamic process of finding a suitable prediction domain, from which representative functions of the domain are selected to predict functions of un-annotated proteins. Our method exploits the topological structural information of a PPI network and the semantic relationship between protein functions to measure the relationship between proteins, dynamically select a suitable prediction domain and predict functions. The evaluation on real PPI datasets demonstrated the effectiveness of our proposed method, which generated better prediction results.

COMPUTATIONAL METHOD FOR PROTEIN FUNCTION PREDICTION BY CONSTRUCTING PROTEIN INTERACTION NETWORK DICTIONARY

International Journal of Pattern Recognition and Artificial Intelligence ◽

10.1142/s0218001406004661 ◽

2006 ◽

Vol 20 (02) ◽

pp. 285-295 ◽

Cited By ~ 2

Author(s):

HEE-JEONG JIN ◽

HWAN-GUE CHO

Keyword(s):

Protein Interaction ◽

Protein Interaction Network ◽

Protein Function ◽

Protein Function Prediction ◽

Interaction Network ◽

Computational Method ◽

Chi Square ◽

Protein Protein Interaction ◽

Protein Functions ◽

Novel Method

In the post-genomic era, predicting protein function is a challenging problem. It is difficult and burdensome work to unravel the functions of a protein by wet-lab experiments only. In this paper, we propose a novel method to predict protein functions by building a "Protein Interaction Network Dictionary (PIND)". This method deduces the protein functions by searching for the most similar "words" (an anagram of functions in neighbor proteins on a protein–protein interaction graph) using global alignments. An evaluation of sensitivity and specificity shows that this PIND approach outperforms previous approaches such as Majority Rule and Chi-Square measure, and that it competes with the recently introduced Random Markov Model approach.
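
For context, the Majority Rule baseline that this abstract compares against is commonly understood as assigning a protein the functions that occur most often among its annotated interaction partners. The sketch below illustrates that baseline only, not the PIND method itself; the function name `majority_rule`, the adjacency-dict graph format, and the toy data are assumptions made for illustration.

```python
from collections import Counter


def majority_rule(graph, annotations, protein, k=1):
    """Return the k functions seen most often among a protein's neighbors.

    graph: dict mapping protein ID -> iterable of neighbor IDs.
    annotations: dict mapping protein ID -> set of function labels.
    Ties are broken alphabetically so the output is deterministic.
    """
    counts = Counter()
    for neighbor in graph.get(protein, ()):
        counts.update(annotations.get(neighbor, ()))
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [label for label, _ in ranked[:k]]


# Toy network (hypothetical): protein "A" interacts with "B", "C" and "D".
graph = {"A": ["B", "C", "D"]}
annotations = {
    "B": {"kinase", "transport"},
    "C": {"kinase"},
    "D": {"kinase", "transport"},
}
print(majority_rule(graph, annotations, "A", k=2))
```

Unannotated neighbors simply contribute no votes, so a protein with no annotated partners receives an empty prediction.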
