Machine learning for discovering missing or wrong protein function annotations

2019, Vol 20 (1)
Author(s): Felipe Kenji Nakano, Mathias Lietaert, Celine Vens

Background: A massive amount of proteomic data is generated on a daily basis; nonetheless, annotating all sequences is costly and often unfeasible. As a countermeasure, machine learning methods have been used to automatically annotate new protein functions. More specifically, many studies have investigated hierarchical multi-label classification (HMC) methods to predict annotations, using the Functional Catalogue (FunCat) or Gene Ontology (GO) label hierarchies. Most of these studies employed benchmark datasets created more than a decade ago, and thus train their models on outdated information. In this work, we provide an updated version of these datasets. By querying recent versions of FunCat and GO yeast annotations, we provide 24 new datasets in total. We compare four HMC methods, providing baseline results for the new datasets. Furthermore, we evaluate whether the predictive models are able to discover new or wrong annotations, by training them on the old data and evaluating their results against the most recent information. Results: The results demonstrated that the method based on predictive clustering trees, Clus-Ensemble, proposed in 2008, achieved superior results compared to more recent methods on the standard evaluation task. For the discovery of new knowledge, Clus-Ensemble performed better when discovering new annotations in the FunCat taxonomy, whereas hierarchical multi-label classification with genetic algorithm (HMC-GA), a method based on genetic algorithms, was overall superior when detecting annotations that were removed. In the GO datasets, Clus-Ensemble once again had the upper hand when discovering new annotations, while HMC-GA performed better for detecting removed annotations. However, in this evaluation, there were fewer significant differences among the methods. Conclusions: The experiments have shown that protein function prediction is a very challenging task which should be further investigated. We believe that the baseline results associated with the updated datasets provided in this work should be considered as guidelines for future studies; nonetheless, the old versions of the datasets should not be disregarded, since other tasks in machine learning could benefit from them.
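
The idea of evaluating against newer annotations can be illustrated by comparing two annotation releases protein by protein. The following is a minimal sketch, not the authors' pipeline: the function names, the toy FunCat-style codes, and the assumption of a tree-shaped hierarchy (each label has at most one parent, which does not hold for GO) are all hypothetical and chosen for illustration.

```python
def with_ancestors(labels, parent):
    """Close a label set under the hierarchy by adding every ancestor of each label."""
    closed = set()
    for label in labels:
        # Walk up the tree until the root (None) or an already-visited label.
        while label is not None and label not in closed:
            closed.add(label)
            label = parent.get(label)
    return closed


def annotation_changes(old, new, parent):
    """Return (added, removed) label sets per protein between two annotation releases."""
    added, removed = {}, {}
    for protein in old.keys() | new.keys():
        old_labels = with_ancestors(old.get(protein, ()), parent)
        new_labels = with_ancestors(new.get(protein, ()), parent)
        added[protein] = new_labels - old_labels      # candidate "missing" annotations
        removed[protein] = old_labels - new_labels    # candidate "wrong" annotations
    return added, removed


# Toy hierarchy (child -> parent); the codes are made up for illustration.
parent = {"01": None, "01.01": "01", "01.02": "01", "02": None, "02.01": "02"}
old = {"P1": {"01.01"}, "P2": {"02.01"}}
new = {"P1": {"01.01", "01.02"}, "P2": {"01.02"}}

added, removed = annotation_changes(old, new, parent)
```

A model trained on the old release could then be scored on how highly it ranks the labels in `added` and how low it ranks those in `removed`; which metric to use is left open here.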

2022
Author(s): Maxat Kulmanov, Robert Hoehndorf

Motivation: Protein functions are often described using the Gene Ontology (GO), an ontology consisting of over 50,000 classes and a large set of formal axioms. Predicting the functions of proteins is one of the key challenges in computational biology, and a variety of machine learning methods have been developed for this purpose. However, these methods usually require a significant amount of training data and cannot make predictions for GO classes which have only a few or no experimental annotations. Results: We developed DeepGOZero, a machine learning model which improves predictions for functions with no or only a small number of annotations. To achieve this goal, we rely on a model-theoretic approach for learning ontology embeddings and combine it with neural networks for protein function prediction. DeepGOZero can exploit formal axioms in the GO to make zero-shot predictions, i.e., predict protein functions even if not a single protein in the training phase was associated with that function. Furthermore, the zero-shot prediction method employed by DeepGOZero is generic and can be applied whenever associations with ontology classes need to be predicted. Availability: http://github.com/bio-ontology-research-group/deepgozero


2021, Vol 28
Author(s): Yu-He Yang, Jia-Shu Wang, Shi-Shi Yuan, Meng-Lu Liu, Wei Su, ...

Protein-ligand interactions are necessary for the majority of protein functions. Adenosine-5’-triphosphate (ATP) is one such ligand that plays a vital role as a coenzyme in providing energy for cellular activities, catalyzing biological reactions, and signaling. Knowing the ATP binding residues of proteins is helpful for the annotation of protein function and for drug design. However, due to the huge influx of protein sequences into databases in the post-genome era, experimentally identifying ATP binding residues is cost-ineffective and time-consuming. To address this problem, computational methods have been developed to predict ATP binding residues. In this review, we briefly summarize the application of machine learning methods in detecting ATP binding residues of proteins. We expect this review will be helpful for further research.


2014, Vol 2014, pp. 1-9
Author(s): Jaehee Jung, Heung Ki Lee, Gangman Yi

Automated protein function prediction is the assignment of functions to proteins of unknown function by computational methods. This technique is useful for automatically assigning gene functional annotations to undefined sequences in next generation genome analysis (NGS). NGS is a popular research method, since high-throughput technologies such as DNA sequencing and microarrays have created large sets of genes. These huge sets of sequences have greatly increased the need for analysis. Previous research has been based on sequence similarity, as this is strongly related to functional homology. This study, however, aimed to assign protein functions automatically by utilizing InterPro (IPR), which can represent the properties of protein families and groups of protein functions. Moreover, we used the Gene Ontology (GO), the controlled vocabulary used to comprehensively describe protein function. To define the relationship between IPR and GO terms, three pattern recognition techniques were employed under different conditions, such as feature selection and weighted values instead of binary ones.


2020, Vol 7 (1), pp. 167-187
Author(s): Anthony R. Dawson, Gary M. Wilson, Joshua J. Coon, Andrew Mehle

Influenza virus exploits cellular factors to complete each step of viral replication. Yet, multiple host proteins actively block replication. Consequently, infection success depends on the relative speed and efficacy at which both the virus and host use their respective effectors. Post-translational modifications (PTMs) afford both the virus and the host means to readily adapt protein function without the need for new protein production. Here we use influenza virus to address concepts common to all viruses, reviewing how PTMs facilitate and thwart each step of the replication cycle. We also discuss advancements in proteomic methods that better characterize PTMs. Although some effectors and PTMs have clear pro- or antiviral functions, PTMs generally play regulatory roles to tune protein functions, levels, and localization. Synthesis of our current understanding reveals complex regulatory schemes where the effects of PTMs are time and context dependent as the virus and host battle to control infection.


2015, Vol 2015, pp. 1-9
Author(s): Jian-Sheng Wu, Hai-Feng Hu, Shan-Cheng Yan, Li-Hua Tang

Nature often brings several domains together to form multidomain and multifunctional proteins with a vast number of possibilities. In our previous study, we showed that the protein function prediction problem is naturally and inherently a Multi-Instance Multilabel (MIML) learning task. Automated protein function prediction is typically implemented under the assumption that the functions of labeled proteins are complete; that is, there are no missing labels. In practice, by contrast, only a subset of the functions of a protein is known, and whether the protein has other functions is unknown. Protein function prediction tasks therefore suffer from the weak-label problem; thus protein function prediction with incomplete annotation matches well with the MIML with weak-label learning framework. In this paper, we have applied the state-of-the-art MIML with weak-label learning algorithm MIMLwel to predict protein functions in two typical real-world electricigens organisms which have been widely used in microbial fuel cell (MFC) research. Our experimental results validate the effectiveness of the MIMLwel algorithm in predicting protein functions with incomplete annotation.


2019, Vol 400 (3), pp. 275-288
Author(s): Kale Kundert, Tanja Kortemme

The ability to engineer the precise geometries, fine-tuned energetics and subtle dynamics that are characteristic of functional proteins is a major unsolved challenge in the field of computational protein design. In natural proteins, functional sites exhibiting these properties often feature structured loops. However, unlike the elements of secondary structures that comprise idealized protein folds, structured loops have been difficult to design computationally. Addressing this shortcoming in a general way is a necessary first step towards the routine design of protein function. In this perspective, we will describe the progress that has been made on this problem and discuss how recent advances in the field of loop structure prediction can be harnessed and applied to the inverse problem of computational loop design.


2013, Vol 11 (04), pp. 1350008
Author(s): JINGYU HOU, YONGQING JIANG

The availability of large amounts of protein–protein interaction (PPI) data makes it feasible to use computational approaches to predict protein functions. Existing computational approaches exploit the known function information of annotated proteins in the PPI data to predict the functions of un-annotated proteins. However, these approaches treat the prediction domain (i.e. the set of proteins from which the functions are predicted) as fixed during the prediction procedure. As a result, valuable information may be overwhelmed by the unavoidable noise in the PPI data, which in turn distorts the prediction results. In this paper, we propose a novel method to dynamically predict protein functions from the PPI data. Our method regards function prediction as a dynamic process of finding a suitable prediction domain, from which representative functions of the domain are selected to predict the functions of un-annotated proteins. Our method exploits the topological structural information of a PPI network and the semantic relationship between protein functions to measure the relationship between proteins, dynamically select a suitable prediction domain, and predict functions. Evaluation on real PPI datasets demonstrated the effectiveness of the proposed method, which generated better prediction results.


Author(s): HEE-JEONG JIN, HWAN-GUE CHO

In the post-genomic era, predicting protein function is a challenging problem. It is difficult and burdensome work to unravel the functions of a protein by wet experiments only. In this paper, we propose a novel method to predict protein functions by building a "Protein Interaction Network Dictionary (PIND)". This method deduces the protein functions by searching the most similar "words" (an anagram of functions in neighbor proteins on a protein–protein interaction graph) using global alignments. An evaluation of sensitivity and specificity shows that this PIND approach outperforms previous approaches such as Majority Rule and Chi-Square measure, and that it competes with the recently introduced Random Markov Model approach.
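
For context, the Majority Rule baseline named above is commonly understood as neighbour voting on the interaction graph: an un-annotated protein receives the functions that occur most often among its direct interaction partners. The sketch below illustrates that baseline only, not PIND itself; the function name `majority_rule`, the `top_k` parameter, and the toy graph are assumptions made for illustration.

```python
from collections import Counter


def majority_rule(graph, annotations, protein, top_k=1):
    """Predict the top_k most frequent functions among a protein's annotated neighbours."""
    counts = Counter()
    for neighbour in graph.get(protein, ()):
        # Un-annotated neighbours contribute no votes.
        counts.update(annotations.get(neighbour, ()))
    # Rank by descending vote count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [function for function, _ in ranked[:top_k]]


# Toy interaction graph (adjacency lists) and annotations; all names are made up.
graph = {"Q": ["A", "B", "C"], "A": ["Q"], "B": ["Q"], "C": ["Q"]}
annotations = {
    "A": {"transport"},
    "B": {"transport", "signaling"},
    "C": {"metabolism"},
}

prediction = majority_rule(graph, annotations, "Q", top_k=2)
print(prediction)  # ['transport', 'metabolism']
```

Because the vote uses only direct neighbours and ignores how functions co-occur, it serves as a simple reference point against which neighbourhood-pattern methods can be compared.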

