The CAFA challenge reports improved protein function prediction and new functional annotations for hundreds of genes through experimental screens

Abstract. Background: The Critical Assessment of Functional Annotation (CAFA) is an ongoing, global, community-driven effort to evaluate and improve the computational annotation of protein function. Results: Here, we report on the results of the third CAFA challenge, CAFA3, which featured an expanded analysis over the previous CAFA rounds, both in the volume of data analyzed and in the types of analysis performed. In a major new development, computational predictions and assessment goals drove some of the experimental assays, resulting in new functional annotations for more than 1000 genes. Specifically, we performed experimental whole-genome mutation screening in Candida albicans and Pseudomonas aeruginosa, which provided us with genome-wide experimental data for genes associated with biofilm formation and motility. We further performed targeted assays on selected genes in Drosophila melanogaster that we suspected of being involved in long-term memory. Conclusion: We conclude that while predictions of the molecular function and biological process annotations have slightly improved over time, those of the cellular component have not. Term-centric prediction of experimental annotations remains equally challenging; although the performance of the top methods is significantly better than the expectations set by baseline methods in C. albicans and D. melanogaster, it leaves considerable room and need for improvement. Finally, we report that the CAFA community now involves a broad range of participants with expertise in bioinformatics, biological experimentation, biocuration, and bio-ontologies, working together to improve functional annotation, computational function prediction, and our ability to manage big data in the era of large experimental screens.

The CAFA challenge reports improved protein function prediction and new functional annotations for hundreds of genes through experimental screens

10.1101/653105 ◽

2019 ◽

Cited By ~ 6

Author(s):

Naihui Zhou ◽

Yuxiang Jiang ◽

Timothy R Bergquist ◽

Alexandra J Lee ◽

Balint Z Kacsoh ◽

...

Keyword(s):

Protein Function ◽

Functional Annotation ◽

Protein Function Prediction ◽

Mutation Screening ◽

Function Prediction ◽

Long Term Memory ◽

Functional Annotations ◽

Genome Wide ◽

New Development ◽

Working Together

Abstract: The Critical Assessment of Functional Annotation (CAFA) is an ongoing, global, community-driven effort to evaluate and improve the computational annotation of protein function. Here we report on the results of the third CAFA challenge, CAFA3, which featured an expanded analysis over the previous CAFA rounds, both in the volume of data analyzed and in the types of analysis performed. In a major new development, computational predictions and assessment goals drove some of the experimental assays, resulting in new functional annotations for more than 1000 genes. Specifically, we performed experimental whole-genome mutation screening in Candida albicans and Pseudomonas aeruginosa, which provided us with genome-wide experimental data for genes associated with biofilm formation and motility (P. aeruginosa only). We further performed targeted assays on selected genes in Drosophila melanogaster that we suspected of being involved in long-term memory. We conclude that, while predictions of the molecular function and biological process annotations have slightly improved over time, those of the cellular component have not. Term-centric prediction of experimental annotations remains equally challenging; although the performance of the top methods is significantly better than the expectations set by baseline methods in C. albicans and D. melanogaster, it leaves considerable room and need for improvement. Finally, we report that the CAFA community now involves a broad range of participants with expertise in bioinformatics, biological experimentation, biocuration, and bio-ontologies, working together to improve functional annotation, computational function prediction, and our ability to manage big data in the era of large experimental screens.

FunPred 3.0: improved protein function prediction using protein interaction network

PeerJ ◽

10.7717/peerj.6830 ◽

2019 ◽

Vol 7 ◽

pp. e6830 ◽

Cited By ~ 1

Author(s):

Sovan Saha ◽

Piyali Chatterjee ◽

Subhadip Basu ◽

Mita Nasipuri ◽

Dariusz Plewczynski

Keyword(s):

Protein Interaction ◽

Protein Interactions ◽

Protein Interaction Network ◽

Protein Function ◽

Protein Function Prediction ◽

Experimental Studies ◽

Interaction Network ◽

Function Prediction ◽

The Self ◽

Functional Annotations

Proteins are the most versatile macromolecules in living systems and perform crucial biological functions. With the advent of the post-genomic era, next-generation sequencing is performed routinely at the population scale for a variety of species. The challenge is to determine, at scale, the functions of proteins that have not yet been characterized by detailed experimental studies. Identifying protein functions experimentally is a laborious, time-consuming and resource-intensive task. We therefore propose an automated protein function prediction methodology using in silico algorithms trained on carefully curated experimental datasets. We present the improved protein function prediction tool FunPred 3.0, an extended version of our previous methodology FunPred 2, which exploits neighborhood properties in the protein–protein interaction network (PPIN) and physicochemical properties of amino acids. Our method is validated using the available functional annotations in the PPIN of Saccharomyces cerevisiae in the latest Munich Information Center for Protein Sequences (MIPS) dataset. The PPIN data of S. cerevisiae in the MIPS dataset include 4,554 unique proteins in 13,528 protein–protein interactions after the elimination of the self-replicating and the self-interacting protein pairs. FunPred 3.0 achieves mean precision, recall and F-score values of 0.55, 0.82 and 0.66, respectively. FunPred 3.0 is then used to predict the functions of unpredicted protein pairs (incomplete and missing functional annotations) in the MIPS dataset of S. cerevisiae. The method is also capable of predicting the subcellular localization of proteins along with their corresponding functions. The code and the complete prediction results are freely available at: https://github.com/SovanSaha/FunPred-3.0.git.
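
As a quick check on the metrics reported above, the F-score is conventionally the harmonic mean of precision and recall. The sketch below is illustrative only: the function name `f_score` is ours and is not part of the FunPred 3.0 code. It assumes the standard F1 definition, and since a mean of per-protein F-scores need not equal the F-score of the mean precision and recall, the agreement with the reported value should be read as approximate.

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both metrics are zero
    return 2 * precision * recall / (precision + recall)


# Mean precision and recall as reported in the abstract.
reported_precision, reported_recall = 0.55, 0.82
value = f_score(reported_precision, reported_recall)
print(round(value, 2))  # 0.66
```

The result matches the reported F-score of 0.66 to two decimal places.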

Genome-Wide Protein Function Prediction through Multi-Instance Multi-Label Learning

IEEE/ACM Transactions on Computational Biology and Bioinformatics ◽

10.1109/tcbb.2014.2323058 ◽

2014 ◽

Vol 11 (5) ◽

pp. 891-902 ◽

Cited By ~ 41

Author(s):

Jian-Sheng Wu ◽

Sheng-Jun Huang ◽

Zhi-Hua Zhou

Keyword(s):

Protein Function ◽

Protein Function Prediction ◽

Function Prediction ◽

Genome Wide

Predicting Protein Functions from Protein Interaction Networks

International Journal of Knowledge Discovery in Bioinformatics ◽

10.4018/ijkdb.2012100104 ◽

2012 ◽

Vol 3 (4) ◽

pp. 50-70 ◽

Cited By ~ 1

Author(s):

Hon Nian Chua ◽

Limsoon Wong

Keyword(s):

Protein Interaction ◽

Protein Function ◽

Protein Function Prediction ◽

Functional Characterization ◽

Function Prediction ◽

Functional Annotations ◽

Protein Protein Interaction ◽

Protein Functions ◽

High Throughput Manner

Functional characterization of genes and their protein products is essential to biological and clinical research. Yet, there is still no reliable way of assigning functional annotations to proteins in a high-throughput manner. In this article, the authors provide an introduction to the task of automated protein function prediction. They discuss the motivation for automated protein function prediction, the challenges faced in this task, and some of the approaches that are currently available. In particular, they take a closer look at methods that use protein-protein interactions for protein function prediction, elaborating on their underlying techniques and assumptions, as well as their strengths and limitations.

Multi-Instance Metric Transfer Learning for Genome-Wide Protein Function Prediction

Scientific Reports ◽

10.1038/srep41831 ◽

2017 ◽

Vol 7 (1) ◽

Cited By ~ 7

Author(s):

Yonghui Xu ◽

Huaqing Min ◽

Qingyao Wu ◽

Hengjie Song ◽

Bicui Ye

Keyword(s):

Transfer Learning ◽

Protein Function ◽

Protein Function Prediction ◽

Function Prediction ◽

Genome Wide

Predicting Protein Functions from Protein Interaction Networks

Biological Data Mining in Protein Interaction Networks ◽

10.4018/978-1-60566-398-2.ch012 ◽

2009 ◽

pp. 203-222 ◽

Cited By ~ 3

Author(s):

Hon Nian Chua ◽

Limsoon Wong

Keyword(s):

Protein Interaction ◽

Protein Function ◽

Protein Function Prediction ◽

Functional Characterization ◽

Function Prediction ◽

Functional Annotations ◽

Protein Protein Interaction ◽

Protein Functions ◽

High Throughput Manner

Functional characterization of genes and their protein products is essential to biological and clinical research. Yet, there is still no reliable way of assigning functional annotations to proteins in a high-throughput manner. In this chapter, the authors provide an introduction to the task of automated protein function prediction. They discuss the motivation for automated protein function prediction, the challenges faced in this task, and some of the approaches that are currently available. In particular, they take a closer look at methods that use protein-protein interactions for protein function prediction, elaborating on their underlying techniques and assumptions, as well as their strengths and limitations.

Integrated Network Approach to Protein Function Prediction

Information Technology and Management Science ◽

10.7250/itms-2018-0016 ◽

2018 ◽

Vol 21 ◽

pp. 98-103

Author(s):

Natalia Novoselova ◽

Igar Tom

Keyword(s):

Protein Function ◽

Protein Function Prediction ◽

Function Prediction ◽

Biological Data ◽

Label Propagation ◽

Integrated Network ◽

Functional Annotations ◽

Additional Information ◽

Integration Schemes ◽

Protein Functions

One of the main problems in functional genomics is the prediction of unknown gene/protein functions. With the rapid growth of high-throughput technologies, a vast amount of biological data describing different aspects of cellular functioning has become available, making it possible to use these data as additional information sources for function prediction and to improve prediction accuracy. In our research, we describe an approach to protein function prediction based on the integration of several biological datasets. Initially, each dataset is represented as a graph (or network), where the nodes represent genes or their products and the edges represent physical, functional or chemical relationships between nodes. The integration process makes it possible to estimate the importance of each network for the prediction of a particular function, taking into account the imbalance between the functional annotations, notably the disproportion between positively and negatively annotated proteins. Protein function prediction then consists of applying the label propagation algorithm to the integrated biological network in order to annotate uncharacterized proteins or to assign new functions to already characterized proteins. A comparative analysis of prediction efficiency across several integration schemes shows a positive effect in terms of several performance measures.
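
The two steps described above, network integration followed by label propagation, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not give the weighting scheme, so `integrate` simply takes a weighted sum of edge weights with hand-picked weights, and `propagate` uses a generic damped propagation rule (score = alpha × weighted neighbour average + (1 − alpha) × seed label). All names, the toy networks, and the parameter values are hypothetical.

```python
from typing import Dict, List

Graph = Dict[str, Dict[str, float]]  # node -> {neighbour: edge weight}


def integrate(networks: List[Graph], weights: List[float]) -> Graph:
    """Combine networks by a weighted sum of edge weights (illustrative scheme)."""
    combined: Graph = {}
    for net, w in zip(networks, weights):
        for u, nbrs in net.items():
            combined.setdefault(u, {})
            for v, e in nbrs.items():
                combined.setdefault(v, {})  # make sure every endpoint is a node
                combined[u][v] = combined[u].get(v, 0.0) + w * e
    return combined


def propagate(graph: Graph, seeds: Dict[str, float],
              alpha: float = 0.8, iters: int = 50) -> Dict[str, float]:
    """Iteratively spread seed labels to neighbours, damped by alpha."""
    scores = {n: seeds.get(n, 0.0) for n in graph}
    for _ in range(iters):
        new_scores = {}
        for u, nbrs in graph.items():
            total = sum(nbrs.values())
            # Weighted average of neighbour scores (0 for isolated nodes).
            spread = sum(e * scores[v] for v, e in nbrs.items()) / total if total else 0.0
            new_scores[u] = alpha * spread + (1 - alpha) * seeds.get(u, 0.0)
        scores = new_scores
    return scores


# Toy, symmetric networks over three hypothetical proteins A, B, C.
ppi = {"A": {"B": 1.0}, "B": {"A": 1.0, "C": 1.0}, "C": {"B": 1.0}}
coexpression = {"A": {"C": 0.5}, "C": {"A": 0.5}}

network = integrate([ppi, coexpression], weights=[0.7, 0.3])
scores = propagate(network, seeds={"A": 1.0})  # only A is annotated with the function
for protein in sorted(scores, key=scores.get, reverse=True):
    print(protein, round(scores[protein], 3))
```

In this toy setting the annotated protein A keeps the highest score, and B, which is strongly connected to A, ranks above C. In practice the per-network weights would be estimated per function rather than fixed by hand.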

Probabilistic Protein Function Prediction from Heterogeneous Genome-Wide Data

PLoS ONE ◽

10.1371/journal.pone.0000337 ◽

2007 ◽

Vol 2 (3) ◽

pp. e337 ◽

Cited By ~ 58

Author(s):

Naoki Nariai ◽

Eric D. Kolaczyk ◽

Simon Kasif

Keyword(s):

Protein Function ◽

Protein Function Prediction ◽

Function Prediction ◽

Genome Wide ◽

Genome Wide Data

Multi-instance multi-label distance metric learning for genome-wide protein function prediction

Computational Biology and Chemistry ◽

10.1016/j.compbiolchem.2016.02.011 ◽

2016 ◽

Vol 63 ◽

pp. 30-40 ◽

Cited By ~ 6

Author(s):

Yonghui Xu ◽

Huaqing Min ◽

Hengjie Song ◽

Qingyao Wu

Keyword(s):

Protein Function ◽

Protein Function Prediction ◽

Metric Learning ◽

Function Prediction ◽

Distance Metric Learning ◽

Distance Metric ◽

Genome Wide

Accurate Protein Function Prediction via Graph Attention Networks with Predicted Structure Information

10.1101/2021.06.16.448727 ◽

2021 ◽

Author(s):

Boqiao Lai ◽

Jinbo Xu

Keyword(s):

Protein Function ◽

Structure Prediction ◽

Protein Function Prediction ◽

Function Prediction ◽

Language Models ◽

Structure Information ◽

Functional Annotations ◽

Residue Contact ◽

Sequence Identity ◽

Contact Graphs

Experimental protein function annotation does not scale with the fast-growing sequence databases. Only a tiny fraction (<0.1%) of protein sequences in UniProtKB have experimentally determined functional annotations. Computational methods can predict protein function in a high-throughput way, but their accuracy is not very satisfactory. Building on recent breakthroughs in protein structure prediction and protein language models, we develop GAT-GO, a graph attention network (GAT) method that may substantially improve protein function prediction by leveraging predicted inter-residue contact graphs and protein sequence embeddings. Our experimental results show that GAT-GO greatly outperforms the latest sequence- and structure-based deep learning methods. On the PDB-mmseqs test set, where the training and test proteins share <15% sequence identity, GAT-GO yields Fmax (maximum F-score) of 0.508, 0.416, 0.501 and AUPRC (area under the precision-recall curve) of 0.427, 0.253, 0.411 for the MFO, BPO and CCO ontology domains, respectively, much better than the homology-based method BLAST (Fmax 0.117, 0.121, 0.207 and AUPRC 0.120, 0.120, 0.163). On the PDB-cdhit test set, where the training and test proteins share higher sequence identity, GAT-GO obtains Fmax of 0.637, 0.501, 0.542 and AUPRC of 0.662, 0.384, 0.481 for the MFO, BPO and CCO ontology domains, respectively, significantly exceeding the just-published graph convolution method DeepFRI, which has Fmax 0.542, 0.425, 0.424 and AUPRC 0.313, 0.159, 0.193.
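
For readers unfamiliar with the Fmax metric quoted above, the sketch below shows one common way to compute it: the protein-centric definition used in CAFA-style evaluations, where precision is averaged over proteins with at least one prediction at a threshold, recall is averaged over all benchmark proteins, and Fmax is the best F-score over thresholds. Whether GAT-GO's evaluation follows exactly this protocol is an assumption; the function name, the placeholder term names (T1, T2, T3) and the toy scores are hypothetical.

```python
from typing import Dict, Set


def fmax(predictions: Dict[str, Dict[str, float]],
         truth: Dict[str, Set[str]], steps: int = 100) -> float:
    """Protein-centric Fmax: best F-score over score thresholds in (0, 1]."""
    best = 0.0
    for i in range(1, steps + 1):
        t = i / steps
        precisions, recalls = [], []
        for protein, true_terms in truth.items():
            # Terms predicted for this protein with score at or above t.
            predicted = {term for term, score in predictions.get(protein, {}).items()
                         if score >= t}
            hits = len(predicted & true_terms)
            if predicted:  # precision only counts proteins with predictions
                precisions.append(hits / len(predicted))
            recalls.append(hits / len(true_terms))  # recall counts every protein
        if not precisions:
            continue
        p = sum(precisions) / len(precisions)
        r = sum(recalls) / len(recalls)
        if p + r > 0:
            best = max(best, 2 * p * r / (p + r))
    return best


# Toy example: two proteins, placeholder term names, made-up scores.
preds = {"P1": {"T1": 0.9, "T2": 0.4}, "P2": {"T1": 0.6, "T3": 0.2}}
truth = {"P1": {"T1"}, "P2": {"T1", "T3"}}
print(round(fmax(preds, truth), 3))  # 0.857
```

The function assumes every benchmark protein has at least one true term, as is the case for benchmark sets built from experimentally annotated proteins.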
