NetQuilt: Deep Multispecies Network-based Protein Function Prediction using Homology-informed Network Similarity

Bioinformatics ◽

10.1093/bioinformatics/btab098 ◽

2021 ◽

Author(s):

Meet Barot ◽

Vladimir Gligorijević ◽

Kyunghyun Cho ◽

Richard Bonneau

Keyword(s):

Biological Networks ◽

Protein Function ◽

Functional Annotation ◽

Sequence Similarity ◽

Function Prediction ◽

Supplementary Information ◽

Learning Sequence ◽

Network Information ◽

PPI Networks ◽

Multiple Species

Abstract. Motivation: Transferring knowledge between species is challenging: different species contain distinct proteomes and cellular architectures, which cause their proteins to carry out different functions via different interaction networks. Many approaches to protein functional annotation use sequence similarity to transfer knowledge between species. These approaches cannot produce accurate predictions for proteins without homologues of known function, as many functions require cellular context for meaningful prediction. To supply this context, network-based methods use protein-protein interaction (PPI) networks as a source of information for inferring protein function and have demonstrated promising results in function prediction. However, most of these methods are tied to a network for a single species, and many species lack biological networks. Results: In this work, we integrate sequence and network information across multiple species by computing IsoRank similarity scores to create a meta-network profile of the proteins of multiple species. We use this integrated multispecies meta-network as input to train a maxout neural network with Gene Ontology terms as target labels. Our multispecies approach takes advantage of more training examples, and consequently leads to significant improvements in function prediction performance compared to two network-based methods, a deep learning sequence-based method, and the BLAST annotation method used in the Critical Assessment of Functional Annotation. We demonstrate that our approach performs well even when a species has no network information available: when an organism's PPI network is left out, our multispecies method still makes predictions for the left-out organism with good performance. Availability: The code is freely available at https://github.com/nowittynamesleft/NetQuilt. Supplementary information: Supplementary data are available at Bioinformatics online.
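The abstract does not spell out how the IsoRank similarity scores are computed. As a rough illustration only, the sketch below follows the general IsoRank recurrence for two networks: a pair of proteins scores highly when their neighbours also score highly, blended with a sequence-similarity prior. The function name `isorank`, the parameter `alpha` and its default, and the toy inputs are assumptions made for this example and are not taken from the NetQuilt code.

```python
def isorank(adj1, adj2, seq_sim, alpha=0.6, iters=50):
    """Power-iteration sketch of IsoRank-style similarity between two networks.

    adj1, adj2: symmetric 0/1 adjacency matrices (lists of lists).
    seq_sim:    len(adj1) x len(adj2) non-negative sequence-similarity scores.
    alpha:      weight on network topology versus sequence similarity
                (hypothetical default, chosen only for illustration).
    """
    n1, n2 = len(adj1), len(adj2)
    deg1 = [sum(row) for row in adj1]
    deg2 = [sum(row) for row in adj2]

    def normalize(m):
        total = sum(sum(row) for row in m)
        return [[x / total for x in row] for row in m]

    prior = normalize(seq_sim)
    scores = [row[:] for row in prior]
    for _ in range(iters):
        new = [[0.0] * n2 for _ in range(n1)]
        for i in range(n1):
            for j in range(n2):
                # Each neighbour pair (u, v) spreads its score evenly over
                # all deg(u) * deg(v) pairs formed by their own neighbours.
                topo = sum(
                    scores[u][v] / (deg1[u] * deg2[v])
                    for u in range(n1) if adj1[i][u]
                    for v in range(n2) if adj2[j][v]
                )
                new[i][j] = alpha * topo + (1 - alpha) * prior[i][j]
        scores = normalize(new)
    return scores


# Toy example: two identical 3-node paths (0-1-2) with a flat sequence prior.
path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
flat = [[1.0] * 3 for _ in range(3)]
sim = isorank(path, path, flat)
```

With a flat prior, the highest score in this toy case goes to the pair of hub nodes, because topology is the only signal that distinguishes the pairs.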


NetQuilt: Deep Multispecies Network-based Protein Function Prediction using Homology-informed Network Similarity

10.1101/2020.07.30.227611 ◽

2020 ◽

Author(s):

Meet Barot ◽

Vladimir Gligorijevic ◽

Kyunghyun Cho ◽

Richard Bonneau

Keyword(s):

Biological Networks ◽

Protein Function ◽

Sequence Similarity ◽

Protein Function Prediction ◽

Single Species ◽

Function Prediction ◽

Alignment Algorithm ◽

Network Information ◽

PPI Networks ◽

Multiple Species

Transferring knowledge between species is challenging: different species contain distinct proteomes and cellular architectures, which cause their proteins to carry out different functions via different interaction networks. Many approaches to proteome and biological network functional annotation use sequence similarity to transfer knowledge between species. These similarity-based approaches cannot produce accurate predictions for proteins without homologues of known function, as many functions require cellular or organismal context for meaningful function prediction. In order to supply this context, network-based methods use protein-protein interaction (PPI) networks as a source of information for inferring protein function and have demonstrated promising results in function prediction. However, the majority of these methods are tied to a network for a single species, and many species lack biological networks. In this work, we integrate sequence and network information across multiple species by applying an IsoRank-derived network alignment algorithm to create a meta-network profile of the proteins of multiple species. We then use this integrated multispecies meta-network as input features to train a maxout neural network with Gene Ontology terms as target labels. Our multispecies approach takes advantage of more training examples, and more diverse examples from multiple organisms, and consequently leads to significant improvements in function prediction performance. Further, we evaluate our approach in a setting in which an organism's PPI network is left out, using other organisms' network information and sequence homology in order to make predictions for the left-out organism, to simulate cases in which a newly sequenced species has no network information available.
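For readers unfamiliar with maxout networks, the following minimal sketch shows what a single maxout layer computes: each unit takes the maximum over several linear functions of its input. The layer sizes, the argument layout and the name `maxout_layer` are assumptions for illustration; the abstract does not describe the architecture actually used.

```python
def maxout_layer(x, weights, biases):
    """Apply one maxout layer.

    x:       input feature vector (list of floats).
    weights: weights[unit][piece] is a weight vector the same length as x.
    biases:  biases[unit][piece] is the matching scalar bias.
    Returns one activation per unit: the max over that unit's linear pieces.
    """
    out = []
    for unit_w, unit_b in zip(weights, biases):
        pieces = [
            sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
            for w, b in zip(unit_w, unit_b)
        ]
        out.append(max(pieces))
    return out


# Two units with two linear pieces each, applied to a 2-dimensional input.
x = [1.0, -2.0]
weights = [
    [[1.0, 0.0], [0.0, 1.0]],    # unit 0: max(x0, x1)
    [[-1.0, 0.0], [0.0, -1.0]],  # unit 1: max(-x0, -x1)
]
biases = [[0.0, 0.0], [0.0, 0.0]]
activations = maxout_layer(x, weights, biases)
```

In a function-prediction setting, a stack of such layers would typically end in one sigmoid output per Gene Ontology term, but that detail is likewise an assumption here.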


Phylo-PFP: improved automated protein function prediction using phylogenetic distance of distantly related sequences

Bioinformatics ◽

10.1093/bioinformatics/bty704 ◽

2018 ◽

Vol 35 (5) ◽

pp. 753-759 ◽

Cited By ~ 8

Author(s):

Aashish Jain ◽

Daisuke Kihara

Keyword(s):

Protein Function ◽

Transfer Functions ◽

Sequence Similarity ◽

Protein Function Prediction ◽

Prediction Method ◽

Query Protein ◽

Function Prediction ◽

Homology Search ◽

Supplementary Information ◽

Phylogenetic Distance

Abstract. Motivation: Function annotation of proteins is fundamental in contemporary biology across fields including genomics, molecular biology, biochemistry, systems biology and bioinformatics. Function prediction is indispensable in providing clues for interpreting omics-scale data as well as in assisting biologists to build hypotheses for designing experiments. As sequencing genomes is now routine due to the rapid advancement of sequencing technologies, computational protein function prediction methods have become increasingly important. A conventional method of annotating a protein sequence is to transfer functions from top hits of a homology search; however, this approach has substantial shortcomings, including low coverage in genome annotation. Results: Here we have developed Phylo-PFP, a new sequence-based protein function prediction method, which mines functional information from a broad range of similar sequences, including those with low sequence similarity identified by a PSI-BLAST search. To evaluate functional similarity between identified sequences and the query protein more accurately, Phylo-PFP reranks retrieved sequences by considering their phylogenetic distance. Compared to its predecessor, PFP, which was among the top-ranked methods in the second round of the Critical Assessment of Functional Annotation (CAFA2), Phylo-PFP demonstrated substantial improvement in prediction accuracy. Phylo-PFP was further shown to outperform the prediction programs that ranked at the top in CAFA2. Availability and implementation: The Phylo-PFP web server is available at http://kiharalab.org/phylo_pfp.php. Supplementary information: Supplementary data are available at Bioinformatics online.


DeepGOPlus: improved protein function prediction from sequence

Bioinformatics ◽

10.1093/bioinformatics/btz595 ◽

2019 ◽

Cited By ~ 17

Author(s):

Maxat Kulmanov ◽

Robert Hoehndorf

Keyword(s):

Protein Function ◽

Drug Targets ◽

Sequence Similarity ◽

Protein Function Prediction ◽

Function Prediction ◽

Supplementary Information ◽

Protein Protein Interaction ◽

Wide Range ◽

Protein Functions ◽

Novel Method

Abstract. Motivation: Protein function prediction is one of the major tasks of bioinformatics and can help with a wide range of biological problems, such as understanding disease mechanisms or finding drug targets. Many methods are available for predicting protein functions from sequence-based features, protein–protein interaction networks, protein structure or literature. However, other than sequence, most of these features are difficult to obtain or not available for many proteins, which limits their scope. Furthermore, the performance of sequence-based function prediction methods is often lower than that of methods that incorporate multiple features, and predicting protein functions may require a lot of time. Results: We developed a novel method for predicting protein functions from sequence alone, which combines a deep convolutional neural network (CNN) model with sequence-similarity-based predictions. Our CNN model scans the sequence for motifs that are predictive of protein functions and combines this with functions of similar proteins (if available). We evaluate the performance of DeepGOPlus using the CAFA3 evaluation measures and achieve an Fmax of 0.390, 0.557 and 0.614 for BPO, MFO and CCO evaluations, respectively. These results would have made DeepGOPlus one of the three best predictors in CCO and the second best performing method in the BPO and MFO evaluations. We also compare DeepGOPlus with state-of-the-art methods such as DeepText2GO and GOLabeler on another dataset. DeepGOPlus can annotate around 40 protein sequences per second on common hardware, thereby making fast and accurate function predictions available for a wide range of proteins. Availability and implementation: http://deepgoplus.bio2vec.net/. Supplementary information: Supplementary data are available at Bioinformatics online.
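The Fmax values quoted above are protein-centric scores: for each decision threshold, precision is averaged over proteins that receive at least one prediction, recall is averaged over all benchmark proteins, and Fmax is the best resulting F-measure over thresholds. The sketch below illustrates that calculation on toy data. The function name `fmax`, the threshold grid and the example GO identifiers are assumptions for illustration, and the sketch omits propagation of annotations up the ontology, which real evaluations apply first.

```python
def fmax(predictions, truths, thresholds=None):
    """Protein-centric Fmax on toy data.

    predictions: {protein: {go_term: score in [0, 1]}}
    truths:      {protein: non-empty set of true go_terms}
    """
    if thresholds is None:
        thresholds = [t / 100 for t in range(1, 101)]
    best = 0.0
    for t in thresholds:
        precisions, recalls = [], []
        for protein, true_terms in truths.items():
            scored = predictions.get(protein, {})
            predicted = {term for term, s in scored.items() if s >= t}
            hits = len(predicted & true_terms)
            # Recall is averaged over every benchmark protein.
            recalls.append(hits / len(true_terms))
            # Precision is averaged only over proteins predicted at this t.
            if predicted:
                precisions.append(hits / len(predicted))
        if not precisions:
            continue
        p = sum(precisions) / len(precisions)
        r = sum(recalls) / len(recalls)
        if p + r > 0:
            best = max(best, 2 * p * r / (p + r))
    return best


# Hypothetical predictions for two proteins (term names are placeholders).
predictions = {
    "P1": {"GO:a": 0.9, "GO:b": 0.4},
    "P2": {"GO:c": 0.8},
}
truths = {"P1": {"GO:a"}, "P2": {"GO:c", "GO:d"}}
score = fmax(predictions, truths)
```

On this toy input the best trade-off occurs at thresholds between 0.4 and 0.8, where precision is 1 and recall is 0.75.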


Protein Function Prediction Based on PPI Networks: Network Reconstruction vs Edge Enrichment

Frontiers in Genetics ◽

10.3389/fgene.2021.758131 ◽

2021 ◽

Vol 12 ◽

Author(s):

Jiaogen Zhou ◽

Wei Xiong ◽

Yang Wang ◽

Jihong Guan

Keyword(s):

Protein Function ◽

Sequence Similarity ◽

Protein Function Prediction ◽

Network Reconstruction ◽

Function Prediction ◽

Protein Protein Interaction ◽

PPI Networks ◽

Performance Differences ◽

Global Similarity ◽

Better Than

Over the past decades, massive amounts of protein-protein interaction (PPI) data have accumulated owing to advances in high-throughput technologies, but data quality issues (noise or incompleteness) still limit the accuracy of protein function prediction based on PPI networks. Although two main strategies, network reconstruction and edge enrichment, have been reported in numerous studies to boost prediction performance, comparative studies of the performance differences between them are still lacking. Motivated by this gap, this study first uses three protein similarity metrics (local, global and sequence) for network reconstruction and edge enrichment in PPI networks, and then evaluates the performance differences among network reconstruction, edge enrichment and the original networks on two real PPI datasets. The experimental results demonstrate that edge enrichment works better than both network reconstruction and the original networks. Moreover, for edge enrichment of PPI networks, sequence similarity outperforms both local and global similarity. In summary, our study can help biologists select suitable pre-processing schemes and achieve better protein function prediction for PPI networks.


Integration of relational and hierarchical network information for protein function prediction

BMC Bioinformatics ◽

10.1186/1471-2105-9-350 ◽

2008 ◽

Vol 9 (1) ◽

pp. 350 ◽

Cited By ~ 28

Author(s):

Xiaoyu Jiang ◽

Naoki Nariai ◽

Martin Steffen ◽

Simon Kasif ◽

Eric D Kolaczyk

Keyword(s):

Protein Function ◽

Protein Function Prediction ◽

Function Prediction ◽

Hierarchical Network ◽

Network Information


Modules in Biological Networks

Bioinformatics ◽

10.4018/978-1-4666-3604-0.ch034 ◽

2013 ◽

pp. 637-663

Author(s):

Bing Zhang ◽

Zhiao Shi

Keyword(s):

Biological Networks ◽

Protein Complex ◽

Protein Function ◽

Protein Function Prediction ◽

Single Gene ◽

Biological Systems ◽

Function Prediction ◽

Diverse Group ◽

Biological Studies ◽

Gene Level

One of the most prominent properties of networks representing complex systems is modularity. Network-based module identification has captured the attention of a diverse group of scientists from various domains and a variety of methods have been developed. The ability to decompose complex biological systems into modules allows the use of modules rather than individual genes as units in biological studies. A modular view is shaping research methods in biology. Module-based approaches have found broad applications in protein complex identification, protein function prediction, protein expression prediction, as well as disease studies. Compared to single gene-level analyses, module-level analyses offer higher robustness and sensitivity. More importantly, module-level analyses can lead to a better understanding of the design and organization of complex biological systems.


The CAFA challenge reports improved protein function prediction and new functional annotations for hundreds of genes through experimental screens

Genome Biology ◽

10.1186/s13059-019-1835-8 ◽

2019 ◽

Vol 20 (1) ◽

Cited By ~ 41

Author(s):

Naihui Zhou ◽

Yuxiang Jiang ◽

Timothy R. Bergquist ◽

Alexandra J. Lee ◽

Balint Z. Kacsoh ◽

...

Keyword(s):

Protein Function ◽

Functional Annotation ◽

Protein Function Prediction ◽

Mutation Screening ◽

Function Prediction ◽

Long Term Memory ◽

Functional Annotations ◽

Genome Wide ◽

New Development ◽

Working Together

Abstract. Background: The Critical Assessment of Functional Annotation (CAFA) is an ongoing, global, community-driven effort to evaluate and improve the computational annotation of protein function. Results: Here, we report on the results of the third CAFA challenge, CAFA3, that featured an expanded analysis over the previous CAFA rounds, both in terms of volume of data analyzed and the types of analysis performed. In a major new development, computational predictions and assessment goals drove some of the experimental assays, resulting in new functional annotations for more than 1000 genes. Specifically, we performed experimental whole-genome mutation screening in Candida albicans and Pseudomonas aeruginosa genomes, which provided us with genome-wide experimental data for genes associated with biofilm formation and motility. We further performed targeted assays on selected genes in Drosophila melanogaster, which we suspected of being involved in long-term memory. Conclusion: We conclude that while predictions of the molecular function and biological process annotations have slightly improved over time, those of the cellular component have not. Term-centric prediction of experimental annotations remains equally challenging; although the performance of the top methods is significantly better than the expectations set by baseline methods in C. albicans and D. melanogaster, it leaves considerable room and need for improvement. Finally, we report that the CAFA community now involves a broad range of participants with expertise in bioinformatics, biological experimentation, biocuration, and bio-ontologies, working together to improve functional annotation, computational function prediction, and our ability to manage big data in the era of large experimental screens.


Accurate and efficient gene function prediction using a multi-bacterial network

Bioinformatics ◽

10.1093/bioinformatics/btaa885 ◽

2020 ◽

Author(s):

Jeffrey N Law ◽

Shiv D Kale ◽

T M Murali

Keyword(s):

Gene Function ◽

Bacterial Species ◽

Heterogeneous Data ◽

Function Prediction ◽

Label Propagation ◽

Supplementary Information ◽

Gene Function Prediction ◽

Functional Annotations ◽

A Genome ◽

Multiple Species

Abstract. Motivation: Nearly 40% of the genes in sequenced genomes have no experimentally or computationally derived functional annotations. To fill this gap, we seek to develop methods for network-based gene function prediction that can integrate heterogeneous data for multiple species with experimentally based functional annotations and systematically transfer them to newly sequenced organisms on a genome-wide scale. However, the large sizes of such networks pose a challenge for the scalability of current methods. Results: We develop a label propagation algorithm called FastSinkSource. By formally bounding its rate of progress, we decrease the running time by a factor of 100 without sacrificing accuracy. We systematically evaluate many approaches to construct multi-species bacterial networks and apply FastSinkSource and other state-of-the-art methods to these networks. We find that the most accurate and efficient approach is to pre-compute annotation scores for species with experimental annotations, and then to transfer them to other organisms. In this manner, FastSinkSource runs in under 3 min for 200 bacterial species. Availability and implementation: An implementation of our framework and all data used in this research are available at https://github.com/Murali-group/multi-species-GOA-prediction. Supplementary information: Supplementary data are available at Bioinformatics online.
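To make the idea of label propagation concrete, here is a generic sketch in which annotated genes act as fixed sources (score 1) and sinks (score 0), and every unlabeled gene repeatedly takes the weighted average of its neighbours' scores. This is not the FastSinkSource algorithm itself: the convergence bound that gives the reported speed-up is not reproduced, and the function name `propagate_labels`, the fixed iteration count and the toy network are assumptions for illustration.

```python
def propagate_labels(weights, positives, negatives, iters=100):
    """Iteratively score unlabeled nodes from fixed seed labels.

    weights:   {node: {neighbor: edge_weight}}, symmetric, positive weights
    positives: nodes clamped to score 1.0
    negatives: nodes clamped to score 0.0
    """
    scores = {node: 0.0 for node in weights}
    for node in positives:
        scores[node] = 1.0
    unlabeled = [
        n for n in weights if n not in positives and n not in negatives
    ]
    for _ in range(iters):
        updated = dict(scores)
        for node in unlabeled:
            total = sum(weights[node].values())
            if total > 0:
                # New score is the weighted mean of the neighbours' scores.
                updated[node] = sum(
                    w * scores[nbr] for nbr, w in weights[node].items()
                ) / total
        scores = updated
    return scores


# Toy chain: pos - a - b - neg, all edges with unit weight.
network = {
    "pos": {"a": 1.0},
    "a": {"pos": 1.0, "b": 1.0},
    "b": {"a": 1.0, "neg": 1.0},
    "neg": {"b": 1.0},
}
result = propagate_labels(network, positives={"pos"}, negatives={"neg"})
```

On the toy chain the scores settle so that the gene adjacent to the positive seed ends up with roughly twice the score of the gene adjacent to the negative seed.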


Unsupervised protein embeddings outperform hand-crafted sequence and structure features at predicting molecular function

Bioinformatics ◽

10.1093/bioinformatics/btaa701 ◽

2020 ◽

Cited By ~ 1

Author(s):

Amelia Villegas-Morcillo ◽

Stavros Makrodimitris ◽

Roeland C H J van Ham ◽

Angel M Gomez ◽

Victoria Sanchez ◽

...

Keyword(s):

Protein Function ◽

Prediction Models ◽

Protein Function Prediction ◽

3D Structure ◽

Function Prediction ◽

Feature Representation ◽

Training Data ◽

Supplementary Information ◽

Molecular Function ◽

Structure Information

Abstract. Motivation: Protein function prediction is a difficult bioinformatics problem. Many recent methods use deep neural networks to learn complex sequence representations and predict function from these. Deep supervised models require large amounts of labeled training data, which are not available for this task. However, a very large number of protein sequences without functional labels is available. Results: We applied an existing deep sequence model that had been pretrained in an unsupervised setting to the supervised task of protein molecular function prediction. We found that this complex feature representation is effective for this task, outperforming hand-crafted features such as one-hot encoding of amino acids, k-mer counts, secondary structure and backbone angles. Also, it partly negates the need for complex prediction models, as a two-layer perceptron was enough to achieve competitive performance in the third Critical Assessment of Functional Annotation benchmark. We also show that combining this sequence representation with protein 3D structure information does not lead to performance improvement, hinting that 3D structure is also potentially learned during the unsupervised pretraining. Availability and implementation: Implementations of all used models can be found at https://github.com/stamakro/GCN-for-Structure-and-Function. Supplementary information: Supplementary data are available at Bioinformatics online.


Algorithms for protein interaction networks

Biochemical Society Transactions ◽

10.1042/bst0330530 ◽

2005 ◽

Vol 33 (3) ◽

pp. 530-534 ◽

Cited By ~ 3

Author(s):

M. Lappe ◽

L. Holm

Keyword(s):

Protein Interactions ◽

Biological Networks ◽

Protein Function ◽

Large Scale ◽

Sequence Similarity ◽

Functional Characterization ◽

Interaction Networks ◽

Computational Techniques ◽

Interaction Patterns ◽

Main Challenge

The functional characterization of all genes and their gene products is the main challenge of the postgenomic era. Recent experimental and computational techniques have enabled the study of interactions among all proteins on a large scale. In this paper, approaches will be presented to exploit interaction information for the inference of protein structure, function, signalling pathways and ultimately entire interactomes. Interaction networks can be modelled as graphs, showing the operation of gene function in terms of protein interactions. Since the architecture of biological networks differs distinctly from random networks, these functional maps contain a signal that can be used for predictive purposes. Protein function and structure can be predicted by matching interaction patterns, without the requirement of sequence similarity. Moving on to a higher level definition of protein function, the question arises how to decompose complex networks into meaningful subsets. An algorithm will be demonstrated, which extracts whole signal-transduction pathways from noisy graphs derived from text-mining the biological literature. Finally, an algorithmic strategy is formulated that enables the proteomics community to build a reliable scaffold of the interactome in a fraction of the time compared with uncoordinated efforts.
