Isoform function prediction based on bi-random walks on a heterogeneous network

Guoxian Yu; Keyao Wang; Carlotta Domeniconi; Maozu Guo; Jun Wang

doi:10.1093/bioinformatics/btz535

Isoform function prediction based on bi-random walks on a heterogeneous network

Bioinformatics ◽

10.1093/bioinformatics/btz535 ◽

2019 ◽

Vol 36 (1) ◽

pp. 303-310 ◽

Cited By ~ 4

Author(s):

Guoxian Yu ◽

Keyao Wang ◽

Carlotta Domeniconi ◽

Maozu Guo ◽

Jun Wang

Keyword(s):

Random Walks ◽

Heterogeneous Network ◽

Expression Profiles ◽

Function Prediction ◽

Supplementary Information ◽

Receiver Operating Curve ◽

Genomic Databases ◽

Developmental Abnormalities ◽

Functional Annotations ◽

Gene Level

Abstract Motivation Alternative splicing contributes to the functional diversity of protein species and the proteoforms translated from alternatively spliced isoforms of a gene actually execute the biological functions. Computationally predicting the functions of genes has been studied for decades. However, how to distinguish the functional annotations of isoforms, whose annotations are essential for understanding developmental abnormalities and cancers, is rarely explored. The main bottleneck is that functional annotations of isoforms are generally unavailable and functional genomic databases universally store the functional annotations at the gene level. Results We propose IsoFun to accomplish Isoform Function prediction based on bi-random walks on a heterogeneous network. IsoFun first constructs an isoform functional association network based on the expression profiles of isoforms derived from multiple RNA-seq datasets. Next, IsoFun uses the available Gene Ontology annotations of genes, gene–gene interactions and the relations between genes and isoforms to construct a heterogeneous network. After this, IsoFun performs a tailored bi-random walk on the heterogeneous network to predict the association between GO terms and isoforms, thus accomplishing the prediction of GO annotations of isoforms. Experimental results show that IsoFun significantly outperforms the state-of-the-art algorithms and improves the area under the receiver-operating curve (AUROC) and the area under the precision-recall curve (AUPRC) by 17% and 44% at the gene level, respectively. We further validated the performance of IsoFun on the genes ADAM15 and BCL2L1. IsoFun accurately differentiates the functions of respective isoforms of these two genes. Availability and implementation The code of IsoFun is available at http://mlda.swu.edu.cn/codes.php?name=IsoFun. Supplementary information Supplementary data are available at Bioinformatics online.
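
The bi-random walk idea mentioned in the abstract can be illustrated with a minimal, generic sketch. It is not IsoFun's tailored walk, whose details the abstract does not give: the two-layer setup (an isoform similarity matrix and a GO-term similarity matrix), the averaging of the left and right walks, the function names and the toy matrices are all assumptions made for illustration.

```python
# Generic bi-random walk sketch (an assumption, not the IsoFun algorithm).
# A: isoform-isoform similarity, B: term-term similarity,
# R0: known isoform-term associations used as the restart matrix.

def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def row_normalise(M):
    """Scale every row to sum to 1 (all-zero rows stay zero)."""
    out = []
    for row in M:
        s = sum(row)
        out.append([v / s if s else 0.0 for v in row])
    return out

def bi_random_walk(A, B, R0, alpha=0.5, n_steps=50):
    """Average a walk on the isoform side (A @ R) and on the term side (R @ B),
    pulling back towards R0 with weight (1 - alpha) at every step."""
    A, B = row_normalise(A), row_normalise(B)
    R = [row[:] for row in R0]
    for _ in range(n_steps):
        left = matmul(A, R)    # one step on the isoform network
        right = matmul(R, B)   # one step on the term network
        R = [[alpha * 0.5 * (l + r) + (1 - alpha) * r0
              for l, r, r0 in zip(lrow, rrow, r0row)]
             for lrow, rrow, r0row in zip(left, right, R0)]
    return R

# Toy data: isoforms 0 and 1 are similar, isoform 2 is unrelated to both.
isoform_sim = [[1.0, 0.9, 0.0],
               [0.9, 1.0, 0.0],
               [0.0, 0.0, 1.0]]
term_sim = [[1.0, 0.0],
            [0.0, 1.0]]
known = [[1.0, 0.0],   # isoform 0 is annotated with term 0
         [0.0, 0.0],   # isoform 1 has no annotation
         [0.0, 1.0]]   # isoform 2 is annotated with term 1
scores = bi_random_walk(isoform_sim, term_sim, known)
```

In this toy case the unannotated isoform 1 picks up a positive score for term 0 from its similar neighbour, and none for term 1, which is only attached to the unrelated isoform 2.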

Differentiating isoform functions with collaborative matrix factorization

Bioinformatics ◽

10.1093/bioinformatics/btz847 ◽

2019 ◽

Author(s):

Keyao Wang ◽

Jun Wang ◽

Carlotta Domeniconi ◽

Xiangliang Zhang ◽

Guoxian Yu

Keyword(s):

Matrix Factorization ◽

Characteristic Curve ◽

Function Prediction ◽

Low Rank ◽

Data Matrix ◽

Supplementary Information ◽

Genomic Databases ◽

Gene Level ◽

The Matrix ◽

Level Function

Abstract Motivation Isoforms are alternatively spliced mRNAs of genes. They can be translated into different functional proteoforms, and thus greatly increase the functional diversity of protein variants (or proteoforms). Differentiating the functions of isoforms (or proteoforms) helps in understanding the underlying pathology of various complex diseases at a deeper granularity. Since existing functional genomic databases uniformly record the annotations at the gene level, and rarely record the annotations at the isoform level, differentiating isoform functions is more challenging than the traditional gene-level function prediction. Results Several approaches have been proposed to differentiate the functions of isoforms. They generally follow the multi-instance learning paradigm by viewing each gene as a bag and the spliced isoforms as its instances, and push functions of bags onto instances. These approaches implicitly assume the collected annotations of genes are complete and only integrate multiple RNA-seq datasets. As such, they have compromised performance. We propose a data integrative solution (called DisoFun) to Differentiate isoform Functions with collaborative matrix factorization. DisoFun assumes the functional annotations of genes are aggregated from those of key isoforms. It collaboratively factorizes the isoform data matrix and gene-term data matrix (storing Gene Ontology (GO) annotations of genes) into low-rank matrices to simultaneously explore the latent key isoforms, and achieve function prediction by aggregating predictions to their originating genes. In addition, it leverages the PPI network and GO structure to further coordinate the matrix factorization. Extensive experimental results show that DisoFun improves the AUROC (area under the receiver-operating characteristic curve) and AUPRC (area under the precision-recall curve) of existing solutions by at least 7.7% and 28.9%, respectively. We further investigated DisoFun on four exemplar genes (LMNA, ADAM15, BCL2L1, and CFLAR) with known functions at the isoform level, and observed that DisoFun can differentiate functions of their isoforms with 90.5% accuracy. Availability The code of DisoFun is available at mlda.swu.edu.cn/codes.php?name=DisoFun. Supplementary information Supplementary data are available at Bioinformatics online.

Accurate and efficient gene function prediction using a multi-bacterial network

Bioinformatics ◽

10.1093/bioinformatics/btaa885 ◽

2020 ◽

Author(s):

Jeffrey N Law ◽

Shiv D Kale ◽

T M Murali

Keyword(s):

Gene Function ◽

Bacterial Species ◽

Heterogeneous Data ◽

Function Prediction ◽

Label Propagation ◽

Supplementary Information ◽

Gene Function Prediction ◽

Functional Annotations ◽

A Genome ◽

Multiple Species

Abstract Motivation Nearly 40% of the genes in sequenced genomes have no experimentally or computationally derived functional annotations. To fill this gap, we seek to develop methods for network-based gene function prediction that can integrate heterogeneous data for multiple species with experimentally based functional annotations and systematically transfer them to newly sequenced organisms on a genome-wide scale. However, the large sizes of such networks pose a challenge for the scalability of current methods. Results We develop a label propagation algorithm called FastSinkSource. By formally bounding its rate of progress, we decrease the running time by a factor of 100 without sacrificing accuracy. We systematically evaluate many approaches to construct multi-species bacterial networks and apply FastSinkSource and other state-of-the-art methods to these networks. We find that the most accurate and efficient approach is to pre-compute annotation scores for species with experimental annotations, and then to transfer them to other organisms. In this manner, FastSinkSource runs in under 3 min for 200 bacterial species. Availability and implementation An implementation of our framework and all data used in this research are available at https://github.com/Murali-group/multi-species-GOA-prediction. Supplementary information Supplementary data are available at Bioinformatics online.
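
For readers unfamiliar with label propagation, the general technique can be sketched in a few lines. This is a generic illustration, not FastSinkSource itself: the bounding argument that gives the reported speed-up is not reproduced, and the function name, the fixed iteration count and the toy network are assumptions.

```python
# Generic label propagation sketch (an assumption, not FastSinkSource).
# Annotated genes keep fixed scores (1 for positive, 0 for negative examples);
# every other gene repeatedly takes the weighted mean of its neighbours.

def propagate_labels(adj, positives, negatives, n_iter=200):
    """adj maps each node to a dict {neighbour: edge weight}."""
    scores = {u: 1.0 if u in positives else 0.0 for u in adj}
    free = [u for u in adj if u not in positives and u not in negatives]
    for _ in range(n_iter):
        updated = dict(scores)
        for u in free:
            total = sum(adj[u].values())
            if total > 0:
                updated[u] = sum(w * scores[v] for v, w in adj[u].items()) / total
        scores = updated
    return scores

# Toy chain: pos - a - b - neg, all edges with weight 1.
network = {
    "pos": {"a": 1.0},
    "a": {"pos": 1.0, "b": 1.0},
    "b": {"a": 1.0, "neg": 1.0},
    "neg": {"b": 1.0},
}
scores = propagate_labels(network, positives={"pos"}, negatives={"neg"})
```

On this chain the unlabelled node next to the positive example ends up with the higher score, which is the behaviour a network-based function predictor relies on.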

Accurate and Efficient Gene Function Prediction using a Multi-Bacterial Network

10.1101/646687 ◽

2019 ◽

Cited By ~ 1

Author(s):

Jeffrey Law ◽

Shiv Kale ◽

T. M. Murali

Keyword(s):

Gene Function ◽

Bacterial Species ◽

Heterogeneous Data ◽

Function Prediction ◽

Label Propagation ◽

Supplementary Information ◽

Supplementary File ◽

Gene Function Prediction ◽

Functional Annotations ◽

Multiple Species

Abstract Motivation Nearly 40% of the genes in sequenced genomes have no experimentally- or computationally-derived functional annotations. To fill this gap, we seek to develop methods for network-based gene function prediction that can integrate heterogeneous data for multiple species with experimentally-based functional annotations and systematically transfer them to newly-sequenced organisms on a genomewide scale. However, the large size of such networks poses a challenge for the scalability of current methods. Results We develop a label propagation algorithm called FastSinkSource. By formally bounding its rate of progress, we decrease the running time by a factor of 100 without sacrificing accuracy. We systematically evaluate many approaches to construct multi-species bacterial networks and apply FastSinkSource and other state-of-the-art methods to these networks. We find that the most accurate and efficient approach is to pre-compute annotation scores for species with experimental annotations, and then to transfer them to other organisms. In this manner, FastSinkSource runs in under three minutes for 200 bacterial species. Availability and Implementation Python implementations of each algorithm and all data used in this research are available at http://bioinformatics.cs.vt.edu/~jeffl/supplements/[email protected] Supplementary Information A supplementary file is available at bioRxiv online.

sepal: identifying transcript profiles with spatial patterns by diffusion-based modeling

Bioinformatics ◽

10.1093/bioinformatics/btab164 ◽

2021 ◽

Author(s):

Alma Andersson ◽

Joakim Lundeberg

Keyword(s):

Spatial Patterns ◽

Expression Profiles ◽

Synthetic Data ◽

Real Data ◽

Cell Types ◽

Statistical Hypothesis ◽

Supplementary Information ◽

Statistical Hypothesis Testing ◽

Transcriptomics Data ◽

Transcript Profiles

Abstract Motivation Collection of spatial signals in large numbers has become a routine task in multiple omics-fields, but parsing of these rich datasets still poses certain challenges. In whole or near-full transcriptome spatial techniques, spurious expression profiles are intermixed with those exhibiting an organized structure. To distinguish profiles with spatial patterns from the background noise, a metric that enables quantification of spatial structure is desirable. Current methods designed for similar purposes tend to be built around a framework of statistical hypothesis testing, hence we were compelled to explore a fundamentally different strategy. Results We propose an unexplored approach to analyze spatial transcriptomics data, simulating diffusion of individual transcripts to extract genes with spatial patterns. The method performed as expected when presented with synthetic data. When applied to real data, it identified genes with distinct spatial profiles, involved in key biological processes or characteristic of certain cell types. Compared to existing methods, ours seemed to be less informed by the genes’ expression levels and showed better time performance when run with multiple cores. Availability and implementation Open-source Python package with a command line interface (CLI), freely available at https://github.com/almaan/sepal under an MIT licence. A mirror of the GitHub repository can be found at Zenodo, doi: 10.5281/zenodo.4573237. Supplementary information Supplementary data are available at Bioinformatics online.
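
One way to turn simulated diffusion into a score for spatial structure is to count how many steps a profile needs to become flat. The sketch below makes that assumption explicit and is not the sepal implementation: it uses a one-dimensional row of spots instead of a tissue grid, and the function name, diffusion rate and convergence threshold are all illustrative choices.

```python
# Diffusion-time sketch (an assumption, not the sepal package).
# A profile is a list of expression values along a row of spots.

def diffusion_time(profile, rate=0.2, eps=1e-3, max_steps=100000):
    """Count explicit finite-difference diffusion steps until every spot is
    within eps of the mean. Boundaries are reflecting, so the mean is kept."""
    u = [float(v) for v in profile]
    mean = sum(u) / len(u)
    for step in range(max_steps):
        if max(abs(v - mean) for v in u) < eps:
            return step
        padded = [u[0]] + u + [u[-1]]  # copy edge values: no flux at the ends
        u = [padded[i + 1] + rate * (padded[i] - 2 * padded[i + 1] + padded[i + 2])
             for i in range(len(u))]
    return max_steps

structured = [1] * 8 + [0] * 8   # one contiguous expressed region
scattered = [1, 0] * 8           # same total expression, alternating spots
t_structured = diffusion_time(structured)
t_scattered = diffusion_time(scattered)
```

Both toy profiles have the same mean expression, but the contiguous region takes longer to flatten than the alternating one, so a longer diffusion time points to more spatial structure under this assumption.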

Genome-wide inferring gene–phenotype relationship by walking on the heterogeneous network

Bioinformatics ◽

10.1093/bioinformatics/btq108 ◽

2010 ◽

Vol 26 (9) ◽

pp. 1219-1224 ◽

Cited By ~ 238

Author(s):

Yongjin Li ◽

Jagdish C. Patra

Keyword(s):

Heterogeneous Network ◽

Gene Network ◽

Genetic Diseases ◽

Supplementary Information ◽

Disease Genes ◽

Phenotypic Data ◽

Disease Associations ◽

Improved Performance ◽

Leave One Out ◽

Phenotype Network

Abstract Motivation: Clinical diseases are characterized by distinct phenotypes. To identify disease genes is to elucidate the gene–phenotype relationships. Mutations in functionally related genes may result in similar phenotypes. It is reasonable to predict disease-causing genes by integrating phenotypic data and genomic data. Some genetic diseases are genetically or phenotypically similar. They may share common pathogenetic mechanisms. Identifying the relationship between diseases will facilitate better understanding of the pathogenetic mechanism of diseases. Results: In this article, we constructed a heterogeneous network by connecting the gene network and phenotype network using the phenotype–gene relationship information from the OMIM database. We extended the random walk with restart algorithm to the heterogeneous network. The algorithm prioritizes the genes and phenotypes simultaneously. We use leave-one-out cross-validation to evaluate the ability to find the gene–phenotype relationship. The results showed improved performance over previous work. We also used the algorithm to disclose hidden disease associations that cannot be found by gene network or phenotype network alone. We identified 18 hidden disease associations, most of which were supported by literature evidence. Availability: The MATLAB code of the program is available at http://www3.ntu.edu.sg/home/aspatra/research/Yongjin_BI2010.zip Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.
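
The random walk with restart that this work builds on can be sketched on a single network. The heterogeneous extension described in the abstract (coupling a gene network and a phenotype network) is not reproduced here; the function name, the restart probability of 0.3 and the four-gene toy network are assumptions for illustration.

```python
# Random walk with restart on one network (a simplified sketch; the paper's
# heterogeneous gene-phenotype extension is not implemented here).

def random_walk_with_restart(adj, seeds, restart=0.3, tol=1e-10, max_iter=1000):
    """adj maps each node to {neighbour: weight}; seeds is a set of start nodes.
    Iterates p <- (1 - restart) * W p + restart * p0 until p stops changing.
    Assumes every node has at least one neighbour, so probability is conserved."""
    nodes = sorted(adj)
    out_weight = {u: sum(adj[u].values()) for u in nodes}
    p0 = {u: (1.0 / len(seeds) if u in seeds else 0.0) for u in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {u: restart * p0[u] for u in nodes}
        for u in nodes:
            if out_weight[u] == 0:
                continue
            for v, w in adj[u].items():
                # Move a share of u's probability to v, proportional to the edge weight.
                nxt[v] += (1 - restart) * p[u] * w / out_weight[u]
        change = sum(abs(nxt[u] - p[u]) for u in nodes)
        p = nxt
        if change < tol:
            break
    return p

# Toy chain of genes: g1 - g2 - g3 - g4, with g1 as the known disease gene.
network = {
    "g1": {"g2": 1.0},
    "g2": {"g1": 1.0, "g3": 1.0},
    "g3": {"g2": 1.0, "g4": 1.0},
    "g4": {"g3": 1.0},
}
scores = random_walk_with_restart(network, seeds={"g1"})
ranking = sorted(scores, key=scores.get, reverse=True)
```

Candidates are then prioritized by their steady-state probability; in the toy chain, genes closer to the seed rank higher.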

Robust partial reference-free cell composition estimation from tissue expression

Bioinformatics ◽

10.1093/bioinformatics/btaa184 ◽

2020 ◽

Vol 36 (11) ◽

pp. 3431-3438

Author(s):

Ziyi Li ◽

Zhenxing Guo ◽

Ying Cheng ◽

Peng Jin ◽

Hao Wu

Keyword(s):

Expression Profiles ◽

Gene Expression Profiles ◽

Real Data ◽

Estimation Procedure ◽

Free Cell ◽

Biological Information ◽

Supplementary Information ◽

Tissue Samples ◽

Cell Composition ◽

Heterogeneous Tissues

Abstract Motivation In the analysis of high-throughput omics data from tissue samples, estimating and accounting for cell composition have been recognized as important steps. High cost, intensive labor requirements and technical limitations hinder the cell composition quantification using cell-sorting or single-cell technologies. Computational methods for cell composition estimation are available, but they are either limited by the availability of a reference panel or suffer from low accuracy. Results We introduce TOols for the Analysis of heterogeneouS Tissues TOAST/-P and TOAST/+P, two partial reference-free algorithms for estimating cell composition of heterogeneous tissues based on their gene expression profiles. TOAST/-P and TOAST/+P incorporate additional biological information, including cell-type-specific markers and prior knowledge of compositions, in the estimation procedure. Extensive simulation studies and real data analyses demonstrate that the proposed methods provide more accurate and robust cell composition estimation than existing methods. Availability and implementation The proposed methods TOAST/-P and TOAST/+P are implemented as part of the R/Bioconductor package TOAST at https://bioconductor.org/packages/TOAST. Contact [email protected] or [email protected] Supplementary information Supplementary data are available at Bioinformatics online.

Modules in Biological Networks

Bioinformatics ◽

10.4018/978-1-4666-3604-0.ch034 ◽

2013 ◽

pp. 637-663

Author(s):

Bing Zhang ◽

Zhiao Shi

Keyword(s):

Biological Networks ◽

Protein Complex ◽

Protein Function ◽

Protein Function Prediction ◽

Single Gene ◽

Biological Systems ◽

Function Prediction ◽

Diverse Group ◽

Biological Studies ◽

Gene Level

One of the most prominent properties of networks representing complex systems is modularity. Network-based module identification has captured the attention of a diverse group of scientists from various domains and a variety of methods have been developed. The ability to decompose complex biological systems into modules allows the use of modules rather than individual genes as units in biological studies. A modular view is shaping research methods in biology. Module-based approaches have found broad applications in protein complex identification, protein function prediction, protein expression prediction, as well as disease studies. Compared to single gene-level analyses, module-level analyses offer higher robustness and sensitivity. More importantly, module-level analyses can lead to a better understanding of the design and organization of complex biological systems.

Breakdown of multiple sclerosis genetics to identify an integrated disease network and potential variant mechanisms

Physiological Genomics ◽

10.1152/physiolgenomics.00120.2018 ◽

2019 ◽

Vol 51 (11) ◽

pp. 562-577

Author(s):

C. Joy Shepard ◽

Sara G. Cline ◽

David Hinds ◽

Seyedehameneh Jahanbakhsh ◽

Jeremy W. Prokop

Keyword(s):

Multiple Sclerosis ◽

Immune Cell ◽

Expression Profiles ◽

Missense Variant ◽

Cell Types ◽

Tissue Expression ◽

Specific Expression ◽

Genomic Databases ◽

Genetic Traits ◽

Immune System Dysfunction

Genetics of multiple sclerosis (MS) are highly polygenic with few insights into mechanistic associations with pathology. In this study, we assessed MS genetics through linkage disequilibrium and missense variant interpretation to yield an MS gene network. This network of 96 genes was taken through pathway analysis, tissue expression profiles, single cell expression segregation, expression quantitative trait loci (eQTLs), genome annotations, transcription factor (TF) binding profiles, structural genome looping, and overlap with additional associated genetic traits. This work revealed immune system dysfunction, nerve cell myelination, energetic control, transcriptional regulation, and variants that overlap multiple autoimmune disorders. Tissue-specific expression and eQTLs of MS genes implicate multiple immune cell types including macrophages, neutrophils, and T cells, while the genes in neural cell types enrich for oligodendrocyte and myelin sheath biology. There are eQTLs in linkage with lead MS variants in 25 genes including the multitissue eQTL, rs9271640, for HLA-DRB1/DRB5. Using multiple functional genomic databases, we identified noncoding variants that disrupt TF binding for GABPA, CTCF, EGR1, YY1, SPI1, CLOCK, ARNTL, BACH1, and GFI1. Overall, this paper suggests multiple genetic mechanisms for MS associated variants while highlighting the importance of a systems biology and network approach when elucidating intersections of the immune and nervous system.

The ortholog conjecture revisited: the value of orthologs and paralogs in function prediction

Bioinformatics ◽

10.1093/bioinformatics/btaa468 ◽

2020 ◽

Vol 36 (Supplement_1) ◽

pp. i219-i226 ◽

Cited By ~ 2

Author(s):

Moses Stamboulian ◽

Rafael F Guerrero ◽

Matthew W Hahn ◽

Predrag Radivojac

Keyword(s):

Prediction Accuracy ◽

Homo Sapiens ◽

Computational Prediction ◽

Function Prediction ◽

Supplementary Information ◽

Orthologous Genes ◽

Homologous Genes ◽

Cast Doubt ◽

Prediction Of Function ◽

Function Transfer

Abstract Motivation The computational prediction of gene function is a key step in making full use of newly sequenced genomes. Function is generally predicted by transferring annotations from homologous genes or proteins for which experimental evidence exists. The ‘ortholog conjecture’ proposes that orthologous genes should be preferred when making such predictions, as they evolve functions more slowly than paralogous genes. Previous research has provided little support for the ortholog conjecture, though the incomplete nature of the data cast doubt on the conclusions. Results We use experimental annotations from over 40 000 proteins, drawn from over 80 000 publications, to revisit the ortholog conjecture in two pairs of species: (i) Homo sapiens and Mus musculus and (ii) Saccharomyces cerevisiae and Schizosaccharomyces pombe. By making a distinction between questions about the evolution of function versus questions about the prediction of function, we find strong evidence against the ortholog conjecture in the context of function prediction, though questions about the evolution of function remain difficult to address. In both pairs of species, we quantify the amount of information that would be ignored if paralogs are discarded, as well as the resulting loss in prediction accuracy. Taken as a whole, our results support the view that the types of homologs used for function transfer are largely irrelevant to the task of function prediction. Maximizing the amount of data used for this task, regardless of whether it comes from orthologs or paralogs, is most likely to lead to higher prediction accuracy. Availability and implementation https://github.com/predragradivojac/oc. Supplementary information Supplementary data are available at Bioinformatics online.

The CAFA challenge reports improved protein function prediction and new functional annotations for hundreds of genes through experimental screens

Genome Biology ◽

10.1186/s13059-019-1835-8 ◽

2019 ◽

Vol 20 (1) ◽

Cited By ~ 41

Author(s):

Naihui Zhou ◽

Yuxiang Jiang ◽

Timothy R. Bergquist ◽

Alexandra J. Lee ◽

Balint Z. Kacsoh ◽

...

Keyword(s):

Protein Function ◽

Functional Annotation ◽

Protein Function Prediction ◽

Mutation Screening ◽

Function Prediction ◽

Long Term Memory ◽

Functional Annotations ◽

Genome Wide ◽

New Development ◽

Working Together

Abstract Background The Critical Assessment of Functional Annotation (CAFA) is an ongoing, global, community-driven effort to evaluate and improve the computational annotation of protein function. Results Here, we report on the results of the third CAFA challenge, CAFA3, that featured an expanded analysis over the previous CAFA rounds, both in terms of volume of data analyzed and the types of analysis performed. In a novel and major new development, computational predictions and assessment goals drove some of the experimental assays, resulting in new functional annotations for more than 1000 genes. Specifically, we performed experimental whole-genome mutation screening in Candida albicans and Pseudomonas aeruginosa genomes, which provided us with genome-wide experimental data for genes associated with biofilm formation and motility. We further performed targeted assays on selected genes in Drosophila melanogaster, which we suspected of being involved in long-term memory. Conclusion We conclude that while predictions of the molecular function and biological process annotations have slightly improved over time, those of the cellular component have not. Term-centric prediction of experimental annotations remains equally challenging; although the performance of the top methods is significantly better than the expectations set by baseline methods in C. albicans and D. melanogaster, it leaves considerable room and need for improvement. Finally, we report that the CAFA community now involves a broad range of participants with expertise in bioinformatics, biological experimentation, biocuration, and bio-ontologies, working together to improve functional annotation, computational function prediction, and our ability to manage big data in the era of large experimental screens.
