A deep learning framework for predicting human essential genes from population and functional genomic data

Mapping Intimacies ◽

10.1101/2021.12.21.473690 ◽

2021 ◽

Author(s):

Troy M LaPolice ◽

Yi-Fei Huang

Keyword(s):

Deep Learning ◽

Genetic Disorders ◽

Genomic Data ◽

Essential Genes ◽

Computational Method ◽

Functional Genomic ◽

Loss Of Function ◽

Functional Genomic Data ◽

Learning Framework ◽

Limited Power

Being able to predict essential genes intolerant to loss-of-function (LOF) mutations can dramatically improve our ability to identify genes associated with genetic disorders. Numerous computational methods have recently been developed to predict human essential genes from population genomic data; however, the existing methods have limited power in pinpointing short essential genes due to the sparsity of polymorphisms in the human genome. Here we present an evolution-based deep learning model, DeepLOF, which integrates population and functional genomic data to improve gene essentiality prediction. Compared to previous methods, DeepLOF shows unmatched performance in predicting ClinGen haploinsufficient genes, mouse essential genes, and essential genes in human cell lines. Furthermore, DeepLOF discovers 109 potentially essential genes that are too short to be identified by previous methods. Altogether, DeepLOF is a powerful computational method to aid in the discovery of essential genes.

Faculty Opinions recommendation of Finding function: evaluation methods for functional genomic data.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.1044091.496329 ◽

2006 ◽

Author(s):

Russ Altman

Keyword(s):

Genomic Data ◽

Evaluation Methods ◽

Function Evaluation ◽

Functional Genomic ◽

Functional Genomic Data

Integrating large-scale functional genomic data to dissect the complexity of yeast regulatory networks

Nature Genetics ◽

10.1038/ng.167 ◽

2008 ◽

Vol 40 (7) ◽

pp. 854-861 ◽

Cited By ~ 361

Author(s):

Jun Zhu ◽

Bin Zhang ◽

Erin N Smith ◽

Becky Drees ◽

Rachel B Brem ◽

...

Keyword(s):

Regulatory Networks ◽

Large Scale ◽

Genomic Data ◽

Functional Genomic ◽

Functional Genomic Data

A deep learning framework for imputing missing values in genomic data

10.1101/406066 ◽

2018 ◽

Cited By ~ 9

Author(s):

Yeping Lina Qiu ◽

Hong Zheng ◽

Olivier Gevaert

Keyword(s):

Deep Learning ◽

Missing Values ◽

Gc Content ◽

Genomic Data ◽

Missing At Random ◽

Denoising Autoencoder ◽

K Nearest Neighbors ◽

Learning Framework ◽

Value Decomposition ◽

Pan Cancer

Abstract. Motivation: The presence of missing values is a frequent problem encountered in genomic data analysis. Lost data can be an obstacle to downstream analyses that require complete data matrices. State-of-the-art imputation techniques, including Singular Value Decomposition (SVD) and K-Nearest Neighbors (KNN) based methods, usually achieve good performance but are computationally expensive, especially for large datasets such as those involved in pan-cancer analysis. Results: This study describes a new method, a denoising autoencoder with partial loss (DAPL), as a deep learning based alternative for data imputation. Results on pan-cancer gene expression data and DNA methylation data from over 11,000 samples demonstrate significant improvement over a standard denoising autoencoder, both for missing-at-random cases with a range of missing percentages and for missing-not-at-random cases based on expression level and GC content. We discuss the advantages of DAPL over traditional imputation methods and show that it achieves comparable or better performance with less computational burden. Availability: https://github.com/gevaertlab/[email protected]
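The abstract does not define "partial loss". One plausible reading, treated here purely as an assumption, is a reconstruction loss computed only over entries that were actually observed, so that missing values contribute nothing to training. The function name `partial_mse` and the toy matrices below are hypothetical and are not taken from the DAPL code.

```python
import math


def partial_mse(x_true, x_recon):
    """Mean squared error over observed (non-NaN) entries of x_true only.

    Assumption: missing values in x_true are encoded as NaN. This is an
    illustrative reading of "partial loss", not the authors' implementation.
    """
    total, count = 0.0, 0
    for row_true, row_recon in zip(x_true, x_recon):
        for t, r in zip(row_true, row_recon):
            if math.isnan(t):
                continue  # skip entries with no ground truth
            total += (r - t) ** 2
            count += 1
    if count == 0:
        raise ValueError("no observed entries to score")
    return total / count


# Toy example: one of four entries is missing in the original data.
x_true = [[1.0, float("nan")], [3.0, 4.0]]
x_recon = [[1.5, 0.0], [3.0, 5.0]]
loss = partial_mse(x_true, x_recon)
```

Only the three observed entries enter the average here, so whatever the model outputs at the missing position does not affect the loss.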

Testing the Ortholog Conjecture with Comparative Functional Genomic Data from Mammals

PLoS Computational Biology ◽

10.1371/journal.pcbi.1002073 ◽

2011 ◽

Vol 7 (6) ◽

pp. e1002073 ◽

Cited By ~ 117

Author(s):

Nathan L. Nehrt ◽

Wyatt T. Clark ◽

Predrag Radivojac ◽

Matthew W. Hahn

Keyword(s):

Genomic Data ◽

Functional Genomic ◽

Functional Genomic Data

Chinese Glioma Genome Atlas (CGGA): A Comprehensive Resource with Functional Genomic Data for Chinese Glioma Patients

10.1101/2020.01.20.911982 ◽

2020 ◽

Cited By ~ 3

Author(s):

Zheng Zhao ◽

Ke’nan Zhang ◽

Qiangwei Wang ◽

Guanzhang Li ◽

Fan Zeng ◽

...

Keyword(s):

Dna Methylation ◽

Survival Data ◽

Messenger Rna ◽

Genomic Data ◽

Biological Research ◽

Analysis Tool ◽

Functional Genomic ◽

Functional Genomic Data ◽

Who Grade ◽

Genome Atlas

Abstract: Gliomas are the most common and malignant intracranial tumours in adults. Recent studies have shown that functional genomics greatly aids in the understanding of the pathophysiology and therapy of glioma. However, comprehensive genomic data and analysis platforms are relatively limited. In this study, we developed the Chinese Glioma Genome Atlas (CGGA, http://www.cgga.org.cn), a user-friendly data portal for storage and interactive exploration of multi-dimensional functional genomic data that includes nearly 2,000 primary and recurrent glioma samples from Chinese cohorts. CGGA currently provides access to whole-exome sequencing (286 samples), messenger RNA sequencing (1,018 samples) and microarray (301 samples), DNA methylation microarray (159 samples), and microRNA microarray (198 samples) data, as well as detailed clinical data (e.g., WHO grade, histological type, critical molecular genetic information, age, sex, chemoradiotherapy status and survival data). In addition, we developed an analysis tool to allow users to browse mutational, mRNA/microRNA expression, and DNA methylation profiles and perform survival and correlation analyses of specific glioma subtypes. CGGA greatly reduces the barriers between complex functional genomic data and glioma researchers who seek rapid, intuitive, and high-quality access to data resources and enables researchers to use these immeasurable data sources for biological research and clinical application. Importantly, the free provision of data will allow researchers to quickly generate and provide data to the research community.

Pairwise comparisons across species are problematic when analyzing functional genomic data

10.1101/107177 ◽

2017 ◽

Cited By ~ 4

Author(s):

Casey W. Dunn ◽

Felipe Zapata ◽

Catriona Munro ◽

Stefan Siebert ◽

Andreas Hejnol

Keyword(s):

Gene Expression ◽

Evolutionary Process ◽

Opportunity To Learn ◽

Genomic Data ◽

The Other ◽

Evolutionary Relationships ◽

Pairwise Comparisons ◽

Functional Genomic ◽

Functional Genomic Data ◽

Comparative Functional Genomics

Abstract: There is considerable interest in comparing functional genomic data across species. One goal of such work is to provide an integrated understanding of genome and phenotype evolution. Most comparative functional genomic studies have relied on multiple pairwise comparisons between species, an approach that does not incorporate information about the evolutionary relationships among species. The statistical problems that arise from not considering these relationships can lead pairwise approaches to the wrong conclusions, and are a missed opportunity to learn about biology that can only be understood in an explicit phylogenetic context. Here we examine two recently published studies that compare gene expression across species with pairwise methods, and find reason to question the original conclusions of both. One study interpreted pairwise comparisons of gene expression as support for the ortholog conjecture, the hypothesis that orthologs tend to be more similar than paralogs. The other study interpreted pairwise comparisons of embryonic gene expression across distantly related animals as evidence for a distinct evolutionary process that gave rise to phyla. In each study, distinct patterns of pairwise similarity among species were originally interpreted as evidence of particular evolutionary processes, but instead we find they reflect species relationships. These reanalyses concretely demonstrate the inadequacy of pairwise comparisons for analyzing functional genomic data across species. It will be critical to adopt phylogenetic comparative methods in future functional genomic work. Fortunately, phylogenetic comparative biology is also a rapidly advancing field with many methods that can be directly applied to functional genomic data.

Significance: Comparisons of genome function between species are providing important insight into the evolutionary origins of diversity. Here we demonstrate that comparative functional genomics studies can come to the wrong conclusions if they do not take the relationships of species into account and instead rely on pairwise comparisons between species, as is common practice. We re-examined two previously published studies and found problems with pairwise comparisons that draw both their original conclusions into question. One study had found support for the ortholog conjecture and the other had concluded that the evolution of gene expression was different between animal phyla than within them. Our results demonstrate that to answer evolutionary questions about genome function, it is critical to consider evolutionary relationships.

Continuous-trait probabilistic model for comparing multi-species functional genomic data

10.1101/283093 ◽

2018 ◽

Author(s):

Yang Yang ◽

Quanquan Gu ◽

Yang Zhang ◽

Takayo Sasaki ◽

Julianna Crivello ◽

...

Keyword(s):

Comparative Analysis ◽

Molecular Mechanisms ◽

Probabilistic Models ◽

Phenotypic Diversity ◽

Genomic Data ◽

Primate Species ◽

Functional Genomic ◽

Functional Genomic Data ◽

Continuous Trait ◽

A Genome

Summary: A large amount of multi-species functional genomic data from high-throughput assays are becoming available to help understand the molecular mechanisms for phenotypic diversity across species. However, continuous-trait probabilistic models, which are key to such comparative analysis, remain underexplored. Here we develop a new model, called phylogenetic hidden Markov Gaussian processes (Phylo-HMGP), to simultaneously infer heterogeneous evolutionary states of functional genomic features in a genome-wide manner. Both simulation studies and real data application demonstrate the effectiveness of Phylo-HMGP. Importantly, we applied Phylo-HMGP to analyze a new cross-species DNA replication timing (RT) dataset from the same cell type in five primate species (human, chimpanzee, orangutan, gibbon, and green monkey). We demonstrate that our Phylo-HMGP model enables discovery of genomic regions with distinct evolutionary patterns of RT. Our method provides a generic framework for comparative analysis of multi-species continuous functional genomic signals to help reveal regions with conserved or lineage-specific regulatory roles.

Continuous-Trait Probabilistic Model for Comparing Multi-species Functional Genomic Data

Cell Systems ◽

10.1016/j.cels.2018.05.022 ◽

2018 ◽

Vol 7 (2) ◽

pp. 208-218.e11 ◽

Cited By ~ 8

Author(s):

Yang Yang ◽

Quanquan Gu ◽

Yang Zhang ◽

Takayo Sasaki ◽

Julianna Crivello ◽

...

Keyword(s):

Probabilistic Model ◽

Genomic Data ◽

Functional Genomic ◽

Functional Genomic Data ◽

Continuous Trait

Pairwise comparisons across species are problematic when analyzing functional genomic data

Proceedings of the National Academy of Sciences ◽

10.1073/pnas.1707515115 ◽

2018 ◽

Vol 115 (3) ◽

pp. E409-E417 ◽

Cited By ~ 40

Author(s):

Casey W. Dunn ◽

Felipe Zapata ◽

Catriona Munro ◽

Stefan Siebert ◽

Andreas Hejnol

Keyword(s):

Gene Expression ◽

Evolutionary Process ◽

Opportunity To Learn ◽

Genomic Data ◽

Pairwise Comparisons ◽

Functional Genomic ◽

Functional Genomic Data ◽

Genomic Studies ◽

Compare Gene Expression ◽

Missed Opportunity

There is considerable interest in comparing functional genomic data across species. One goal of such work is to provide an integrated understanding of genome and phenotype evolution. Most comparative functional genomic studies have relied on multiple pairwise comparisons between species, an approach that does not incorporate information about the evolutionary relationships among species. The statistical problems that arise from not considering these relationships can lead pairwise approaches to the wrong conclusions and are a missed opportunity to learn about biology that can only be understood in an explicit phylogenetic context. Here, we examine two recently published studies that compare gene expression across species with pairwise methods, and find reason to question the original conclusions of both. One study interpreted pairwise comparisons of gene expression as support for the ortholog conjecture, the hypothesis that orthologs tend to have more similar attributes (expression in this case) than paralogs. The other study interpreted pairwise comparisons of embryonic gene expression across distantly related animals as evidence for a distinct evolutionary process that gave rise to phyla. In each study, distinct patterns of pairwise similarity among species were originally interpreted as evidence of particular evolutionary processes, but instead, we find that they reflect species relationships. These reanalyses concretely show the inadequacy of pairwise comparisons for analyzing functional genomic data across species. It will be critical to adopt phylogenetic comparative methods in future functional genomic work. Fortunately, phylogenetic comparative biology is also a rapidly advancing field with many methods that can be directly applied to functional genomic data.
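To make the contrast with pairwise comparisons concrete, the sketch below implements Felsenstein's phylogenetically independent contrasts, one standard phylogenetic comparative method. It is offered as an illustration only; the abstract does not say which methods the reanalyses used. The tree encoding, the function name `independent_contrasts`, and the trait values are all hypothetical, and branch lengths are assumed to be strictly positive.

```python
import math


def independent_contrasts(node):
    """Compute Felsenstein's independent contrasts on a bifurcating tree.

    A tip is a number (its trait value). An internal node is a pair
    ((child, branch_length), (child, branch_length)).
    Returns (estimated node value, extra branch length, list of contrasts).
    """
    if isinstance(node, (int, float)):
        return float(node), 0.0, []
    (left, v1), (right, v2) = node
    x1, extra1, c1 = independent_contrasts(left)
    x2, extra2, c2 = independent_contrasts(right)
    # Lengthen each branch to account for uncertainty in the child's estimate.
    v1 += extra1
    v2 += extra2
    contrast = (x1 - x2) / math.sqrt(v1 + v2)  # standardized contrast
    value = (x1 / v1 + x2 / v2) / (1 / v1 + 1 / v2)  # weighted average
    extra = v1 * v2 / (v1 + v2)
    return value, extra, c1 + c2 + [contrast]


# Hypothetical four-species tree with unit branch lengths: ((A, B), (C, D)).
tree = ((((1.0, 1.0), (3.0, 1.0)), 1.0), (((5.0, 1.0), (9.0, 1.0)), 1.0))
root_value, _, contrasts = independent_contrasts(tree)

n_species = 4
n_pairwise = n_species * (n_species - 1) // 2  # all species pairs
```

With four species there are six pairwise comparisons but only three contrasts, one per internal node. The six pairs share branches of the tree and so are not statistically independent, whereas the contrasts are independent under a Brownian-motion model of trait evolution.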

Fast, scalable prediction of deleterious noncoding variants from functional and population genomic data

10.1101/069682 ◽

2016 ◽

Cited By ~ 3

Author(s):

Yi-Fei Huang ◽

Brad Gulko ◽

Adam Siepel

Keyword(s):

Predictive Power ◽

Tissue Specificity ◽

Large Fraction ◽

Genomic Data ◽

Computational Method ◽

Functional Genomic Data ◽

Protein Coding ◽

Population Genomic ◽

Fitness Consequences ◽

Simple Neural Network

Abstract: Across many species, a large fraction of genetic variants that influence phenotypes of interest is located outside of protein-coding genes, yet existing methods for identifying such variants have poor predictive power. Here, we introduce a new computational method, called LINSIGHT, that substantially improves the prediction of noncoding nucleotide sites at which mutations are likely to have deleterious fitness consequences, and which therefore are likely to be phenotypically important. LINSIGHT combines a simple neural network for functional genomic data with a probabilistic model of molecular evolution. The method is fast and highly scalable, enabling it to exploit the “Big Data” available in modern genomics. We show that LINSIGHT outperforms the best available methods in identifying human noncoding variants associated with inherited diseases. In addition, we apply LINSIGHT to an atlas of human enhancers and show that the fitness consequences at enhancers depend on cell type, tissue specificity, and constraints at associated promoters.
