Deep Learning Enables Fast and Accurate Imputation of Gene Expression

A question of fundamental biological significance is to what extent the expression of a subset of genes can be used to recover the full transcriptome, with important implications for biological discovery and clinical application. To address this challenge, we propose two novel deep learning methods, PMI and GAIN-GTEx, for gene expression imputation. In order to increase the applicability of our approach, we leverage data from GTEx v8, a reference resource that has generated a comprehensive collection of transcriptomes from a diverse set of human tissues. We show that our approaches compare favorably to several standard and state-of-the-art imputation methods in terms of predictive performance and runtime in two case studies and two imputation scenarios. In comparison conducted on the protein-coding genes, PMI attains the highest performance in inductive imputation whereas GAIN-GTEx outperforms the other methods in in-place imputation. Furthermore, our results indicate strong generalization on RNA-Seq data from 3 cancer types across varying levels of missingness. Our work can facilitate a cost-effective integration of large-scale RNA biorepositories into genomic studies of disease, with high applicability across diverse tissue types.

Download Full-text

Gene Expression Imputation with Generative Adversarial Imputation Nets

10.1101/2020.06.09.141689 ◽

2020 ◽

Author(s):

Ramon Viñas ◽

Tiago Azevedo ◽

Eric R. Gamazon ◽

Pietro Liò

Keyword(s):

Gene Expression ◽

Large Scale ◽

Biological Significance ◽

Predictive Performance ◽

Cost Effective ◽

Rna Seq ◽

Comprehensive Collection ◽

Genomic Studies ◽

Biological Discovery ◽

Cancer Types

AbstractA question of fundamental biological significance is to what extent the expression of a subset of genes can be used to recover the full transcriptome, with important implications for biological discovery and clinical application. To address this challenge, we present GAIN-GTEx, a method for gene expression imputation based on Generative Adversarial Imputation Networks. In order to increase the applicability of our approach, we leverage data from GTEx v8, a reference resource that has generated a comprehensive collection of transcriptomes from a diverse set of human tissues. We compare our model to several standard and state-of-the-art imputation methods and show that GAIN-GTEx is significantly superior in terms of predictive performance and runtime. Furthermore, our results indicate strong generalisation on RNA-Seq data from 3 cancer types across varying levels of missingness. Our work can facilitate a cost-effective integration of large-scale RNA biorepositories into genomic studies of disease, with high applicability across diverse tissue types.

Download Full-text

A Joint Deep Learning Model for Simultaneous Batch Effect Correction, Denoising and Clustering in Single-Cell Transcriptomics

10.1101/2020.09.23.310003 ◽

2020 ◽

Cited By ~ 1

Author(s):

Justin Lakkis ◽

David Wang ◽

Yuanchao Zhang ◽

Gang Hu ◽

Kui Wang ◽

...

Keyword(s):

Gene Expression ◽

Deep Learning ◽

Single Cell ◽

Large Scale ◽

Nearest Neighbor ◽

Learning Model ◽

Batch Effect ◽

Marker Genes ◽

Deep Learning Model ◽

Variable Genes

AbstractRecent development of single-cell RNA-seq (scRNA-seq) technologies has led to enormous biological discoveries. As the scale of scRNA-seq studies increases, a major challenge in analysis is batch effect, which is inevitable in studies involving human tissues. Most existing methods remove batch effect in a low-dimensional embedding space. Although useful for clustering, batch effect is still present in the gene expression space, leaving downstream gene-level analysis susceptible to batch effect. Recent studies have shown that batch effect correction in the gene expression space is much harder than in the embedding space. Popular methods such as Seurat3.0 rely on the mutual nearest neighbor (MNN) approach to remove batch effect in the gene expression space, but MNN can only analyze two batches at a time and it becomes computationally infeasible when the number of batches is large. Here we present CarDEC, a joint deep learning model that simultaneously clusters and denoises scRNA-seq data, while correcting batch effect both in the embedding and the gene expression space. Comprehensive evaluations spanning different species and tissues showed that CarDEC consistently outperforms scVI, DCA, and MNN. With CarDEC denoising, those non-highly variable genes offer as much signal for clustering as the highly variable genes, suggesting that CarDEC substantially boosted information content in scRNA-seq. We also showed that trajectory analysis using CarDEC’s denoised and batch corrected expression as input revealed marker genes and transcription factors that are otherwise obscured in the presence of batch effect. CarDEC is computationally fast, making it a desirable tool for large-scale scRNA-seq studies.

Download Full-text

Genome-wide profiling of transcribed enhancers during macrophage activation

10.1101/163519 ◽

2017 ◽

Author(s):

Elena Denisenko ◽

Reto Guler ◽

Musa Mhlanga ◽

Harukazu Suzuki ◽

Frank Brombacher ◽

...

Keyword(s):

Gene Expression ◽

Transcriptional Activation ◽

Large Scale ◽

Transcriptional Control ◽

Macrophage Activation ◽

Transcriptional Responses ◽

Protein Coding ◽

Genome Wide ◽

Ifn Γ ◽

Cap Analysis

AbstractMacrophages are sentinel cells essential for tissue homeostasis and host defence. Owing to their plasticity, macrophages acquire a range of functional phenotypes in response to microenvironmental stimuli, of which M(IFN-γ) and M(IL-4/IL-13) are well-known for their opposing pro- and anti-inflammatory roles. Enhancers have emerged as regulatory DNA elements crucial for transcriptional activation of gene expression. Using cap analysis of gene expression and epigenetic data, we identify on large-scale transcribed enhancers in mouse macrophages, their time kinetics and target protein-coding genes. We observe an increase in target gene expression, concomitant with increasing numbers of associated enhancers and find that genes associated to many enhancers show a shift towards stronger enrichment for macrophage-specific biological processes. We infer enhancers that drive transcriptional responses of genes upon M(IFN-γ) and M(IL-4/IL-13) macrophage activation and demonstrate stimuli-specificity of regulatory associations. Finally, we show that enhancer regions are enriched for binding sites of inflammation-related transcription factors, suggesting a link between stimuli response and enhancer transcriptional control. Our study provides new insights into genome-wide enhancer-mediated transcriptional control of macrophage genes, including those implicated in macrophage activation, and offers a detailed genome-wide catalogue to further elucidate enhancer regulation in macrophages.

Download Full-text

Advanced Machine Learning Models for Large Scale Gene Expression Analysis in Cancer Classification: Deep Learning Versus Classical Models

Communications in Computer and Information Science - Big Data, Cloud and Applications ◽

10.1007/978-3-319-96292-4_17 ◽

2018 ◽

pp. 210-221

Author(s):

Imene Zenbout ◽

Souham Meshoul

Keyword(s):

Gene Expression ◽

Machine Learning ◽

Deep Learning ◽

Expression Analysis ◽

Large Scale ◽

Gene Expression Analysis ◽

Cancer Classification ◽

Learning Models ◽

Classical Models ◽

Machine Learning Models

Download Full-text

Deep-Kcr: accurate detection of lysine crotonylation sites using deep learning method

Briefings in Bioinformatics ◽

10.1093/bib/bbaa255 ◽

2020 ◽

Author(s):

Hao Lv ◽

Fu-Ying Dao ◽

Zheng-Xing Guan ◽

Hui Yang ◽

Yan-Wen Li ◽

...

Keyword(s):

Deep Learning ◽

Large Scale ◽

Short Term Memory ◽

Information Gain ◽

Independent Set ◽

Cost Effective ◽

Cellular Regulation ◽

Proposed Model ◽

Experimental Approaches ◽

Memory Network

Abstract As a newly discovered protein posttranslational modification, histone lysine crotonylation (Kcr) involved in cellular regulation and human diseases. Various proteomics technologies have been developed to detect Kcr sites. However, experimental approaches for identifying Kcr sites are often time-consuming and labor-intensive, which is difficult to widely popularize in large-scale species. Computational approaches are cost-effective and can be used in a high-throughput manner to generate relatively precise identification. In this study, we develop a deep learning-based method termed as Deep-Kcr for Kcr sites prediction by combining sequence-based features, physicochemical property-based features and numerical space-derived information with information gain feature selection. We investigate the performances of convolutional neural network (CNN) and five commonly used classifiers (long short-term memory network, random forest, LogitBoost, naive Bayes and logistic regression) using 10-fold cross-validation and independent set test. Results show that CNN could always display the best performance with high computational efficiency on large dataset. We also compare the Deep-Kcr with other existing tools to demonstrate the excellent predictive power and robustness of our method. Based on the proposed model, a webserver called Deep-Kcr was established and is freely accessible at http://lin-group.cn/server/Deep-Kcr.

Download Full-text

Mining influential genes based on deep learning

BMC Bioinformatics ◽

10.1186/s12859-021-03972-5 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Lingpeng Kong ◽

Yuanyuan Chen ◽

Fengjiao Xu ◽

Mingmin Xu ◽

Zutan Li ◽

...

Keyword(s):

Gene Expression ◽

Deep Learning ◽

Expression Profiling ◽

Large Scale ◽

Target Genes ◽

Expression Profiles ◽

Pearson Correlation ◽

Genomic Information ◽

Deep Mine ◽

Functional Connections

Abstract Background Currently, large-scale gene expression profiling has been successfully applied to the discovery of functional connections among diseases, genetic perturbation, and drug action. To address the cost of an ever-expanding gene expression profile, a new, low-cost, high-throughput reduced representation expression profiling method called L1000 was proposed, with which one million profiles were produced. Although a set of ~ 1000 carefully chosen landmark genes that can capture ~ 80% of information from the whole genome has been identified for use in L1000, the robustness of using these landmark genes to infer target genes is not satisfactory. Therefore, more efficient computational methods are still needed to deep mine the influential genes in the genome. Results Here, we propose a computational framework based on deep learning to mine a subset of genes that can cover more genomic information. Specifically, an AutoEncoder framework is first constructed to learn the non-linear relationship between genes, and then DeepLIFT is applied to calculate gene importance scores. Using this data-driven approach, we have re-obtained a landmark gene set. The result shows that our landmark genes can predict target genes more accurately and robustly than that of L1000 based on two metrics [mean absolute error (MAE) and Pearson correlation coefficient (PCC)]. This reveals that the landmark genes detected by our method contain more genomic information. Conclusions We believe that our proposed framework is very suitable for the analysis of biological big data to reveal the mysteries of life. Furthermore, the landmark genes inferred from this study can be used for the explosive amplification of gene expression profiles to facilitate research into functional connections.

Download Full-text

Modeling drug mechanism of action with large scale gene-expression profiles using GPAR, an artificial intelligence platform

BMC Bioinformatics ◽

10.1186/s12859-020-03915-6 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Shengqiao Gao ◽

Lu Han ◽

Dan Luo ◽

Gang Liu ◽

Zhiyong Xiao ◽

...

Keyword(s):

Gene Expression ◽

Deep Learning ◽

Large Scale ◽

Prediction Models ◽

Expression Profiles ◽

Gene Expression Profiles ◽

Gene Set Enrichment Analysis ◽

Genetic Profile ◽

Drug Induced ◽

Drug Mechanism

Abstract Background Querying drug-induced gene expression profiles with machine learning method is an effective way for revealing drug mechanism of actions (MOAs), which is strongly supported by the growth of large scale and high-throughput gene expression databases. However, due to the lack of code-free and user friendly applications, it is not easy for biologists and pharmacologists to model MOAs with state-of-art deep learning approach. Results In this work, a newly developed online collaborative tool, Genetic profile-activity relationship (GPAR) was built to help modeling and predicting MOAs easily via deep learning. The users can use GPAR to customize their training sets to train self-defined MOA prediction models, to evaluate the model performances and to make further predictions automatically. Cross-validation tests show GPAR outperforms Gene set enrichment analysis in predicting MOAs. Conclusion GPAR can serve as a better approach in MOAs prediction, which may facilitate researchers to generate more reliable MOA hypothesis.

Download Full-text

General Statistics of Stochastic Process of Gene Expression in Eukaryotic Cells

Genetics ◽

10.1093/genetics/161.3.1321 ◽

2002 ◽

Vol 161 (3) ◽

pp. 1321-1332 ◽

Cited By ~ 1

Author(s):

V A Kuznetsov ◽

G D Knott ◽

R F Bonner

Keyword(s):

Gene Expression ◽

Single Cell ◽

Large Scale ◽

Phenotypic Diversity ◽

Biological Significance ◽

Cell Types ◽

Distribution Model ◽

Monoallelic Expression ◽

Considerable Uncertainty ◽

Number Of Genes

Abstract Thousands of genes are expressed at such very low levels (≤1 copy per cell) that global gene expression analysis of rarer transcripts remains problematic. Ambiguity in identification of rarer transcripts creates considerable uncertainty in fundamental questions such as the total number of genes expressed in an organism and the biological significance of rarer transcripts. Knowing the distribution of the true number of genes expressed at each level and the corresponding gene expression level probability function (GELPF) could help resolve these uncertainties. We found that all observed large-scale gene expression data sets in yeast, mouse, and human cells follow a Pareto-like distribution model skewed by many low-abundance transcripts. A novel stochastic model of the gene expression process predicts the universality of the GELPF both across different cell types within a multicellular organism and across different organisms. This model allows us to predict the frequency distribution of all gene expression levels within a single cell and to estimate the number of expressed genes in a single cell and in a population of cells. A random “basal” transcription mechanism for protein-coding genes in all or almost all eukaryotic cell types is predicted. This fundamental mechanism might enhance the expression of rarely expressed genes and, thus, provide a basic level of phenotypic diversity, adaptability, and random monoallelic expression in cell populations.

Download Full-text

Human embryoid bodies as a novel system for genomic studies of functionally diverse cell types

10.1101/2021.06.16.448714 ◽

2021 ◽

Author(s):

Katherine Rhodes ◽

Kenneth A Barr ◽

Joshua M Popp ◽

Benjamin J Strober ◽

Alexis Battle ◽

...

Keyword(s):

Gene Expression ◽

Large Scale ◽

Developmental Trajectories ◽

Cell Types ◽

Embryoid Bodies ◽

Regulatory Processes ◽

Study Gene Expression ◽

Genomic Studies ◽

Functionally Diverse ◽

Affect Gene Expression

Most disease-associated loci, though located in putatively regulatory regions, have not yet been confirmed to affect gene expression. One reason for this could be that we have not examined gene expression in the most relevant cell types or conditions. Indeed, even large-scale efforts to study gene expression broadly across tissues are limited by the necessity of obtaining human samples post-mortem, and almost exclusively from adults. Thus, there is an acute need to expand gene regulatory studies in humans to the most relevant cell types, tissues, and states. We propose that embryoid bodies (EBs), which are organoids that contain a multitude of cell types in dynamic states, can provide an answer. Single cell RNA-sequencing now provides a way to interrogate developmental trajectories in EBs and enhance the potential to uncover dynamic regulatory processes that would be missed in studies of static adult tissue. Here, we examined the properties of the EB model for the purpose mapping inter-individual regulatory differences in a large variety of cell types.

Download Full-text

TreeVibes: Modern tools for Global Monitoring of Trees Against Borers

10.20944/preprints202011.0411.v1 ◽

2020 ◽

Author(s):

Iraklis Rigakis ◽

Ilyas Potamitis ◽

Nicolas Alexander Tatlas ◽

Stelios M. Potirakis ◽

Stavros Ntalampiras

Keyword(s):

Deep Learning ◽

Large Scale ◽

Cost Effective ◽

Automatic Monitoring ◽

Urban Environments ◽

Learning Approaches ◽

Learning Techniques ◽

Longhorn Beetles ◽

Cloud Server ◽

Wood Boring

Is there a wood-feeding insect inside a tree or wooden structure? We investigate several ways on how deep learning approaches can massively scan recordings of vibrations stemming from probed trees to infer their infestation state with wood-boring insects that feed and move inside wood. The recordings come from remotely controlled devices that sample the internal soundscape of trees on a 24/7 basis and wirelessly transmit brief recordings of the registered vibrations to a cloud server. We discuss the different sources of vibrations that can be picked up from trees in urban environments and how deep learning methods can focus on those originating from borers. Our goal is to match the problem of the accelerated—due to global trade and climate change— establishment of invasive xylophagus insects by increasing the capacity of inspection agencies. We aim at introducing permanent, cost-effective, automatic monitoring of trees based on deep learning techniques, in commodity entry point as well as in wild, urban and cultivated areas in order to effect large-scale, sustainable pest-risk analysis and management of wood boring insects such as those from the Cerambycidae family (longhorn beetles).

Download Full-text