Deep-learning-based cell composition analysis from tissue expression profiles

Abstract: We present Scaden, a deep neural network for cell deconvolution that uses gene expression information to infer the cellular composition of tissues. Scaden is trained on single cell RNA-seq data to engineer discriminative features that confer robustness to bias and noise, making complex data preprocessing and feature selection unnecessary. We demonstrate that Scaden outperforms existing deconvolution algorithms in both precision and robustness. A single trained network reliably deconvolves bulk RNA-seq and microarray, human and mouse tissue expression data and leverages the combined information of multiple data sets. Due to this stability and flexibility, we surmise that deep learning will become an algorithmic mainstay for cell deconvolution of various data types. Scaden’s comprehensive software package is easy to use on novel as well as diverse existing expression datasets available in public resources, deepening the molecular and cellular understanding of developmental and disease processes.

Deep learning–based cell composition analysis from tissue expression profiles

Science Advances ◽

10.1126/sciadv.aba2619 ◽

2020 ◽

Vol 6 (30) ◽

pp. eaba2619 ◽

Cited By ~ 4

Author(s):

Kevin Menden ◽

Mohamed Marouf ◽

Sergio Oller ◽

Anupriya Dalmia ◽

Daniel Sumner Magruder ◽

...

Keyword(s):

Deep Learning ◽

Web Application ◽

Expression Profiles ◽

Tissue Expression ◽

Mouse Tissue ◽

Composition Analysis ◽

Complex Data ◽

Rna Seq ◽

Data Types ◽

Multiple Datasets

We present Scaden, a deep neural network for cell deconvolution that uses gene expression information to infer the cellular composition of tissues. Scaden is trained on single-cell RNA sequencing (RNA-seq) data to engineer discriminative features that confer robustness to bias and noise, making complex data preprocessing and feature selection unnecessary. We demonstrate that Scaden outperforms existing deconvolution algorithms in both precision and robustness. A single trained network reliably deconvolves bulk RNA-seq and microarray, human and mouse tissue expression data and leverages the combined information of multiple datasets. Because of this stability and flexibility, we surmise that deep learning will become an algorithmic mainstay for cell deconvolution of various data types. Scaden’s software package and web application are easy to use on new as well as diverse existing expression datasets available in public resources, deepening the molecular and cellular understanding of developmental and disease processes.
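
The abstract states only that the network is trained on single-cell RNA-seq data to predict tissue composition. One plausible way to build such training examples, sketched here purely as an assumption and not as the authors' documented procedure, is to mix single-cell profiles in random proportions and use the realized proportions as labels. The function name `simulate_bulk` and its parameters are hypothetical.

```python
import random

def simulate_bulk(cells_by_type, n_cells=100, rng=None):
    """Build one artificial bulk sample from single-cell profiles.

    cells_by_type maps a cell-type name to a list of expression vectors
    (one list of per-gene values per cell). This mixing scheme is an
    illustrative assumption, not taken from the abstract.
    """
    rng = rng or random.Random(0)
    types = sorted(cells_by_type)

    # Draw random target fractions that sum to 1.
    weights = [rng.random() for _ in types]
    total = sum(weights)
    fractions = {t: w / total for t, w in zip(types, weights)}

    n_genes = len(next(iter(cells_by_type.values()))[0])
    bulk = [0.0] * n_genes
    counts = {}
    for t in types:
        k = round(fractions[t] * n_cells)
        counts[t] = k
        # Sample cells with replacement and sum their expression.
        for _ in range(k):
            cell = rng.choice(cells_by_type[t])
            for g, value in enumerate(cell):
                bulk[g] += value

    # Labels are the fractions actually realized after rounding.
    drawn = sum(counts.values())
    labels = {t: counts[t] / drawn for t in types}
    return bulk, labels

cells = {"A": [[1.0, 0.0], [2.0, 0.0]], "B": [[0.0, 3.0]]}
bulk, labels = simulate_bulk(cells, n_cells=50, rng=random.Random(1))
```

A network trained on many such pairs would take `bulk` as input and `labels` as the regression target.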

Supporting Regenerative Medicine by Integrative Dimensionality Reduction

Methods of Information in Medicine ◽

10.3414/me11-02-0045 ◽

2012 ◽

Vol 51 (04) ◽

pp. 341-347 ◽

Cited By ~ 2

Author(s):

F. Mulas ◽

L. Zagar ◽

B. Zupan ◽

R. Bellazzi

Keyword(s):

Regenerative Medicine ◽

Dimensionality Reduction ◽

Predictive Accuracy ◽

Expression Profiles ◽

Training Data ◽

Data Sets ◽

Developmental Potential ◽

Multiple Data ◽

Reduction Methods ◽

Multiple Data Sets

Summary

Objective: The assessment of the developmental potential of stem cells is a crucial step towards their clinical application in regenerative medicine. It has been demonstrated that genome-wide expression profiles can predict the cellular differentiation stage by means of dimensionality reduction methods. Here we show that these techniques can be further strengthened to support decision making with i) a novel strategy for gene selection; ii) methods for combining the evidence from multiple data sets.

Methods: We propose to exploit dimensionality reduction methods for the selection of genes specifically activated in different stages of differentiation. To obtain an integrated predictive model, the expression values of the selected genes from multiple data sets are combined. We investigated distinct approaches that either aggregate data sets or use learning ensembles.

Results: We analyzed the performance of the proposed methods on six publicly available data sets. The selection procedure identified a reduced subset of genes whose expression values gave rise to an accurate stage prediction. The assessment of predictive accuracy demonstrated a high quality of predictions for most of the data integration methods presented.

Conclusion: The experimental results highlighted the main potentials of proposed approaches. These include the ability to predict the true staging by combining multiple training data sets when this could not be inferred from a single data source, and to focus the analysis on a reduced list of genes of similar predictive performance.

Preprocessing implementation for microarray (PRIM): an efficient method for processing cDNA microarray data

Physiological Genomics ◽

10.1152/physiolgenomics.2001.4.3.183 ◽

2001 ◽

Vol 4 (3) ◽

pp. 183-188 ◽

Cited By ~ 38

Author(s):

KOJI KADOTA ◽

RIKA MIKI ◽

HIDEMASA BONO ◽

KENTARO SHIMIZU ◽

YASUSHI OKAZAKI ◽

...

Keyword(s):

Cdna Microarray ◽

Expression Profiles ◽

Threshold Value ◽

Tissue Expression ◽

Mouse Tissue ◽

Analysis Software ◽

Profile Data ◽

Microarray Image ◽

Data Processing Method ◽

Tissue Expression Profile

cDNA microarray technology is useful for systematically analyzing the expression profiles of thousands of genes at once. Although many useful results inferred by using this technology and a hierarchical clustering method for statistical analysis have been confirmed using other methods, there are still questions about the reproducibility of the data. We have therefore developed a data processing method that very efficiently extracts reproducible data from the result of duplicate experiments. It is designed to automatically filter the raw results obtained from cDNA microarray image-analysis software. We optimize the threshold value for filtering the data by using the product of N and R, where N is the ratio of the number of spots that passed the filtering vs. the total number of spots, and R is the correlation coefficient for results obtained in the duplicate experiments. Using this method to process mouse tissue expression profile data that contain 1,881,600 points of analysis, we obtained clustered results more reasonable than those obtained using previously reported filtering methods.
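
The threshold selection described above can be sketched as a search for the cutoff that maximizes the product N × R. The abstract does not say which quantity is thresholded, so this sketch assumes a spot passes when its signal is at least the threshold in both duplicate experiments; the function names are hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)

def best_threshold(rep1, rep2, thresholds):
    """Return (threshold, N * R) for the candidate with the largest product.

    N is the fraction of spots that pass the filter and R is the
    correlation between duplicates over the passing spots.
    Assumption: a spot passes when both replicate values are >= threshold.
    """
    total = len(rep1)
    best = None
    for t in thresholds:
        kept = [(a, b) for a, b in zip(rep1, rep2) if a >= t and b >= t]
        if len(kept) < 3:
            continue  # too few spots for a meaningful correlation
        n_ratio = len(kept) / total
        r = pearson([a for a, _ in kept], [b for _, b in kept])
        score = n_ratio * r
        if best is None or score > best[1]:
            best = (t, score)
    return best

rep1 = [1, 2, 50, 60, 70, 80]
rep2 = [3, 1, 52, 58, 73, 79]
threshold, score = best_threshold(rep1, rep2, [0, 10, 55])
```

The product balances two opposing effects: a stricter threshold tends to raise R by discarding noisy spots but lowers N by discarding data.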

Verifying explainability of a deep learning tissue classifier trained on RNA-seq data

Scientific Reports ◽

10.1038/s41598-021-81773-9 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Melvyn Yap ◽

Rebecca L. Johnston ◽

Helena Foley ◽

Samual MacDonald ◽

Olga Kondrashova ◽

...

Keyword(s):

Deep Learning ◽

Differential Expression ◽

Expression Analysis ◽

Differential Expression Analysis ◽

Tissue Expression ◽

Superior Performance ◽

Rna Seq ◽

Widespread Acceptance ◽

And Function ◽

Deep Learning Model

Abstract: For complex machine learning (ML) algorithms to gain widespread acceptance in decision making, we must be able to identify the features driving the predictions. Explainability models allow transparency of ML algorithms; however, their reliability within high-dimensional data is unclear. To test the reliability of the explainability model SHapley Additive exPlanations (SHAP), we developed a convolutional neural network to predict tissue classification from Genotype-Tissue Expression (GTEx) RNA-seq data representing 16,651 samples from 47 tissues. Our classifier achieved an average F1 score of 96.1% on held-out GTEx samples. Using SHAP values, we identified the 2423 most discriminatory genes, of which 98.6% were also identified by differential expression analysis across all tissues. The SHAP genes reflected expected biological processes involved in tissue differentiation and function. Moreover, SHAP genes clustered tissue types with superior performance when compared to all genes, genes detected by differential expression analysis, or random genes. We demonstrate the utility and reliability of SHAP to explain a deep learning model and highlight the strengths of applying ML to transcriptome data.
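
For readers unfamiliar with the metric, an average F1 score over tissue classes can be computed as below. The abstract does not state which averaging was used; this sketch assumes an unweighted (macro) mean of per-class F1 scores, and the tissue labels are made up for illustration.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (macro averaging assumed)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy example with hypothetical labels: liver F1 = 2/3, lung F1 = 4/5.
y_true = ["liver", "liver", "lung", "lung"]
y_pred = ["liver", "lung", "lung", "lung"]
score = macro_f1(y_true, y_pred)  # (2/3 + 4/5) / 2 = 11/15
```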

FIDDLE: An integrative deep learning framework for functional genomic data inference

10.1101/081380 ◽

2016 ◽

Cited By ~ 7

Author(s):

Umet Eser ◽

L. Stirling Churchman

Keyword(s):

Deep Learning ◽

Data Types ◽

Unified Framework ◽

Transcription Start Sites ◽

Integrative Framework ◽

Sequencing Technologies ◽

Multiple Data ◽

Open Source Data ◽

Flexible Integration ◽

Data Inference

Abstract: Numerous advances in sequencing technologies have revolutionized genomics by generating many types of functional genomic data. Statistical tools have been developed to analyze individual data types, but strategies to integrate disparate datasets under a unified framework are lacking. Moreover, most analysis techniques rely heavily on feature selection and data preprocessing, which increase the difficulty of addressing biological questions through the integration of multiple datasets. Here, we introduce FIDDLE (Flexible Integration of Data with Deep LEarning), an open source, data-agnostic, flexible integrative framework that learns a unified representation from multiple data types to infer another data type. As a case study, we use multiple Saccharomyces cerevisiae genomic datasets to predict global transcription start sites (TSS) through the simulation of TSS-seq data. We demonstrate that one type of data can be inferred from other data types without manually specifying the relevant features and preprocessing. We show that models built from multiple genome-wide datasets perform profoundly better than models built from individual datasets. Thus, FIDDLE learns the complex synergistic relationship within individual datasets and, importantly, across datasets.

Massive single-cell RNA-seq analysis and imputation via deep learning

10.1101/315556 ◽

2018 ◽

Cited By ~ 9

Author(s):

Yue Deng ◽

Feng Bao ◽

Qionghai Dai ◽

Lani F. Wu ◽

Steven J. Altschuler

Keyword(s):

Deep Learning ◽

Single Cell ◽

Large Scale ◽

Expression Profiles ◽

Gene Expression Profiles ◽

Rna Seq ◽

Fine Grained ◽

Cell Type Composition ◽

Type Composition ◽

Cell Gene Expression

Recent advances in large-scale single cell RNA-seq enable fine-grained characterization of phenotypically distinct cellular states within heterogeneous tissues. We present scScope, a scalable deep-learning based approach that can accurately and rapidly identify cell-type composition from millions of noisy single-cell gene-expression profiles.

Deep learning of gene relationships from single cell time-course expression data

10.1101/2020.09.21.306332 ◽

2020 ◽

Author(s):

Ye Yuan ◽

Ziv Bar-Joseph

Keyword(s):

Time Series ◽

Deep Learning ◽

Single Cell ◽

Time Course ◽

Expression Profiles ◽

Regulatory Gene ◽

Supplementary Information ◽

Expression Data ◽

Rna Seq ◽

Time Course Data

Abstract

Motivation: Time-course gene expression data has been widely used to infer regulatory and signaling relationships between genes. Most of the widely used methods for such analysis were developed for bulk expression data. Single cell RNA-Seq (scRNA-Seq) data offers several advantages, including the large number of expression profiles available and the ability to focus on individual cells rather than averages. However, this data also raises new computational challenges.

Results: Using a novel encoding for scRNA-Seq expression data, we develop deep learning methods for interaction prediction from time-course data. Our methods use a supervised framework that represents the data as a 3D tensor and trains convolutional and recurrent neural networks (CNN and RNN) for predicting interactions. We tested our Time-course Deep Learning (TDL) models on five different time series scRNA-Seq datasets. As we show, TDL can accurately identify causal and regulatory gene-gene interactions and can also be used to assign new function to genes. TDL improves on prior methods for the above tasks and can be generally applied to new time series scRNA-Seq data.

Availability and Implementation: Freely available at https://github.com/xiaoyeye/[email protected]

Supplementary information: Supplementary data are available at XXX online.
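
The abstract says the data are represented as a 3D tensor but does not describe the encoding. As a hedged illustration only, one way to obtain such a tensor for a gene pair is to build a normalized 2D co-expression histogram at each time point and stack the histograms along the time axis. The function `pair_tensor`, the bin count, and the binning rule are all assumptions, not the authors' published encoding.

```python
def pair_tensor(expr_a, expr_b, bins=4):
    """Stack per-time-point 2D histograms of a gene pair into a 3D tensor.

    expr_a and expr_b are lists over time points; each element is a list
    of non-negative per-cell expression values for gene A or gene B.
    Returns a nested list of shape (time points, bins, bins).
    """
    max_value = max(
        max(max(cells) for cells in expr_a),
        max(max(cells) for cells in expr_b),
    )
    width = max_value / bins if max_value > 0 else 1.0

    tensor = []
    for cells_a, cells_b in zip(expr_a, expr_b):
        hist = [[0.0] * bins for _ in range(bins)]
        for a, b in zip(cells_a, cells_b):
            # Clamp the top edge into the last bin.
            i = min(int(a / width), bins - 1)
            j = min(int(b / width), bins - 1)
            hist[i][j] += 1.0
        n = len(cells_a)
        if n:
            # Normalize so each time slice sums to 1.
            hist = [[c / n for c in row] for row in hist]
        tensor.append(hist)
    return tensor

# Two time points, three and two cells respectively.
expr_a = [[0.0, 1.0, 4.0], [4.0, 4.0]]
expr_b = [[0.0, 1.0, 4.0], [0.0, 4.0]]
tensor = pair_tensor(expr_a, expr_b, bins=2)
```

A tensor of this shape could then be fed to a convolutional or recurrent network, with the time axis treated as the sequence dimension.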

Identification and characterization of key long non-coding RNAs in the mouse cochlea

10.1101/2020.07.10.197251 ◽

2020 ◽

Author(s):

Tal Koffler-Brill ◽

Shahar Taiber ◽

Alejandro Anaya ◽

Mor Bordeynik-Cohen ◽

Einat Rosen ◽

...

Keyword(s):

Inner Ear ◽

Auditory System ◽

Cellular Localization ◽

Expression Profiles ◽

Cell Types ◽

Tissue Expression ◽

Distinct Pattern ◽

Mouse Tissue ◽

Large Variability ◽

Non Coding Rnas

Abstract: The auditory system is a complex sensory network with an orchestrated multilayer regulatory program governing its development and maintenance. Accumulating evidence has implicated long non-coding RNAs (lncRNAs) as important regulators in numerous systems, as well as in pathological pathways. However, their function in the auditory system has yet to be explored. Using a set of specific criteria, we selected four lncRNAs expressed in the mouse cochlea, which are conserved in the human transcriptome and are relevant for inner ear function. Bioinformatic characterization demonstrated a lack of coding potential and an absence of evolutionary conservation that represent properties commonly shared by their class members. RNAscope analysis of the spatial and temporal expression profiles revealed specific localization to inner ear cells. Sub-cellular localization analysis presented a distinct pattern for each lncRNA and mouse tissue expression evaluation displayed a large variability in terms of level and location. Our findings establish the expression of specific lncRNAs in different cell types of the auditory system and present a potential pathway by which the lncRNA Gas5 acts in the inner ear. Studying lncRNAs and deciphering their functions may deepen our knowledge of inner ear physiology and morphology and may reveal the basis of as yet unresolved genetic hearing loss-related pathologies. Moreover, our experimental design may be employed as a reference for studying other inner ear-related lncRNAs, as well as lncRNAs expressed in other sensory systems.

Coupled Co-clustering-based Unsupervised Transfer Learning for the Integrative Analysis of Single-Cell Genomic Data

10.1101/2020.03.28.013938 ◽

2020 ◽

Author(s):

Pengcheng Zeng ◽

Jiaxuan WangWu ◽

Zhixiang Lin

Keyword(s):

Single Cell ◽

Transfer Learning ◽

Learning Algorithm ◽

Genomic Data ◽

Integrative Analysis ◽

Data Sets ◽

Clustering Methods ◽

Data Types ◽

Multiple Data ◽

Multiple Data Sets

Abstract: Unsupervised methods, such as clustering methods, are essential to the analysis of single-cell genomic data. Most current clustering methods are designed for one data type only, such as scRNA-seq, scATAC-seq or sc-methylation data alone, and a few are developed for the integrative analysis of multiple data types. Integrative analysis of multimodal single-cell genomic data sets leverages the power in multiple data sets and can deepen the biological insight. We propose a coupled co-clustering-based unsupervised transfer learning algorithm (coupleCoC) for the integrative analysis of multimodal single-cell data. Our proposed coupleCoC builds upon the information theoretic co-clustering framework. We applied coupleCoC for the integrative analysis of scATAC-seq and scRNA-seq data, sc-methylation and scRNA-seq data, and scRNA-seq data from mouse and human. We demonstrate that coupleCoC improves the overall clustering performance and matches the cell subpopulations across multimodal single-cell genomic data sets. The software and data sets are available at https://github.com/cuhklinlab/coupleCoC.

Variational autoencoders for cancer data integration: design principles and computational practice

10.1101/719542 ◽

2019 ◽

Cited By ~ 2

Author(s):

Nikola Simidjievski ◽

Cristian Bodnar ◽

Ifrah Tariq ◽

Paul Scherer ◽

Helena Andres-Terre ◽

...

Keyword(s):

Breast Cancer ◽

Clinical Data ◽

Molecular Taxonomy ◽

Patient Data ◽

Data Sets ◽

Data Types ◽

Cancer Data ◽

Learning Framework ◽

Multiple Data ◽

Multiple Data Sets

Abstract: International initiatives such as the Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) are collecting multiple data sets at different genome-scales with the aim to identify novel cancer bio-markers and predict patient survival. To analyse such data, several machine learning, bioinformatics and statistical methods have been applied, among them neural networks such as autoencoders. Although these models provide a good statistical learning framework to analyse multi-omic and/or clinical data, there is a distinct lack of work on how to integrate diverse patient data and identify the optimal design best suited to the available data.

In this paper, we investigate several autoencoder architectures that integrate a variety of cancer patient data types (e.g., multi-omics and clinical data). We perform extensive analyses of these approaches and provide a clear methodological and computational framework for designing systems that enable clinicians to investigate cancer traits and translate the results into clinical applications. We demonstrate how these networks can be designed, built and, in particular, applied to tasks of integrative analyses of heterogeneous breast cancer data. The results show that these approaches yield relevant data representations that, in turn, lead to accurate and stable diagnosis.
