scholarly journals Probabilistic Harmonization and Annotation of Single-cell Transcriptomics Data with Deep Generative Models

2019 ◽  
Author(s):  
Chenling Xu ◽  
Romain Lopez ◽  
Edouard Mehlman ◽  
Jeffrey Regier ◽  
Michael I. Jordan ◽  
...  

AbstractAs single-cell transcriptomics becomes a mainstream technology, the natural next step is to integrate the accumulating data in order to achieve a common ontology of cell types and states. However, owing to various nuisance factors of variation, it is not straightforward how to compare gene expression levels across data sets and how to automatically assign cell type labels in a new data set based on existing annotations. In this manuscript, we demonstrate that our previously developed method, scVI, provides an effective and fully probabilistic approach for joint representation and analysis of cohorts of single-cell RNA-seq data sets, while accounting for uncertainty caused by biological and measurement noise. We also introduce single-cell ANnotation using Variational Inference (scANVI), a semi-supervised variant of scVI designed to leverage any available cell state annotations — for instance when only one data set in a cohort is annotated, or when only a few cells in a single data set can be labeled using marker genes. We demonstrate that scVI and scANVI compare favorably to the existing methods for data integration and cell state annotation in terms of accuracy, scalability, and adaptability to challenging settings such as a hierarchical structure of cell state labels. We further show that different from existing methods, scVI and scANVI represent the integrated datasets with a single generative model that can be directly used for any probabilistic decision making task, using differential expression as our case study. scVI and scANVI are available as open source software and can be readily used to facilitate cell state annotation and help ensure consistency and reproducibility across studies.

2018 ◽  
Author(s):  
Brian Hie ◽  
Bryan Bryson ◽  
Bonnie Berger

AbstractResearchers are generating single-cell RNA sequencing (scRNA-seq) profiles of diverse biological systems1–4 and every cell type in the human body.5 Leveraging this data to gain unprecedented insight into biology and disease will require assembling heterogeneous cell populations across multiple experiments, laboratories, and technologies. Although methods for scRNA-seq data integration exist6,7, they often naively merge data sets together even when the data sets have no cell types in common, leading to results that do not correspond to real biological patterns. Here we present Scanorama, inspired by algorithms for panorama stitching, that overcomes the limitations of existing methods to enable accurate, heterogeneous scRNA-seq data set integration. Our strategy identifies and merges the shared cell types among all pairs of data sets and is orders of magnitude faster than existing techniques. We use Scanorama to combine 105,476 cells from 26 diverse scRNA-seq experiments across 9 different technologies into a single comprehensive reference, demonstrating how Scanorama can be used to obtain a more complete picture of cellular function across a wide range of scRNA-seq experiments.


2018 ◽  
Author(s):  
Wennan Chang ◽  
Changlin Wan ◽  
Xiaoyu Lu ◽  
Szu-wei Tu ◽  
Yifan Sun ◽  
...  

AbstractWe developed a novel deconvolution method, namely Inference of Cell Types and Deconvolution (ICTD) that addresses the fundamental issue of identifiability and robustness in current tissue data deconvolution problem. ICTD provides substantially new capabilities for omics data based characterization of a tissue microenvironment, including (1) maximizing the resolution in identifying resident cell and sub types that truly exists in a tissue, (2) identifying the most reliable marker genes for each cell type, which are tissue and data set specific, (3) handling the stability problem with co-linear cell types, (4) co-deconvoluting with available matched multi-omics data, and (5) inferring functional variations specific to one or several cell types. ICTD is empowered by (i) rigorously derived mathematical conditions of identifiable cell type and cell type specific functions in tissue transcriptomics data and (ii) a semi supervised approach to maximize the knowledge transfer of cell type and functional marker genes identified in single cell or bulk cell data in the analysis of tissue data, and (iii) a novel unsupervised approach to minimize the bias brought by training data. Application of ICTD on real and single cell simulated tissue data validated that the method has consistently good performance for tissue data coming from different species, tissue microenvironments, and experimental platforms. Other than the new capabilities, ICTD outperformed other state-of-the-art devolution methods on prediction accuracy, the resolution of identifiable cell, detection of unknown sub cell types, and assessment of cell type specific functions. The premise of ICTD also lies in characterizing cell-cell interactions and discovering cell types and prognostic markers that are predictive of clinical outcomes.


2019 ◽  
Vol 21 (5) ◽  
pp. 1581-1595 ◽  
Author(s):  
Xinlei Zhao ◽  
Shuang Wu ◽  
Nan Fang ◽  
Xiao Sun ◽  
Jue Fan

Abstract Single-cell RNA sequencing (scRNA-seq) has been rapidly developing and widely applied in biological and medical research. Identification of cell types in scRNA-seq data sets is an essential step before in-depth investigations of their functional and pathological roles. However, the conventional workflow based on clustering and marker genes is not scalable for an increasingly large number of scRNA-seq data sets due to complicated procedures and manual annotation. Therefore, a number of tools have been developed recently to predict cell types in new data sets using reference data sets. These methods have not been generally adapted due to a lack of tool benchmarking and user guidance. In this article, we performed a comprehensive and impartial evaluation of nine classification software tools specifically designed for scRNA-seq data sets. Results showed that Seurat based on random forest, SingleR based on correlation analysis and CaSTLe based on XGBoost performed better than others. A simple ensemble voting of all tools can improve the predictive accuracy. Under nonideal situations, such as small-sized and class-imbalanced reference data sets, tools based on cluster-level similarities have superior performance. However, even with the function of assigning ‘unassigned’ labels, it is still challenging to catch novel cell types by solely using any of the single-cell classifiers. This article provides a guideline for researchers to select and apply suitable classification tools in their analysis workflows and sheds some lights on potential direction of future improvement on classification tools.


Author(s):  
Xiaoyu Lu ◽  
Szu-Wei Tu ◽  
Wennan Chang ◽  
Changlin Wan ◽  
Jiashi Wang ◽  
...  

Abstract Deconvolution of mouse transcriptomic data is challenged by the fact that mouse models carry various genetic and physiological perturbations, making it questionable to assume fixed cell types and cell type marker genes for different data set scenarios. We developed a Semi-Supervised Mouse data Deconvolution (SSMD) method to study the mouse tissue microenvironment. SSMD is featured by (i) a novel nonparametric method to discover data set-specific cell type signature genes; (ii) a community detection approach for fixing cell types and their marker genes; (iii) a constrained matrix decomposition method to solve cell type relative proportions that is robust to diverse experimental platforms. In summary, SSMD addressed several key challenges in the deconvolution of mouse tissue data, including: (i) varied cell types and marker genes caused by highly divergent genotypic and phenotypic conditions of mouse experiment; (ii) diverse experimental platforms of mouse transcriptomics data; (iii) small sample size and limited training data source and (iv) capable to estimate the proportion of 35 cell types in blood, inflammatory, central nervous or hematopoietic systems. In silico and experimental validation of SSMD demonstrated its high sensitivity and accuracy in identifying (sub) cell types and predicting cell proportions comparing with state-of-the-arts methods. A user-friendly R package and a web server of SSMD are released via https://github.com/xiaoyulu95/SSMD.


2021 ◽  
Author(s):  
Nathanael Andrews ◽  
Martin Enge

Abstract CIM-seq is a tool for deconvoluting RNA-seq data from cell multiplets (clusters of two or more cells) in order to identify physically interacting cell in a given tissue. The method requires two RNAseq data sets from the same tissue: one of single cells to be used as a reference, and one of cell multiplets to be deconvoluted. CIM-seq is compatible with both droplet based sequencing methods, such as Chromium Single Cell 3′ Kits from 10x genomics; and plate based methods, such as Smartseq2. The pipeline consists of three parts: 1) Dissociation of the target tissue, FACS sorting of single cells and multiplets, and conventional scRNA-seq 2) Feature selection and clustering of cell types in the single cell data set - generating a blueprint of transcriptional profiles in the given tissue 3) Computational deconvolution of multiplets through a maximum likelihood estimation (MLE) to determine the most likely cell type constituents of each multiplet.


2019 ◽  
Author(s):  
Roman Hillje ◽  
Pier Giuseppe Pelicci ◽  
Lucilla Luzi

AbstractSummaryDespite the growing availability of sophisticated bioinformatic methods for the analysis of single-cell RNA-seq data, few tools exist that allow biologists without bioinformatic expertise to directly visualize and interact with their own data and results. Here, we present Cerebro (cell report browser), a Shiny- and Electron-based standalone desktop application for macOS and Windows, which allows investigation and inspection of pre-processed single-cell transcriptomics data without requiring bioinformatic experience of the user.Through an interactive and intuitive graphical interface, users can i) explore similarities and heterogeneity between samples and cells clusters in 2D or 3D projections such as t-SNE or UMAP, ii) display the expression level of single genes or genes sets of interest, iii) browse tables of most expressed genes and marker genes for each sample and cluster.We provide a simple example to show how Cerebro can be used and which are its capabilities. Through a focus on flexibility and direct access to data and results, we think Cerebro offers a collaborative framework for bioinformaticians and experimental biologists which facilitates effective interaction to shorten the gap between analysis and interpretation of the data.AvailabilityCerebro and example data sets are available at https://github.com/romanhaa/Cerebro. Similarly, the R packages cerebroApp and cerebroPrepare R packages are available at https://github.com/romanhaa/cerebroApp and https://github.com/romanhaa/cerebroPrepare, respectively. All components are released under the MIT License.


2020 ◽  
Author(s):  
Hy Vuong ◽  
Thao Truong ◽  
Tan Phan ◽  
Son Pham

AbstractMost widely used tools for finding marker genes in single cell data (SeuratT/NegBinom/Poisson, CellRanger, EdgeR, limmatrend) use a conventional definition of differentially expressed genes: genes with different mean expression values. However, in single-cell data, a cell population can be a mixture of many cell types/cell states, hence the mean expression of genes cannot represent the whole population. In addition, these tools assume that gene expression of a population belongs to a specific family of distribution. This assumption is often violated in single-cell data. In this work, we define marker genes of a cell population as genes that can be used to distinguish cells in the population from cells in other populations. Besides log-fold change, we devise a new metric to classify genes into up-regulated, down-regulated, and transitional states. In a benchmark for finding up-regulated and down-regulated genes, our tool outperforms all compared methods, including Seurat, ROTS, scDD, edgeR, MAST, limma, normal t-test, Wilcoxon and Kolmogorov–Smirnov test. Our method is much faster than all compared methods, therefore, enables interactive analysis for large single-cell data sets in BioTuring Browser. Venice algorithm is available within Signac package: https://github.com/bioturing/signac1).


2017 ◽  
Author(s):  
Giovanni Iacono ◽  
Elisabetta Mereu ◽  
Amy Guillaumet-Adkins ◽  
Roser Corominas ◽  
Ivon Cuscó ◽  
...  

AbstractSingle-cell RNA sequencing significantly deepened our insights into complex tissues and latest techniques are capable processing ten-thousands of cells simultaneously. With bigSCale, we provide an analytical framework being scalable to analyze millions of cells, addressing challenges of future large datasets. Unlike previous methods, bigSCale does not constrain data to fit an a priori-defined distribution and instead uses an accurate numerical model of noise. We evaluated the performance of bigSCale using a biological model of aberrant gene expression in patient derived neuronal progenitor cells and simulated datasets, which underlined its speed and accuracy in differential expression analysis. We further applied bigSCale to analyze 1.3 million cells from the mouse developing forebrain. Herein, we identified rare populations, such as Reelin positive Cajal-Retzius neurons, for which we determined a previously not recognized heterogeneity associated to distinct differentiation stages, spatial organization and cellular function. Together, bigSCale presents a perfect solution to address future challenges of large single-cell datasets.Extended AbstractSingle-cell RNA sequencing (scRNAseq) significantly deepened our insights into complex tissues by providing high-resolution phenotypes for individual cells. Recent microfluidic-based methods are scalable to ten-thousands of cells, enabling an unbiased sampling and comprehensive characterization without prior knowledge. Increasing cell numbers, however, generates extremely big datasets, which extends processing time and challenges computing resources. Current scRNAseq analysis tools are not designed to analyze datasets larger than from thousands of cells and often lack sensitivity and specificity to identify marker genes for cell populations or experimental conditions. With bigSCale, we provide an analytical framework for the sensitive detection of population markers and differentially expressed genes, being scalable to analyze millions of single cells. Unlike other methods that use simple or mixture probabilistic models with negative binomial, gamma or Poisson distributions to handle the noise and sparsity of scRNAseq data, bigSCale does not constrain the data to fit an a priori-defined distribution. Instead, bigSCale uses large sample sizes to estimate a highly accurate and comprehensive numerical model of noise and gene expression. The framework further includes modules for differential expression (DE) analysis, cell clustering and population marker identification. Moreover, a directed convolution strategy allows processing of extremely large data sets, while preserving the transcript information from individual cells.We evaluate the performance of bigSCale using a biological model for reduced or elevated gene expression levels. Specifically, we perform scRNAseq of 1,920 patient derived neuronal progenitor cells from Williams-Beuren and 7q11.23 microduplication syndrome patients, harboring a deletion or duplication of 7q11.23, respectively. The affected region contains 28 genes whose transcriptional levels vary in line with their allele frequency. BigSCale detects expression changes with respect to cells from a healthy donor and outperforms other methods for single-cell DE analysis in sensitivity. Simulated data sets, underline the performance of bigSCale in DE analysis as it is faster and more sensitive and specific than other methods. The probabilistic model of cell-distances within bigSCale is further suitable for unsupervised clustering and the identification of cell types and subpopulations. Using bigSCale, we identify all major cell types of the somatosensory cortex and hippocampus analyzing 3,005 cells from adult mouse brains. Remarkably, we increase the number of cell population specific marker genes 4-6-fold compared to the original analysis and, moreover, define markers of higher order cell types. These include CD90 (Thy1), a neuronal surface receptor, potentially suitable for isolating intact neurons from complex brain samples.To test its applicability for large data sets, we apply bigSCale on scRNAseq data from 1.3 million cells derived from the pallium of the mouse developing forebrain (E18, 10x Genomics). Our directed down-sampling strategy accumulates transcript counts from cells with similar transcriptional profiles into index cell transcriptomes, thereby defining cellular clusters with improved resolution. Accordingly, index cell clusters provide a rich resource of marker genes for the main brain cell types and less frequent subpopulations. Our analysis of rare populations includes poorly characterized developmental cell types, such as neuron progenitors from the subventricular zone and neocortical Reelin positive neurons known as Cajal-Retzius (CR) cells. The latter represent a transient population which regulates the laminar formation of the developing neocortex and whose malfunctioning causes major neurodevelopmental disorders like autism or schizophrenia. Most importantly, index cell cluster can be deconvoluted to individual cell level for targeted analysis of populations of interest. Through decomposition of Reelin positive neurons, we determined a previously not recognized heterogeneity among CR cells, which we could associate to distinct differentiation stages as well as spatial and functional differences in the developing mouse brain. Specifically, subtypes of CR cells identified by bigSCale express different compositions of NMDA, AMPA and glycine receptor subunits, pointing to subpopulations with distinct membrane properties. Furthermore, we found Cxcl12, a chemokine secreted by the meninges and regulating the tangential migration of CR cells, to be also expressed in CR cells located in the marginal zone of the neocortex, indicating a self-regulated migration capacity.Together, bigSCale presents a perfect solution for the processing and analysis of scRNAseq data from millions of single cells. Its speed and sensitivity makes it suitable to the address future challenges of large single-cell data sets.


2020 ◽  
Vol 11 (1) ◽  
Author(s):  
Shahin Mohammadi ◽  
Jose Davila-Velderrain ◽  
Manolis Kellis

Abstract Dissecting the cellular heterogeneity embedded in single-cell transcriptomic data is challenging. Although many methods and approaches exist, identifying cell states and their underlying topology is still a major challenge. Here, we introduce the concept of multiresolution cell-state decomposition as a practical approach to simultaneously capture both fine- and coarse-grain patterns of variability. We implement this concept in ACTIONet, a comprehensive framework that combines archetypal analysis and manifold learning to provide a ready-to-use analytical approach for multiresolution single-cell state characterization. ACTIONet provides a robust, reproducible, and highly interpretable single-cell analysis platform that couples dominant pattern discovery with a corresponding structural representation of the cell state landscape. Using multiple synthetic and real data sets, we demonstrate ACTIONet’s superior performance relative to existing alternatives. We use ACTIONet to integrate and annotate cells across three human cortex data sets. Through integrative comparative analysis, we define a consensus vocabulary and a consistent set of gene signatures discriminating against the transcriptomic cell types and subtypes of the human prefrontal cortex.


2018 ◽  
Author(s):  
Christopher Heje Grønbech ◽  
Maximillian Fornitz Vording ◽  
Pascal Timshel ◽  
Casper Kaae Sønderby ◽  
Tune Hannes Pers ◽  
...  

AbstractMotivationModels for analysing and making relevant biological inferences from massive amounts of complex single-cell transcriptomic data typically require several individual data-processing steps, each with their own set of hyperparameter choices. With deep generative models one can work directly with count data, make likelihood-based model comparison, learn a latent representation of the cells and capture more of the variability in different cell populations.ResultsWe propose a novel method based on variational auto-encoders (VAEs) for analysis of single-cell RNA sequencing (scRNA-seq) data. It avoids data preprocessing by using raw count data as input and can robustly estimate the expected gene expression levels and a latent representation for each cell. We tested several count likelihood functions and a variant of the VAE that has a priori clustering in the latent space. We show for several scRNA-seq data sets that our method outperforms recently proposed scRNA-seq methods in clustering cells and that the resulting clusters reflect cell types.Availability and implementationOur method, called scVAE, is implemented in Python using the TensorFlow machine-learning library, and it is freely available at https://github.com/scvae/scvae.


Sign in / Sign up

Export Citation Format

Share Document