scDesign2: an interpretable simulator that generates high-fidelity single-cell gene expression count data with gene correlations captured

Abstract: In the burgeoning field of single-cell transcriptomics, a pressing challenge is to benchmark various experimental protocols and numerous computational methods in an unbiased manner. Although dozens of simulators have been developed for single-cell RNA-seq (scRNA-seq) data, none of them can simultaneously achieve all three goals: preserving genes, capturing gene correlations, and generating any number of cells with varying sequencing depths. To fill this gap, we propose scDesign2, an interpretable simulator that achieves all three goals and generates high-fidelity synthetic data for multiple scRNA-seq protocols and other single-cell gene expression count-based technologies. Compared with existing simulators, scDesign2 is advantageous in its transparent use of probabilistic models and is unique in its ability to capture gene correlations via copulas. We verify that scDesign2 generates more realistic synthetic data than existing simulators do for four scRNA-seq protocols (10x Genomics, CEL-Seq2, Fluidigm C1, and Smart-Seq2) and two single-cell spatial transcriptomics protocols (MERFISH and pciSeq). Under two typical computational tasks, cell clustering and rare cell type detection, we demonstrate that scDesign2 provides informative guidance on deciding the optimal sequencing depth and cell number in scRNA-seq experimental design, and that scDesign2 can effectively benchmark computational methods under varying sequencing depths and cell numbers. With these advantages, scDesign2 is a powerful tool for single-cell researchers to design experiments, develop computational methods, and choose appropriate methods for specific data analysis needs.
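The abstract states that scDesign2 captures gene correlations via a copula. As a rough illustration of that idea only, the sketch below assumes a Gaussian copula over two genes with Poisson marginals: correlated standard normals are mapped to uniforms and then to counts through each gene's inverse CDF. The function names, the two-gene restriction, and the parameter values are all hypothetical; this is not the scDesign2 package or its API.

```python
import math
import random
from statistics import NormalDist


def poisson_quantile(u, lam, max_count=1000):
    """Smallest k with P(X <= k) >= u for X ~ Poisson(lam)."""
    k = 0
    pmf = math.exp(-lam)
    cdf = pmf
    # max_count guards against a non-terminating loop when u rounds to 1.0.
    while cdf < u and k < max_count:
        k += 1
        pmf *= lam / k
        cdf += pmf
    return k


def simulate_two_genes(n_cells, lam1, lam2, rho, seed=0):
    """Draw correlated counts for two genes through a Gaussian copula (assumed model)."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    counts = []
    for _ in range(n_cells):
        # Correlated standard normals (2x2 Cholesky factor written out by hand).
        e1, e2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        z1 = e1
        z2 = rho * e1 + math.sqrt(1.0 - rho * rho) * e2
        # Map to uniforms, then to counts via each gene's inverse CDF.
        u1, u2 = std_normal.cdf(z1), std_normal.cdf(z2)
        counts.append((poisson_quantile(u1, lam1), poisson_quantile(u2, lam2)))
    return counts


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


# Hypothetical example: 2000 cells, gene means 5 and 10, latent correlation 0.8.
cells = simulate_two_genes(n_cells=2000, lam1=5.0, lam2=10.0, rho=0.8)
gene1 = [c[0] for c in cells]
gene2 = [c[1] for c in cells]
print("mean counts:", sum(gene1) / len(gene1), sum(gene2) / len(gene2))
print("count correlation:", round(pearson(gene1, gene2), 3))
```

Because the marginals and the dependence structure are specified separately, the number of simulated cells can be changed freely without refitting anything, which is one reason a copula is a natural fit for the "any number of cells" goal named in the abstract.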


scDesign2: a transparent simulator that generates high-fidelity single-cell gene expression count data with gene correlations captured

Genome Biology, 2021, Vol 22 (1). DOI: 10.1186/s13059-021-02367-2

Author(s): Tianyi Sun, Dongyuan Song, Wei Vivian Li, Jingyi Jessica Li

Keyword(s): Gene Expression, Single Cell, Count Data, Probabilistic Models, Synthetic Data, High Fidelity, Cell Gene Expression, Experimental Protocols, Cell Gene, Number Of Cells

Abstract: A pressing challenge in single-cell transcriptomics is to benchmark experimental protocols and computational methods. A solution is to use computational simulators, but existing simulators cannot simultaneously achieve three goals: preserving genes, capturing gene correlations, and generating any number of cells with varying sequencing depths. To fill this gap, we propose scDesign2, a transparent simulator that achieves all three goals and generates high-fidelity synthetic data for multiple single-cell gene expression count-based technologies. In particular, scDesign2 is advantageous in its transparent use of probabilistic models and its ability to capture gene correlations via copulas.


Publisher Correction: scDesign2: a transparent simulator that generates high-fidelity single-cell gene expression count data with gene correlations captured

Genome Biology, 2021, Vol 22 (1). DOI: 10.1186/s13059-021-02394-z

Author(s): Tianyi Sun, Dongyuan Song, Wei Vivian Li, Jingyi Jessica Li

Keyword(s): Gene Expression, Single Cell, Count Data, High Fidelity, Cell Gene Expression, Cell Gene


Biological Process Activity Transformation of Single Cell Gene Expression for Cross-Species Alignment

2019. DOI: 10.1101/555268

Author(s): Hongxu Ding, Andrew Blair, Ying Yang, Joshua M. Stuart

Keyword(s): Gene Expression, Single Cell, Biological Process, Biological Processes, RNA-Seq, Gene Set, Cell Gene Expression, Cell Gene

Abstract: The maintenance and transition of cellular states are controlled by biological processes. Here we present a gene set-based transformation of single cell RNA-Seq data into biological process activities that provides a robust description of cellular states. Moreover, as these activities represent species-independent descriptors, they facilitate the alignment of single cell states across different organisms.
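The abstract describes turning per-gene expression into per-process activities using gene sets, but does not say how the activity is scored. The sketch below is a simple stand-in under a stated assumption: each gene is standardized across cells and a process's activity in a cell is the mean z-score of its member genes. The function names, the scoring rule, and the toy data are all hypothetical and are not taken from the paper.

```python
from statistics import mean, pstdev


def zscore_genes(expr):
    """Standardize each gene across cells. expr maps gene -> list of values per cell."""
    scaled = {}
    for gene, values in expr.items():
        mu, sd = mean(values), pstdev(values)
        # Genes with zero variance carry no signal; set their scores to 0.
        scaled[gene] = [(v - mu) / sd if sd > 0 else 0.0 for v in values]
    return scaled


def process_activity(expr, gene_sets):
    """Return process -> list of activities per cell (mean z-score over set members)."""
    scaled = zscore_genes(expr)
    n_cells = len(next(iter(expr.values())))
    activity = {}
    for name, genes in gene_sets.items():
        present = [g for g in genes if g in scaled]
        if not present:
            continue  # none of this set's genes were measured
        activity[name] = [
            mean([scaled[g][cell] for g in present]) for cell in range(n_cells)
        ]
    return activity


# Toy data: three cells, four genes, made-up gene sets.
toy_expr = {"A": [1, 2, 3], "B": [2, 4, 6], "C": [5, 5, 5], "D": [3, 2, 1]}
toy_sets = {"proc1": ["A", "B"], "proc2": ["D", "unmeasured"], "proc3": ["absent"]}
for process, scores in process_activity(toy_expr, toy_sets).items():
    print(process, [round(s, 3) for s in scores])
```

Because the output is indexed by process name rather than by gene, two datasets from different species can be compared in the same activity space as long as each species has its own gene-to-process mapping, which is the property the abstract relies on for cross-species alignment.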


scGMAI: a Gaussian mixture model for clustering single-cell RNA-Seq data based on deep autoencoder

Briefings in Bioinformatics, 2020. DOI: 10.1093/bib/bbaa316

Author(s): Bin Yu, Chen Chen, Ren Qi, Ruiqing Zheng, Patrick J Skillman-Lawrence, ...

Keyword(s): Gene Expression, Single Cell, Rapid Development, Cell Types, Gaussian Mixture, Computational Techniques, RNA-Seq, Fast Independent Component Analysis, Cell Gene Expression, Cell Gene

Abstract: The rapid development of single-cell RNA sequencing (scRNA-Seq) technology provides strong technical support for accurate and efficient analysis of single-cell gene expression data. However, the analysis of scRNA-Seq data is accompanied by many obstacles, including dropout events and the curse of dimensionality. Here, we propose scGMAI, a new single-cell Gaussian mixture clustering method based on autoencoder networks and fast independent component analysis (FastICA). Specifically, scGMAI uses autoencoder networks to reconstruct gene expression values from scRNA-Seq data, and FastICA to reduce the dimensions of the reconstructed data. The integration of these computational techniques allows scGMAI to outperform existing tools, including Seurat, in clustering cells from 17 public scRNA-Seq datasets. In summary, scGMAI is an effective tool for accurately clustering and identifying cell types from scRNA-Seq data and shows great potential for scRNA-Seq data analysis. The source code is available at https://github.com/QUST-AIBBDRC/scGMAI/.
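The last stage of the pipeline described above is Gaussian mixture clustering, applied after autoencoder reconstruction and FastICA. As a minimal sketch of that final stage only, the code below fits a two-component, one-dimensional Gaussian mixture with expectation-maximization on synthetic values. The autoencoder and FastICA stages are omitted, the function names and the toy data are hypothetical, and none of this is taken from the scGMAI source code.

```python
import math
import random


def normal_pdf(x, mu, var):
    """Density of a normal distribution with mean mu and variance var."""
    return math.exp(-((x - mu) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_gmm_1d(xs, n_iter=50, var_floor=1e-6):
    """Fit a two-component 1-D Gaussian mixture with EM; return parameters and labels."""
    # Deterministic initialization: means at the data extremes, shared variance.
    mus = [min(xs), max(xs)]
    overall_mean = sum(xs) / len(xs)
    var0 = max(sum((x - overall_mean) ** 2 for x in xs) / len(xs), var_floor)
    variances = [var0, var0]
    weights = [0.5, 0.5]
    resp = []
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each point.
        resp = []
        for x in xs:
            p = [weights[k] * normal_pdf(x, mus[k], variances[k]) for k in range(2)]
            total = sum(p) or 1e-300
            resp.append([pk / total for pk in p])
        # M-step: re-estimate weights, means, and variances.
        for k in range(2):
            nk = sum(r[k] for r in resp) or 1e-300
            weights[k] = nk / len(xs)
            mus[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            variances[k] = max(
                sum(r[k] * (x - mus[k]) ** 2 for r, x in zip(resp, xs)) / nk,
                var_floor,
            )
    # Hard cluster labels from the final responsibilities.
    labels = [0 if r[0] >= r[1] else 1 for r in resp]
    return mus, variances, weights, labels


# Synthetic stand-in for a single reduced dimension with two cell populations.
rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(200)] + [rng.gauss(5.0, 1.0) for _ in range(200)]
est_mus, est_vars, est_weights, est_labels = fit_gmm_1d(data)
print("estimated means:", [round(m, 2) for m in est_mus])
print("estimated weights:", [round(w, 2) for w in est_weights])
```

In practice a multivariate mixture over several reduced components would be fitted with an established library; the one-dimensional version is used here only to keep the E-step and M-step visible.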


SOMSC: Self-Organization-Map for High-Dimensional Single-Cell Data of Cellular States and Their Transitions

2017. DOI: 10.1101/124693. Cited By ~ 1

Author(s): Tao Peng, Qing Nie

Keyword(s): Gene Expression, Single Cell, Gene Expression Data, Single Cells, High Dimensional, Expression Data, RNA-Seq, Cell Gene Expression, Cell Data, Cell Gene

Abstract: Measurement of gene expression levels for multiple genes in single cells provides a powerful approach to study heterogeneity of cell populations and cellular plasticity. While the expression levels of multiple genes in each cell are available in such data, the potential connections among the cells (e.g. the cellular state transition relationship) are not directly evident from the measurement. Classifying the cellular states, identifying the transitions among those states, and extracting the pseudotime ordering of cells are challenging due to the noise in the data and the high dimensionality arising from the number of genes. In this paper we adapt the classical self-organizing-map (SOM) approach for single-cell gene expression data (SOMSC), such as those based on single cell qPCR and single cell RNA-seq. In SOMSC, a cellular state map (CSM) is derived and employed to identify cellular states inherited in the population of the measured single cells. Cells located in the same basin of the CSM are considered to be in one cellular state, while barriers among the basins in the CSM provide information on transitions among the cellular states. A cellular state transition path (e.g. differentiation) and a temporal ordering of the measured single cells are consequently obtained. In addition, SOMSC can estimate the cellular state replication probability and transition probabilities. Applied to a set of synthetic data, one single-cell qPCR data set on mouse early embryonic development, and two single-cell RNA-seq data sets, SOMSC effectively captures the cellular states and their transitions present in the high-dimensional single-cell data. This approach will have broader applications in analyzing cellular fate specification and cell lineages using single cell gene expression data.


Feature extraction approach in single-cell gene expression profiling for cell-type marker identification

2019. DOI: 10.1101/686659

Author(s): Nigatu A. Adossa, Leif Schauser, Vivi G. Gregersen, Laura L. Elo

Keyword(s): Gene Expression, Single Cell, Cell Types, Type I, RNA-Seq, Cell Type, Cell Gene Expression, Cell Type Specific, Cell Gene, Marker Identification

Abstract

Background: Recent advances in single-cell gene expression profiling technology have revolutionized the understanding of molecular processes underlying developmental cell and tissue differentiation, enabling the discovery of novel cell types and molecular markers that characterize developmental trajectories. Common approaches for identifying marker genes are based on pairwise statistical testing for differential gene expression between cell types in heterogeneous cell populations, which is challenging due to unequal sample sizes and variance between groups, resulting in little statistical power and inflated type I errors.

Results: We developed an alternative feature extraction method, Marker gene Identification for Cell-type Identity (MICTI), that encodes the cell-type-specific expression information for each gene in every single cell. This approach identifies features (genes) that are specific to a given cell type in a heterogeneous cell population. To validate this approach, we used (i) simulated single cell RNA-seq data, (ii) human pancreatic islet single-cell RNA-seq data, and (iii) a simulated mixture of human single-cell RNA-seq data related to immune cells, particularly B cells, CD4+ memory cells, CD8+ memory cells, dendritic cells, fibroblast cells, and lymphoblast cells. In all cases, we were able to identify established cell-type-specific markers.

Conclusions: Our approach represents a highly efficient and fast alternative to differential expression analysis for molecular marker identification in heterogeneous single-cell RNA-seq data.


BGP: Branched Gaussian processes for identifying gene-specific branching dynamics in single cell data

2017. DOI: 10.1101/166868. Cited By ~ 3

Author(s): Alexis Boukouvalas, James Hensman, Magnus Rattray

Keyword(s): Gene Expression, Single Cell, Prior Information, Synthetic Data, Parametric Model, Credible Region, Cell Gene Expression, Probabilistic Nature, Cell Data, Cell Gene

Abstract: High-throughput single-cell gene expression experiments can be used to uncover branching dynamics in cell populations undergoing differentiation through use of pseudotime methods. We develop the branching Gaussian process (BGP), a non-parametric model that is able to identify branching dynamics for individual genes and provides an estimate of branching times for each gene with an associated credible region. We demonstrate the effectiveness of our method on both synthetic data and a published single-cell gene expression hematopoiesis study. The method requires prior information about pseudotime and global cellular branching for each cell but the probabilistic nature of the method means that it is robust to errors in these global branch labels and can be used to discover early branching genes which diverge before the inferred global cell branching. The code is open-source and available at https://github.com/ManchesterBioinference/BranchedGP.


Inferring relevant cell types for complex traits using single-cell gene expression

2017. DOI: 10.1101/136283. Cited By ~ 2

Author(s): Diego Calderon, Anand Bhaskar, David A. Knowles, David Golan, Towfique Raj, ...

Keyword(s): Gene Expression, Single Cell, Complex Traits, Late Onset, Cell Types, RNA-Seq, Cortical Cells, Functional Regions, Cell Gene Expression, Cell Gene

Abstract: Previous studies have prioritized trait-relevant cell types by looking for an enrichment of GWAS signal within functional regions. However, these studies are limited in cell resolution by the lack of functional annotations from difficult-to-characterize or rare cell populations. Measurement of single-cell gene expression has become a popular method for characterizing novel cell types, and yet, hardly any work exists linking single-cell RNA-seq to phenotypes of interest. To address this deficiency, we present RolyPoly, a regression-based polygenic model that can prioritize trait-relevant cell types and genes from GWAS summary statistics and single-cell RNA-seq. We demonstrate RolyPoly’s accuracy through simulation and validate previously known tissue-trait associations. We discover a significant association between microglia and late-onset Alzheimer’s disease, and an association between oligodendrocytes and replicating fetal cortical cells with schizophrenia. Additionally, RolyPoly computes a trait-relevance score for each gene which reflects the importance of expression specific to a cell type. We found that differentially expressed genes in the prefrontal cortex of Alzheimer’s patients were significantly enriched for highly ranked genes by RolyPoly gene scores. Overall, our method represents a powerful framework for understanding the effect of common variants on cell types contributing to complex traits.


Biological process activity transformation of single cell gene expression for cross-species alignment

Nature Communications, 2019, Vol 10 (1). DOI: 10.1038/s41467-019-12924-w. Cited By ~ 7

Author(s): Hongxu Ding, Andrew Blair, Ying Yang, Joshua M. Stuart

Keyword(s): Gene Expression, Single Cell, Biological Process, Biological Processes, RNA-Seq, Gene Set, Cell Gene Expression, Cell Gene

Abstract: The maintenance and transition of cellular states are controlled by biological processes. Here we present a gene set-based transformation of single cell RNA-Seq data into biological process activities that provides a robust description of cellular states. Moreover, as these activities represent species-independent descriptors, they facilitate the alignment of single cell states across different organisms.
