Scalable probabilistic matrix factorization for single-cell RNA-seq analysis

2018 ◽  
Author(s):  
Pedro F. Ferreira ◽  
Alexandra M. Carvalho ◽  
Susana Vinga

Motivation: The gene expression profile of a cell dictates its function in molecular processes, and can be used to probe its health status. This represents a step forward in the deep characterization of diseases such as cancer and may lead to breakthroughs in their treatment. The technology used to measure the gene expression of isolated cells, single-cell RNA-seq (scRNA-seq), has emerged in the last decade as a key enabler of this progress. However, the use of existing methods for dimensionality reduction, clustering and differential expression is limited by the specificities of the data obtained from scRNA-seq experiments, where technical factors may confound analyses of the true biological signal and contribute to spurious results. To overcome this issue, a possible approach is designing probabilistic generative models of the data with hidden variables encoding different underlying processes. Results: We propose two novel probabilistic models for scRNA-seq data: modified probabilistic count matrix factorization (m-pCMF) and Bayesian zero-inflated negative binomial factorization (ZINBayes). These build upon previous models in the literature while leveraging scalable Bayesian inference via variational methods. We show that the proposed methods are competitive with the state-of-the-art models for robust dimensionality reduction in modern data sets, and improve upon the current best Bayesian model for small numbers of cells. The results show that building probabilistic models of latent variables which encode domain knowledge and using variational inference constitute a promising approach to analyse scRNA-seq data in a scalable way. Availability: m-pCMF and ZINBayes are publicly available as Python packages at https://github.com/pedrofale/, along with the code to reproduce all the results. Contact: [email protected]
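
The zero-inflated negative binomial (ZINB) likelihood that ZINBayes is named after can be written down in a few lines. The sketch below is illustrative only and is not taken from the authors' packages: the mean/inverse-dispersion parametrization (`mu`, `theta`), the zero-inflation probability `pi`, and the function names are assumptions made for this example.

```python
import math

def nb_log_pmf(k, mu, theta):
    """Log-pmf of a negative binomial with mean mu and inverse dispersion theta (assumed parametrization)."""
    return (
        math.lgamma(k + theta) - math.lgamma(theta) - math.lgamma(k + 1)
        + theta * math.log(theta / (theta + mu))
        + k * math.log(mu / (theta + mu))
    )

def zinb_log_pmf(k, mu, theta, pi):
    """Log-pmf of a zero-inflated NB; pi is the probability of a structural (dropout) zero."""
    nb = nb_log_pmf(k, mu, theta)
    if k == 0:
        # A zero is either a structural zero or a genuine NB zero.
        return math.log(pi + (1.0 - pi) * math.exp(nb))
    # A positive count can only come from the NB component.
    return math.log(1.0 - pi) + nb

# Sanity check: the probabilities over a long range of counts should sum to about 1.
total = sum(math.exp(zinb_log_pmf(k, mu=3.0, theta=2.0, pi=0.3)) for k in range(500))
p_zero = math.exp(zinb_log_pmf(0, mu=3.0, theta=2.0, pi=0.3))
```

In a factorization model of this kind, `mu` would typically be produced by latent cell and gene factors; that part is omitted here because the abstract does not specify it.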

eLife ◽  
2019 ◽  
Vol 8 ◽  
Author(s):  
Dylan Kotliar ◽  
Adrian Veres ◽  
M Aurel Nagy ◽  
Shervin Tabrizi ◽  
Eran Hodis ◽  
...  

Identifying gene expression programs underlying both cell-type identity and cellular activities (e.g. life-cycle processes, responses to environmental cues) is crucial for understanding the organization of cells and tissues. Although single-cell RNA-Seq (scRNA-Seq) can quantify transcripts in individual cells, each cell’s expression profile may be a mixture of both types of programs, making them difficult to disentangle. Here, we benchmark and enhance the use of matrix factorization to solve this problem. We show with simulations that a method we call consensus non-negative matrix factorization (cNMF) accurately infers identity and activity programs, including their relative contributions in each cell. To illustrate the insights this approach enables, we apply it to published brain organoid and visual cortex scRNA-Seq datasets; cNMF refines cell types and identifies both expected (e.g. cell cycle and hypoxia) and novel activity programs, including programs that may underlie a neurosecretory phenotype and synaptogenesis.


2018 ◽  
Author(s):  
Dylan Kotliar ◽  
Adrian Veres ◽  
M. Aurel Nagy ◽  
Shervin Tabrizi ◽  
Eran Hodis ◽  
...  

Identifying gene expression programs underlying both cell-type identity and cellular activities (e.g. life-cycle processes, responses to environmental cues) is crucial for understanding the organization of cells and tissues. Although single-cell RNA-Seq (scRNA-Seq) can quantify transcripts in individual cells, each cell’s expression profile may be a mixture of both types of programs, making them difficult to disentangle. Here we illustrate and enhance the use of matrix factorization as a solution to this problem. We show with simulations that a method that we call consensus non-negative matrix factorization (cNMF) accurately infers identity and activity programs, including the relative contribution of programs in each cell. Applied to published brain organoid and visual cortex scRNA-Seq datasets, cNMF refines the hierarchy of cell types and identifies both expected (e.g. cell cycle and hypoxia) and intriguing novel activity programs. We propose that one of the novel programs may reflect a neurosecretory phenotype and a second may underlie the formation of neuronal synapses. We make cNMF available to the community and illustrate how this approach can provide key insights into gene expression variation within and between cell types.


2020 ◽  
Author(s):  
Tianyi Sun ◽  
Dongyuan Song ◽  
Wei Vivian Li ◽  
Jingyi Jessica Li

In the burgeoning field of single-cell transcriptomics, a pressing challenge is to benchmark various experimental protocols and numerous computational methods in an unbiased manner. Although dozens of simulators have been developed for single-cell RNA-seq (scRNA-seq) data, they lack the capacity to simultaneously achieve all three goals: preserving genes, capturing gene correlations, and generating any number of cells with varying sequencing depths. To fill this gap, here we propose scDesign2, an interpretable simulator that achieves all three goals and generates high-fidelity synthetic data for multiple scRNA-seq protocols and other single-cell gene expression count-based technologies. Compared with existing simulators, scDesign2 is advantageous in its transparent use of probabilistic models and is unique in its ability to capture gene correlations via copula. We verify that scDesign2 generates more realistic synthetic data for four scRNA-seq protocols (10x Genomics, CEL-Seq2, Fluidigm C1, and Smart-Seq2) and two single-cell spatial transcriptomics protocols (MERFISH and pciSeq) than existing simulators do. Under two typical computational tasks, cell clustering and rare cell type detection, we demonstrate that scDesign2 provides informative guidance on deciding the optimal sequencing depth and cell number in single-cell RNA-seq experimental design, and that scDesign2 can effectively benchmark computational methods under varying sequencing depths and cell numbers. With these advantages, scDesign2 is a powerful tool for single-cell researchers to design experiments, develop computational methods, and choose appropriate methods for specific data analysis needs.
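
To make the copula idea concrete, the following is a minimal sketch of how correlated counts for two genes could be drawn by pushing correlated Gaussian variables through negative binomial quantile functions. It is not scDesign2's code: the choice of a Gaussian copula, the negative binomial marginals, the parameter values, and all function names are assumptions for illustration.

```python
import math
import random
from statistics import NormalDist

def nb_pmf(k, mu, theta):
    """Negative binomial pmf with mean mu and inverse dispersion theta."""
    log_p = (
        math.lgamma(k + theta) - math.lgamma(theta) - math.lgamma(k + 1)
        + theta * math.log(theta / (theta + mu))
        + k * math.log(mu / (theta + mu))
    )
    return math.exp(log_p)

def nb_quantile(u, mu, theta, max_count=10_000):
    """Smallest count k whose cumulative probability reaches u (capped at max_count)."""
    k, cdf = 0, nb_pmf(0, mu, theta)
    while cdf < u and k < max_count:
        k += 1
        cdf += nb_pmf(k, mu, theta)
    return k

def simulate_two_genes(n_cells, rho, marginals, seed=0):
    """Draw counts for two genes whose dependence comes from a Gaussian copula with correlation rho."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    cells = []
    for _ in range(n_cells):
        # Correlated standard normals (2x2 Cholesky factor written out by hand).
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        # Map each normal to a uniform, then to a count through the marginal quantile function.
        cells.append([
            nb_quantile(std_normal.cdf(z), mu, theta)
            for z, (mu, theta) in zip((z1, z2), marginals)
        ])
    return cells

def pearson(xs, ys):
    """Plain Pearson correlation, used only to inspect the simulated counts."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

cells = simulate_two_genes(2000, rho=0.8, marginals=[(5.0, 2.0), (3.0, 1.5)])
r = pearson([c[0] for c in cells], [c[1] for c in cells])
```

Because the marginals and the dependence structure are specified separately, the number of simulated cells can be changed freely without refitting either part, which is the property the abstract emphasizes.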


2022 ◽  
Author(s):  
Martin Treppner ◽  
Harald Binder ◽  
Moritz Hess

Deep generative models can learn the underlying structure, such as pathways or gene programs, from omics data. We provide an introduction as well as an overview of such techniques, specifically illustrating their use with single-cell gene expression data. For example, the low dimensional latent representations offered by various approaches, such as variational auto-encoders, are useful to get a better understanding of the relations between observed gene expressions and experimental factors or phenotypes. Furthermore, by providing a generative model for the latent and observed variables, deep generative models can generate synthetic observations, which allow us to assess the uncertainty in the learned representations. While deep generative models are useful to learn the structure of high-dimensional omics data by efficiently capturing non-linear dependencies between genes, they are sometimes difficult to interpret due to their neural network building blocks. More precisely, it is difficult to understand the relationship between learned latent variables and observed variables, e.g., gene transcript abundances and external phenotypes. Therefore, we also illustrate current approaches that allow us to infer the relationship between learned latent variables and observed variables as well as external phenotypes. Thereby, we render deep learning approaches more interpretable. In an application with single-cell gene expression data, we demonstrate the utility of the discussed methods.


2019 ◽  
Author(s):  
Koen Van den Berge ◽  
Hector Roux de Bézieux ◽  
Kelly Street ◽  
Wouter Saelens ◽  
Robrecht Cannoodt ◽  
...  

Trajectory inference has radically enhanced single-cell RNA-seq research by enabling the study of dynamic changes in gene expression levels during biological processes such as the cell cycle, cell type differentiation, and cellular activation. Downstream of trajectory inference, it is vital to discover genes that are associated with the lineages in the trajectory to illuminate the underlying biological processes. Furthermore, genes that are differentially expressed between developmental/activational lineages might be highly relevant to further unravel the system under study. Current data analysis procedures, however, typically cluster cells and assess differential expression between the clusters, which fails to exploit the continuous resolution provided by trajectory inference to its full potential. The few available non-cluster-based methods only assess broad differences in gene expression between lineages, hence failing to pinpoint the exact types of divergence. We introduce a powerful generalized additive model framework based on the negative binomial distribution that allows flexible inference of (i) within-lineage differential expression by detecting associations between gene expression and pseudotime over an entire lineage or by comparing gene expression between points/regions within the lineage and (ii) between-lineage differential expression by comparing gene expression between lineages over the entire lineages or at specific points/regions. By incorporating observation-level weights, the model can additionally account for zero inflation, commonly observed in single-cell RNA-seq data from full-length protocols. We evaluate the method on simulated and real datasets from droplet-based and full-length protocols, and show that the flexible inference framework is capable of yielding biological insights through a clear interpretation of the data.


2019 ◽  
Author(s):  
Ugur M. Ayturk ◽  
Joseph P. Scollan ◽  
Alexander Vesprey ◽  
Christina M. Jacobsen ◽  
Paola Divieti Pajevic ◽  
...  

Single cell RNA-seq (scRNA-seq) is emerging as a powerful technology to examine transcriptomes of individual cells. We determined whether scRNA-seq could be used to detect the effect of environmental and pharmacologic perturbations on osteoblasts. We began with a commonly used in vitro system in which freshly isolated neonatal mouse calvarial cells are expanded and induced to produce a mineralized matrix. We used scRNA-seq to compare the relative cell type abundances and the transcriptomes of freshly isolated cells to those that had been cultured for 12 days in vitro. We observed that the percentage of macrophage-like cells increased from 6% in freshly isolated calvarial cells to 34% in cultured cells. We also found that Bglap transcripts were abundant in freshly isolated osteoblasts but nearly undetectable in the cultured calvarial cells. Thus, scRNA-seq revealed significant differences between heterogeneity of cells in vivo and in vitro. We next performed scRNA-seq on freshly recovered long bone endocortical cells from mice that received either vehicle or Sclerostin-neutralizing antibody for 1 week. Bone anabolism-associated transcripts were also not significantly increased in immature and mature osteoblasts recovered from Sclerostin-neutralizing antibody treated mice; this is likely a consequence of being underpowered to detect modest changes in gene expression, since only 7% of the sequenced endocortical cells were osteoblasts, and a limited portion of their transcriptomes was sampled. We conclude that scRNA-seq can detect changes in cell abundance, identity, and gene expression in skeletally derived cells. In order to detect modest changes in osteoblast gene expression at the single cell level in the appendicular skeleton, larger numbers of osteoblasts from endocortical bone are required.


2019 ◽  
Author(s):  
Marcus Alvarez ◽  
Elior Rahmani ◽  
Brandon Jew ◽  
Kristina M. Garske ◽  
Zong Miao ◽  
...  

Single-nucleus RNA sequencing (snRNA-seq) measures gene expression in individual nuclei instead of cells, allowing for unbiased cell type characterization in solid tissues. In contrast to single-cell RNA-seq (scRNA-seq), we observe that snRNA-seq is commonly subject to contamination by high amounts of extranuclear background RNA, which can lead to identification of spurious cell types in downstream clustering analyses if overlooked. We present a novel approach to remove debris-contaminated droplets in snRNA-seq experiments, called Debris Identification using Expectation Maximization (DIEM). Our likelihood-based approach models the gene expression distribution of debris and cell types, which are estimated using EM. We evaluated DIEM using three snRNA-seq data sets: 1) human differentiating preadipocytes in vitro, 2) fresh mouse brain tissue, and 3) human frozen adipose tissue (AT) from six individuals. All three data sets showed various degrees of extranuclear RNA contamination. We observed that existing methods failed to account for contaminated droplets and led to spurious cell types. When compared to filtering using these state-of-the-art methods, DIEM better removed droplets containing high levels of extranuclear RNA and led to higher quality clusters. Although DIEM was designed for snRNA-seq data, we also successfully applied DIEM to single-cell data. To conclude, our novel method DIEM removes debris-contaminated droplets from single-cell-based data quickly and effectively, leading to cleaner downstream analysis. Our code is freely available for use at https://github.com/marcalva/diem.
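
As a rough illustration of how EM can separate debris droplets from cell-containing droplets by their expression profiles, here is a generic multinomial mixture fitted with EM on toy counts. It is a sketch under stated assumptions, not the DIEM implementation: the multinomial form, the two-component setup, the initial profiles, the pseudocount, and the toy data are all invented for this example.

```python
import math

def em_multinomial_mixture(counts, init_profiles, n_iter=50, pseudo=1e-2):
    """Fit a multinomial mixture with EM.

    counts: per-droplet gene count vectors.
    init_profiles: initial gene-probability vectors, one per component (all entries > 0).
    Returns mixture weights, fitted profiles, and per-droplet responsibilities.
    """
    n_comp, n_genes = len(init_profiles), len(counts[0])
    profiles = [list(p) for p in init_profiles]
    weights = [1.0 / n_comp] * n_comp
    resp = []
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each droplet.
        resp = []
        for x in counts:
            logp = [
                math.log(weights[c])
                + sum(xg * math.log(profiles[c][g]) for g, xg in enumerate(x))
                for c in range(n_comp)
            ]
            top = max(logp)  # subtract the maximum for numerical stability
            unnorm = [math.exp(v - top) for v in logp]
            norm = sum(unnorm)
            resp.append([v / norm for v in unnorm])
        # M-step: re-estimate weights and gene profiles from the soft assignments.
        for c in range(n_comp):
            mass = sum(r[c] for r in resp)
            weights[c] = (mass + pseudo) / (len(counts) + n_comp * pseudo)
            gene_tot = [
                pseudo + sum(r[c] * x[g] for r, x in zip(resp, counts))
                for g in range(n_genes)
            ]
            total = sum(gene_tot)
            profiles[c] = [v / total for v in gene_tot]
    return weights, profiles, resp

# Toy data (hypothetical): the first three droplets are dominated by gene 0, as a debris-like profile.
counts = [[8, 1, 1], [9, 0, 1], [7, 2, 1], [1, 8, 6], [0, 9, 7], [2, 7, 8]]
init = [[0.5, 0.25, 0.25], [0.25, 0.4, 0.35]]  # component 0 = debris, component 1 = cell
weights, profiles, resp = em_multinomial_mixture(counts, init)
debris_like = [i for i, r in enumerate(resp) if r[0] > 0.5]
```

Droplets whose responsibility for the debris component exceeds a chosen threshold would then be filtered out before clustering; the 0.5 cutoff used here is arbitrary.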

