Stability of single-cell dimension reduction after data shuffling

Author(s):  
Yu Ting Hsu ◽  
Jia-Ming Chang
2020 ◽  
Vol 21 (1) ◽  
Author(s):  
F. William Townes ◽  
Stephanie C. Hicks ◽  
Martin J. Aryee ◽  
Rafael A. Irizarry

2019 ◽  
Vol 20 (1) ◽  
Author(s):  
F. William Townes ◽  
Stephanie C. Hicks ◽  
Martin J. Aryee ◽  
Rafael A. Irizarry

Abstract: Single-cell RNA-Seq (scRNA-Seq) profiles gene expression of individual cells. Recent scRNA-Seq datasets have incorporated unique molecular identifiers (UMIs). Using negative controls, we show UMI counts follow multinomial sampling with no zero inflation. Current normalization procedures such as log of counts per million and feature selection by highly variable genes produce false variability in dimension reduction. We propose simple multinomial methods, including generalized principal component analysis (GLM-PCA) for non-normal distributions, and feature selection using deviance. These methods outperform the current practice in a downstream clustering assessment using ground truth datasets.
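The idea of feature selection using deviance can be illustrated with a minimal sketch. It assumes a binomial approximation to the multinomial model and a null hypothesis in which each gene takes a constant share of every cell's total UMI count; genes that depart strongly from that null receive a large deviance. The function name `binomial_deviance` and the cells-by-genes layout are illustrative assumptions, not taken from the authors' software.

```python
import numpy as np

def binomial_deviance(counts):
    """Per-gene binomial deviance against a constant-expression null.

    counts: array of shape (n_cells, n_genes) holding raw UMI counts.
    Returns an array of length n_genes; larger values suggest more
    informative genes. (Hypothetical helper for illustration only.)
    """
    counts = np.asarray(counts, dtype=float)
    n = counts.sum(axis=1, keepdims=True)   # total UMI count per cell
    pi = counts.sum(axis=0) / n.sum()       # null proportion per gene
    mu = n * pi                             # expected counts under the null

    def a_log_a_over_b(a, b):
        # Computes a * log(a / b), taking the value 0 wherever a == 0.
        out = np.zeros_like(a)
        mask = a > 0
        out[mask] = a[mask] * np.log(a[mask] / b[mask])
        return out

    observed_term = a_log_a_over_b(counts, mu)
    remainder_term = a_log_a_over_b(n - counts, n - mu)
    return 2.0 * (observed_term + remainder_term).sum(axis=0)
```

Ranking genes by this score and keeping the top few thousand is one way such a criterion could stand in for highly-variable-gene selection; the cutoff is a user choice and is not specified in the abstract.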


2017 ◽  
Author(s):  
Dongfang Wang ◽  
Jin Gu

Abstract: Single-cell RNA sequencing (scRNA-seq) is a powerful technique for analyzing transcriptomic heterogeneity at the single-cell level. Finding an effective low-dimensional representation and visualization of the original data is an important step in studying cell sub-populations and lineages from scRNA-seq data. scRNA-seq data are much noisier than traditional bulk RNA-Seq data: at the single-cell level, transcriptional fluctuations are much larger than in the average of a cell population, and the low amount of RNA transcripts increases the rate of technical dropout events. In this study, we propose VASC (deep Variational Autoencoder for scRNA-seq data), a deep multi-layer generative model, for unsupervised dimension reduction and visualization of scRNA-seq data. It explicitly models dropout events and finds nonlinear hierarchical feature representations of the original data. Tested on twenty datasets, VASC shows superior performance in most cases and broader dataset compatibility compared with four state-of-the-art dimension reduction methods. In a case study of pre-implantation embryos, VASC successfully re-establishes the cell dynamics and identifies several candidate marker genes associated with early embryo development.


2019 ◽  
Author(s):  
Svetlana Ovchinnikova ◽  
Simon Anders

Abstract: Dimension-reduction methods, such as t-SNE or UMAP, are widely used when exploring high-dimensional data describing many entities, e.g., RNA-seq data for many single cells. However, dimension reduction is commonly prone to introducing artefacts, and we hence need means to see where a dimension-reduced embedding is a faithful representation of the local neighbourhood and where it is not. We present Sleepwalk, a simple but powerful tool that allows the user to interactively explore an embedding, using colour to depict original or any other distances from all points to the cell under the mouse cursor. We show how this approach not only highlights distortions, but also reveals otherwise hidden characteristics of the data, and how Sleepwalk’s comparative modes help integrate multi-sample data and understand differences between embedding and preprocessing methods. Sleepwalk is a versatile and intuitive tool that unlocks the full power of dimension reduction and will be of value not only in single-cell RNA-seq but also in any other area with matrix-shaped big data.


2018 ◽  
Author(s):  
Archit Verma ◽  
Barbara E. Engelhardt

Abstract: Modern developments in single cell sequencing technologies enable broad insights into cellular state. Single cell RNA sequencing (scRNA-seq) can be used to explore cell types, states, and developmental trajectories to broaden understanding of cell heterogeneity in tissues and organs. Analysis of these sparse, high-dimensional experimental results requires dimension reduction. Several methods have been developed to estimate low-dimensional embeddings for filtered and normalized single cell data. However, methods have yet to be developed for unfiltered and unnormalized count data. We present a nonlinear latent variable model with robust, heavy-tailed error and adaptive kernel learning to estimate low-dimensional nonlinear structure in scRNA-seq data. Gene expression in a single cell is modeled as a noisy draw from a Gaussian process in high dimensions from low-dimensional latent positions. This model is called the Gaussian process latent variable model (GPLVM). We model residual errors with a heavy-tailed Student’s t-distribution to estimate a manifold that is robust to technical and biological noise. We compare our approach to common dimension reduction tools to highlight our model’s ability to enable important downstream tasks, including clustering and inferring cell developmental trajectories, on available experimental data. We show that our robust nonlinear manifold is well suited for raw, unfiltered gene counts from high throughput sequencing technologies for visualization and exploration of cell states.


2021 ◽  
Author(s):  
Suman Gare ◽  
Soumita Chel ◽  
Manohar Kuruba ◽  
Soumya Jana ◽  
Lopamudra Giri

2021 ◽  
Vol 2 (2) ◽  
pp. 100450
Author(s):  
Bob Chen ◽  
Marisol A. Ramirez-Solano ◽  
Cody N. Heiser ◽  
Qi Liu ◽  
Ken S. Lau

2018 ◽  
Author(s):  
Shibiao Wan ◽  
Junil Kim ◽  
Kyoung Jae Won

Abstract: To process large-scale single-cell RNA-sequencing (scRNA-seq) data effectively without excessive distortion during dimension reduction, we present SHARP, an ensemble random projection-based algorithm that scales to clustering 10 million cells. Comprehensive benchmarking tests on 17 public scRNA-seq datasets demonstrate that SHARP outperforms existing methods in terms of speed and accuracy. In particular, for large datasets (>40,000 cells), SHARP’s running speed far exceeds that of its competitors while maintaining high clustering accuracy and robustness. To the best of our knowledge, SHARP is the only R-based tool that is scalable to clustering scRNA-seq data with 10 million cells.
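The building block named in this abstract, random projection, can be sketched in a few lines. The example below is a generic Gaussian random projection repeated with different seeds to form an ensemble; it is not SHARP's implementation (SHARP is an R tool), and the function names, the Gaussian projection matrix and the seeding scheme are all assumptions made for illustration.

```python
import numpy as np

def random_projection(X, k, seed):
    """Project the rows of X (cells x genes) onto k random directions.

    Uses a Gaussian matrix scaled by 1/sqrt(k), so pairwise distances
    are approximately preserved (Johnson-Lindenstrauss style).
    """
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((X.shape[1], k)) / np.sqrt(k)
    return X @ R

def ensemble_projections(X, k, n_runs):
    """Return n_runs independent projections, one per seed 0..n_runs-1."""
    return [random_projection(X, k, seed) for seed in range(n_runs)]
```

In an ensemble scheme, each projected matrix would typically be clustered separately and the resulting labels combined into a consensus; that consensus step is omitted here because the abstract does not describe it.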

