Self-assembling Manifolds in Single-cell RNA Sequencing Data

Abstract Single-cell RNA sequencing has spurred the development of computational methods that enable researchers to classify cell types, delineate developmental trajectories, and measure molecular responses to external perturbations. Many of these technologies rely on their ability to detect genes whose cell-to-cell variations arise from the biological processes of interest rather than transcriptional or technical noise. However, for datasets in which the biologically relevant differences between cells are subtle, identifying these genes is a challenging task. We present the self-assembling manifold (SAM) algorithm, an iterative soft feature selection strategy to quantify gene relevance and improve dimensionality reduction. We demonstrate its advantages over other state-of-the-art methods with experimental validation in identifying novel stem cell populations of Schistosoma, a prevalent parasite that infects hundreds of millions of people. Extending our analysis to a total of 56 datasets, we show that SAM is generalizable and consistently outperforms other methods in a variety of biological and quantitative benchmarks.

Self-assembling manifolds in single-cell RNA sequencing data

eLife ◽ 10.7554/elife.48994 ◽ 2019 ◽ Vol 8 ◽ Cited By ~ 8

Author(s): Alexander J Tarashansky ◽ Yuan Xue ◽ Pengyang Li ◽ Stephen R Quake ◽ Bo Wang

Keyword(s): Single Cell ◽ RNA Sequencing ◽ Developmental Trajectories ◽ Cell Types ◽ Selection Strategy ◽ Sequencing Data ◽ Biologically Relevant ◽ Self Assembling ◽ Single Cell RNA Sequencing ◽ Stem Cell Populations

Single-cell RNA sequencing has spurred the development of computational methods that enable researchers to classify cell types, delineate developmental trajectories, and measure molecular responses to external perturbations. Many of these technologies rely on their ability to detect genes whose cell-to-cell variations arise from the biological processes of interest rather than transcriptional or technical noise. However, for datasets in which the biologically relevant differences between cells are subtle, identifying these genes is challenging. We present the self-assembling manifold (SAM) algorithm, an iterative soft feature selection strategy to quantify gene relevance and improve dimensionality reduction. We demonstrate its advantages over other state-of-the-art methods with experimental validation in identifying novel stem cell populations of Schistosoma mansoni, a prevalent parasite that infects hundreds of millions of people. Extending our analysis to a total of 56 datasets, we show that SAM is generalizable and consistently outperforms other methods in a variety of biological and quantitative benchmarks.

IKAP—Identifying K mAjor cell Population groups in single-cell RNA-sequencing analysis

GigaScience ◽ 10.1093/gigascience/giz121 ◽ 2019 ◽ Vol 8 (10) ◽ Cited By ~ 2

Author(s): Yun-Ching Chen ◽ Abhilash Suresh ◽ Chingiz Underbayev ◽ Clare Sun ◽ Komudi Singh ◽ ...

Keyword(s): Single Cell ◽ RNA Sequencing ◽ Cell Types ◽ Sequencing Analysis ◽ Sequencing Data ◽ Peripheral Blood Mononuclear ◽ Biologically Relevant ◽ Single Cell RNA Sequencing ◽ Cell Groups ◽ Cell Ontology

Abstract Background: In single-cell RNA-sequencing analysis, clustering cells into groups and differentiating cell groups by differentially expressed (DE) genes are 2 separate steps for investigating cell identity. However, the ability to differentiate between cell groups could be affected by clustering. This interdependency often creates a bottleneck in the analysis pipeline, requiring researchers to repeat these 2 steps multiple times by setting different clustering parameters to identify a set of cell groups that are more differentiated and biologically relevant. Findings: To accelerate this process, we have developed IKAP, an algorithm to identify major cell groups and improve differentiating cell groups by systematically tuning parameters for clustering. We demonstrate that, with default parameters, IKAP successfully identifies major cell types such as T cells, B cells, natural killer cells, and monocytes in 2 peripheral blood mononuclear cell datasets and recovers major cell types in a previously published mouse cortex dataset. These major cell groups identified by IKAP present more distinguishing DE genes compared with cell groups generated by different combinations of clustering parameters. We further show that cell subtypes can be identified by recursively applying IKAP within identified major cell types, thereby delineating cell identities in a multi-layered ontology. Conclusions: By tuning the clustering parameters to identify major cell groups, IKAP greatly improves the automation of single-cell RNA-sequencing analysis to produce distinguishing DE genes and refine cell ontology using single-cell RNA-sequencing data.

FIRM: Fast Integration of single-cell RNA-sequencing data across Multiple platforms

10.1101/2020.06.02.129031 ◽ 2020

Author(s): Jingsi Ming ◽ Zhixiang Lin ◽ Xiang Wan ◽ Can Yang ◽ Angela Ruohao Wu

Keyword(s): Single Cell ◽ RNA Sequencing ◽ Cell Biology ◽ Developmental Trajectories ◽ Expression Patterns ◽ Cell Types ◽ Original Structure ◽ Cell Type ◽ Sequencing Data ◽ Single Cell RNA Sequencing

Abstract Single-cell RNA-sequencing (scRNA-seq) has now been used extensively to discover novel cell types and reconstruct developmental trajectories by measuring mRNA expression patterns of individual cells. However, datasets collected using different scRNA-seq technology platforms, including the popular SMART-Seq2 (SS2) and 10X platforms, are difficult to compare because of their heterogeneity. Each platform has unique advantages, and integration of these datasets would provide deeper insights into cell biology and gene regulation. Through comprehensive data exploration, we found that accurate integration is often hampered by differences in cell-type compositions. Herein we describe FIRM, an algorithm that addresses this problem and achieves efficient and accurate integration of heterogeneous scRNA-seq datasets across multiple platforms. We applied FIRM to numerous scRNA-seq datasets generated using SS2 and 10X from mouse, mouse lemur, and human, comparing its performance in dataset integration with other state-of-the-art methods. The integrated datasets generated using FIRM show accurate mixing of shared cell type identities and superior preservation of original structure for each dataset. FIRM not only generates robust integrated datasets for downstream analysis, but is also a facile way to transfer cell type labels and annotations from one dataset to another, making it a versatile and indispensable tool for scRNA-seq analysis.

Single-cell data clustering based on sparse optimization and low-rank matrix factorization

G3 Genes|Genome|Genetics ◽ 10.1093/g3journal/jkab098 ◽ 2021

Author(s): Yinlei Hu ◽ Bin Li ◽ Falai Chen ◽ Kun Qu

Keyword(s): Single Cell ◽ RNA Sequencing ◽ Matrix Factorization ◽ Data Clustering ◽ Cell Types ◽ Low Rank ◽ Sequencing Data ◽ Rank Matrix ◽ Single Cell RNA Sequencing ◽ Low Rank Matrix

Abstract Unsupervised clustering is a fundamental step of single-cell RNA sequencing data analysis. This issue has inspired several clustering methods to classify cells in single-cell RNA sequencing data. However, accurate prediction of the cell clusters remains a substantial challenge. In this study, we propose a new algorithm for single-cell RNA sequencing data clustering based on Sparse Optimization and low-rank matrix factorization (scSO). We applied our scSO algorithm to analyze multiple benchmark datasets and showed that the cluster number predicted by scSO was close to the number of reference cell types and that most cells were correctly classified. Our scSO algorithm is available at https://github.com/QuKunLab/scSO. Overall, this study demonstrates a potent cell clustering approach that can help researchers distinguish cell types in single-cell RNA sequencing data.

Evaluation of single-cell classifiers for single-cell RNA sequencing data sets

Briefings in Bioinformatics ◽ 10.1093/bib/bbz096 ◽ 2019 ◽ Vol 21 (5) ◽ pp. 1581-1595 ◽ Cited By ~ 6

Author(s): Xinlei Zhao ◽ Shuang Wu ◽ Nan Fang ◽ Xiao Sun ◽ Jue Fan

Keyword(s): Single Cell ◽ RNA Sequencing ◽ Reference Data ◽ Predictive Accuracy ◽ Cell Types ◽ Superior Performance ◽ Marker Genes ◽ Data Sets ◽ Sequencing Data ◽ Single Cell RNA Sequencing

Abstract Single-cell RNA sequencing (scRNA-seq) has been rapidly developing and widely applied in biological and medical research. Identification of cell types in scRNA-seq data sets is an essential step before in-depth investigations of their functional and pathological roles. However, the conventional workflow based on clustering and marker genes is not scalable for an increasingly large number of scRNA-seq data sets due to complicated procedures and manual annotation. Therefore, a number of tools have been developed recently to predict cell types in new data sets using reference data sets. These methods have not been generally adopted due to a lack of tool benchmarking and user guidance. In this article, we performed a comprehensive and impartial evaluation of nine classification software tools specifically designed for scRNA-seq data sets. Results showed that Seurat based on random forest, SingleR based on correlation analysis and CaSTLe based on XGBoost performed better than others. A simple ensemble voting of all tools can improve the predictive accuracy. Under nonideal situations, such as small-sized and class-imbalanced reference data sets, tools based on cluster-level similarities have superior performance. However, even with the function of assigning ‘unassigned’ labels, it is still challenging to catch novel cell types by solely using any of the single-cell classifiers. This article provides a guideline for researchers to select and apply suitable classification tools in their analysis workflows and sheds some light on potential directions for future improvement of classification tools.
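
The "simple ensemble voting" mentioned in the abstract above can be illustrated with a per-cell majority vote. This is a minimal sketch, not the evaluation code from the article: the function name `majority_vote`, the tool names, and the rule that ties become 'unassigned' are all assumptions made for illustration.

```python
from collections import Counter


def majority_vote(predictions, unassigned="unassigned"):
    """Combine per-cell labels from several classifiers by majority vote.

    predictions: dict mapping a (hypothetical) tool name to a list of
    predicted labels, one per cell, in the same cell order for every tool.
    Ties for the top label are reported as `unassigned` (an assumption).
    """
    tools = list(predictions)
    n_cells = len(predictions[tools[0]])
    consensus = []
    for i in range(n_cells):
        votes = Counter(predictions[tool][i] for tool in tools)
        (top_label, top_count), *rest = votes.most_common()
        if rest and rest[0][1] == top_count:
            consensus.append(unassigned)  # no single winner
        else:
            consensus.append(top_label)
    return consensus


example = {
    "tool_a": ["T cell", "B cell", "NK cell"],
    "tool_b": ["T cell", "B cell", "B cell"],
    "tool_c": ["T cell", "NK cell", "T cell"],
}
print(majority_vote(example))  # ['T cell', 'B cell', 'unassigned']
```

Treating ties as 'unassigned' is one conservative choice; weighting votes by each tool's benchmark accuracy would be another.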

Splatter: simulation of single-cell RNA sequencing data

10.1101/133173 ◽ 2017 ◽ Cited By ~ 8

Author(s): Luke Zappia ◽ Belinda Phipson ◽ Alicia Oshlack

Keyword(s): Single Cell ◽ RNA Sequencing ◽ Real Data ◽ Cell Types ◽ RNA Seq ◽ Sequencing Data ◽ Sequencing Technologies ◽ Simulation Based ◽ Single Cell RNA Sequencing ◽ Multiple Cell

Abstract As single-cell RNA sequencing technologies have rapidly developed, so have analysis methods. Many methods have been tested, developed and validated using simulated datasets. Unfortunately, current simulations are often poorly documented, their similarity to real data is not demonstrated, or reproducible code is not available. Here we present the Splatter Bioconductor package for simple, reproducible and well-documented simulation of single-cell RNA-seq data. Splatter provides an interface to multiple simulation methods including Splat, our own simulation, based on a gamma-Poisson distribution. Splat can simulate single populations of cells, populations with multiple cell types or differentiation paths.
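
To make the gamma-Poisson idea concrete, the sketch below draws a count matrix for a single population of cells: gene means come from a gamma distribution, each cell's expected expression is a gamma draw around its gene mean, and the observed counts are Poisson. It is a deliberately reduced illustration in Python, not the Splatter package or its API; the function name `simulate_counts` and all parameter names and default values are assumptions.

```python
import numpy as np


def simulate_counts(n_cells, n_genes, mean_shape=0.6, mean_rate=0.3,
                    dispersion=0.1, seed=0):
    """Simulate a cells-by-genes count matrix from a gamma-Poisson mixture.

    All parameter names and defaults are hypothetical, chosen for illustration.
    """
    rng = np.random.default_rng(seed)

    # Mean expression of each gene, drawn from a gamma distribution.
    gene_means = rng.gamma(shape=mean_shape, scale=1.0 / mean_rate,
                           size=n_genes)

    # Cell-specific expected expression: gamma with mean gene_means and
    # variance dispersion * gene_means**2 (scale broadcasts across cells).
    cell_means = rng.gamma(shape=1.0 / dispersion,
                           scale=gene_means * dispersion,
                           size=(n_cells, n_genes))

    # Observed counts: Poisson sampling around the cell-specific means.
    return rng.poisson(cell_means)


counts = simulate_counts(n_cells=100, n_genes=500)
print(counts.shape)  # (100, 500)
```

Marginally, each count then follows a negative binomial distribution with variance `mu + dispersion * mu**2`, which is why the gamma-Poisson mixture is a common choice for overdispersed count data.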

Distinguishing cells from empty droplets in droplet-based single-cell RNA sequencing data

10.1101/234872 ◽ 2018 ◽ Cited By ~ 7

Author(s): Aaron T. L. Lun ◽ Samantha Riesenfeld ◽ Tallulah Andrews ◽ Tomas Gomes ◽ John C. Marioni ◽ ...

Keyword(s): Single Cell ◽ RNA Sequencing ◽ Real Data ◽ Cell Types ◽ Sequencing Data ◽ Minimum Threshold ◽ False Discovery ◽ Distinct Cell ◽ Single Cell RNA Sequencing ◽ Unique Molecular Identifier

Abstract Droplet-based single-cell RNA sequencing protocols have dramatically increased the throughput and efficiency of single-cell transcriptomics studies. A key computational challenge when processing these data is to distinguish libraries for real cells from empty droplets. Existing methods for cell calling set a minimum threshold on the total unique molecular identifier (UMI) count for each library, which indiscriminately discards cell libraries with low UMI counts. Here, we describe a new statistical method for calling cells from droplet-based data, based on detecting significant deviations from the expression profile of the ambient solution. Using simulations, we demonstrate that our method has greater power than existing approaches for detecting cell libraries with low UMI counts, while controlling the false discovery rate among detected cells. We also apply our method to real data, where we show that the use of our method results in the retention of distinct cell types that would otherwise have been discarded.
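
The contrast drawn in the abstract above, a hard UMI threshold versus a test for deviation from the ambient profile, can be sketched as follows. This is an illustrative assumption rather than the authors' method: the ambient profile is estimated by pooling barcodes at or below a hypothetical `ambient_max_umi`, and deviation is scored with a plain multinomial Monte Carlo test. The names `threshold_calls` and `ambient_pvalues` are invented for this sketch, and false discovery rate control across barcodes is not shown.

```python
import numpy as np


def threshold_calls(counts, min_umi):
    """Baseline: keep barcodes whose total UMI count reaches min_umi."""
    return np.asarray(counts).sum(axis=1) >= min_umi


def ambient_pvalues(counts, ambient_max_umi, n_sim=1000, seed=0):
    """Monte Carlo p-value per barcode for deviation from the ambient profile.

    counts: barcodes-by-genes integer matrix. Barcodes with total UMIs
    <= ambient_max_umi are pooled to estimate the ambient profile
    (a simplifying assumption made for this sketch).
    """
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts, dtype=np.int64)
    totals = counts.sum(axis=1)

    # Ambient gene proportions, with a pseudocount so no gene has zero mass.
    ambient = counts[totals <= ambient_max_umi].sum(axis=0) + 1.0
    p = ambient / ambient.sum()
    log_p = np.log(p)

    # Lookup table of log(k!) for k = 0 .. largest barcode total.
    max_total = int(totals.max())
    log_fact = np.concatenate(
        [[0.0], np.cumsum(np.log(np.arange(1, max_total + 1)))])

    def loglik(x):
        # Multinomial log-likelihood, dropping the log(total!) term, which
        # is constant when comparing profiles with the same total.
        return (x * log_p - log_fact[x]).sum(axis=-1)

    pvals = np.ones(counts.shape[0])
    for i in range(counts.shape[0]):
        total = int(totals[i])
        if total == 0:
            continue
        sims = rng.multinomial(total, p, size=n_sim)
        # Simulated ambient profiles at least as unlikely as the observed one.
        n_extreme = np.sum(loglik(sims) <= loglik(counts[i]) + 1e-9)
        pvals[i] = (1 + n_extreme) / (n_sim + 1)
    return pvals
```

In this toy setting, a barcode with only 30 UMIs would be discarded by a threshold of, say, 100, yet can still receive a small p-value if its counts fall on genes that are rare in the ambient pool.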

scConsensus: combining supervised and unsupervised clustering for cell type identification in single-cell RNA sequencing data

BMC Bioinformatics ◽ 10.1186/s12859-021-04028-4 ◽ 2021 ◽ Vol 22 (1)

Author(s): Bobby Ranjan ◽ Florian Schmidt ◽ Wenjie Sun ◽ Jinyu Park ◽ Mohammad Amin Honardoost ◽ ...

Keyword(s): Single Cell ◽ RNA Sequencing ◽ Differentially Expressed Genes ◽ Cell Types ◽ Unsupervised Clustering ◽ Differentially Expressed ◽ Consensus Clustering ◽ Cell Type ◽ Sequencing Data ◽ Single Cell RNA Sequencing

Abstract Background: Clustering is a crucial step in the analysis of single-cell data. Clusters identified in an unsupervised manner are typically annotated to cell types based on differentially expressed genes. In contrast, supervised methods use a reference panel of labelled transcriptomes to guide both clustering and cell type identification. Supervised and unsupervised clustering approaches have their distinct advantages and limitations. Therefore, they can lead to different but often complementary clustering results. Hence, a consensus approach leveraging the merits of both clustering paradigms could result in a more accurate clustering and a more precise cell type annotation. Results: We present scConsensus, an R framework for generating a consensus clustering by (1) integrating results from both unsupervised and supervised approaches and (2) refining the consensus clusters using differentially expressed genes. The value of our approach is demonstrated on several existing single-cell RNA sequencing datasets, including data from sorted PBMC sub-populations. Conclusions: scConsensus combines the merits of unsupervised and supervised approaches to partition cells with better cluster separation and homogeneity, thereby increasing our confidence in detecting distinct cell types. scConsensus is implemented in R and is freely available on GitHub at https://github.com/prabhakarlab/scConsensus.
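
One way to picture step (1), integrating an unsupervised and a supervised labelling, is to cross-tabulate the two and keep a supervised label inside an unsupervised cluster only when it covers enough of that cluster. The sketch below is a hypothetical illustration in Python, not the scConsensus R implementation: the function name `consensus_labels`, the `min_frac` cutoff, and the fallback to the dominant label are assumptions, and the DE-gene refinement of step (2) is not shown.

```python
from collections import Counter, defaultdict


def consensus_labels(unsup, sup, min_frac=0.1):
    """Merge unsupervised cluster ids with supervised cell-type labels.

    unsup, sup: equal-length sequences with one entry per cell.
    A cell keeps its supervised label within its cluster if that label
    covers at least min_frac of the cluster; otherwise it takes the
    cluster's dominant supervised label (an illustrative rule).
    """
    # Contingency table: supervised label counts inside each cluster.
    table = defaultdict(Counter)
    for cluster, label in zip(unsup, sup):
        table[cluster][label] += 1

    merged = []
    for cluster, label in zip(unsup, sup):
        size = sum(table[cluster].values())
        if table[cluster][label] / size >= min_frac:
            merged.append(f"{label}|{cluster}")
        else:
            dominant = table[cluster].most_common(1)[0][0]
            merged.append(f"{dominant}|{cluster}")
    return merged


unsup = ["c1"] * 20 + ["c2"] * 10
sup = ["T"] * 19 + ["B"] + ["B"] * 5 + ["NK"] * 5
print(sorted(set(consensus_labels(unsup, sup))))  # ['B|c2', 'NK|c2', 'T|c1']
```

Here the single "B" cell in cluster c1 (5% of the cluster) is absorbed into "T|c1", while cluster c2 is split into "B|c2" and "NK|c2" because both labels are well represented.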

Characterizing the replicability of cell types defined by single cell RNA-sequencing data using MetaNeighbor

Nature Communications ◽ 10.1038/s41467-018-03282-0 ◽ 2018 ◽ Vol 9 (1) ◽ Cited By ~ 87

Author(s): Megan Crow ◽ Anirban Paul ◽ Sara Ballouz ◽ Z. Josh Huang ◽ Jesse Gillis

Keyword(s): Single Cell ◽ RNA Sequencing ◽ Cell Types ◽ Sequencing Data ◽ Single Cell RNA Sequencing

scConsensus: combining supervised and unsupervised clustering for cell type identification in single-cell RNA sequencing data

10.1101/2020.04.22.056473 ◽ 2020

Author(s): Bobby Ranjan ◽ Florian Schmidt ◽ Wenjie Sun ◽ Jinyu Park ◽ Mohammad Amin Honardoost ◽ ...

Keyword(s): Single Cell ◽ RNA Sequencing ◽ Cell Types ◽ Unsupervised Clustering ◽ Differentially Expressed ◽ Consensus Clustering ◽ Cell Type ◽ Sequencing Data ◽ Single Cell RNA Sequencing ◽ Data Clusters

Clustering is a crucial step in the analysis of single-cell data. Clusters identified using unsupervised clustering are typically annotated to cell types based on differentially expressed genes. In contrast, supervised methods use a reference panel of labelled transcriptomes to guide both clustering and cell type identification. Supervised and unsupervised clustering strategies have their distinct advantages and limitations. Therefore, they can lead to different but often complementary clustering results. Hence, a consensus approach leveraging the merits of both clustering paradigms could result in a more accurate clustering and a more precise cell type annotation. We present scConsensus, an R framework for generating a consensus clustering by (i) integrating the results from both unsupervised and supervised approaches and (ii) refining the consensus clusters using differentially expressed (DE) genes. The value of our approach is demonstrated on several existing single-cell RNA sequencing datasets, including data from sorted PBMC sub-populations. scConsensus is freely available on GitHub at https://github.com/prabhakarlab/scConsensus.
