Single-cell RNA-seq data clustering: A survey with performance comparison study

Clustering analysis has been widely applied to single-cell RNA-sequencing (scRNA-seq) data to discover cell types and cell states. Algorithms developed in recent years have greatly advanced the understanding of cellular heterogeneity and the underlying mechanisms of biological processes. However, these algorithms often use different techniques, were evaluated on different datasets, and were compared with only some of their counterparts, usually using different performance metrics. Consequently, an accurate and complete picture of their merits and demerits is lacking, which makes it difficult for users to select proper algorithms for analyzing their data. To fill this gap, we first review the major existing scRNA-seq data clustering methods, and then conduct a comprehensive performance comparison among them from multiple perspectives. We consider 13 state-of-the-art scRNA-seq data clustering algorithms and collect 12 publicly available real scRNA-seq datasets from existing works to evaluate and compare these algorithms. Our comparative study shows that the existing methods vary widely in performance. Even the top-performing algorithms do not perform well on all datasets, especially those with complex structures. This suggests that further research is required to develop more stable, accurate, and efficient clustering algorithms for scRNA-seq data.

Clustering Deviation Index (CDI): A robust and accurate unsupervised measure for evaluating scRNA-seq data clustering

10.1101/2022.01.03.474840 ◽

2022 ◽

Author(s):

Jiyuan Fang ◽

Cliburn Chan ◽

Kouros Owzar ◽

Liuyang Wang ◽

Diyuan Qin ◽

...

Keyword(s):

Single Cell ◽

Data Clustering ◽

Goodness Of Fit ◽

Cellular Heterogeneity ◽

Clustering Methods ◽

Tuning Parameters ◽

Deviation Index ◽

Cell Clustering ◽

Single Cell Rna Sequencing ◽

Cell Data

Single-cell RNA-sequencing (scRNA-seq) technology allows us to explore cellular heterogeneity in the transcriptome. Because most scRNA-seq data analyses begin with cell clustering, its accuracy considerably impacts the validity of downstream analyses. Although many clustering methods have been developed, few tools are available to evaluate the clustering "goodness-of-fit" to the scRNA-seq data. In this paper, we propose a new Clustering Deviation Index (CDI) that measures the deviation of any clustering label set from the observed single-cell data. We conduct in silico and experimental scRNA-seq studies to show that CDI can select the optimal clustering label set. In particular, CDI can also be used to select the optimal tuning parameters for any given clustering method and the correct number of cluster components.

Robust clustering and interpretation of scRNA-seq data using reference component analysis

10.1101/2021.02.16.431527 ◽

2021 ◽

Author(s):

Florian Schmidt ◽

Bobby Ranjan ◽

Quy Xiao Xuan Lin ◽

Vaidehi Krishnan ◽

Ignasius Joanito ◽

...

Keyword(s):

Single Cell ◽

De Novo ◽

Clustering Algorithms ◽

Cell Types ◽

Unsupervised Clustering ◽

Data Sets ◽

Clustering Methods ◽

Robust Clustering ◽

Supervised Clustering ◽

Downstream Analysis

Motivation: The transcriptomic diversity of the hundreds of cell types in the human body can be analysed in unprecedented detail using single cell (SC) technologies. Though clustering of cellular transcriptomes is the default technique for defining cell types and subtypes, single cell clustering can be strongly influenced by technical variation. In fact, the prevalent unsupervised clustering algorithms can cluster cells by technical, rather than biological, variation. Results: Compared to de novo (unsupervised) clustering methods, we demonstrate using multiple benchmarks that supervised clustering, which uses reference transcriptomes as a guide, is robust to batch effects. To leverage the advantages of supervised clustering, we present RCA2, a new, scalable, and broadly applicable version of our RCA algorithm. RCA2 provides a user-friendly framework for supervised clustering and downstream analysis of large scRNA-seq data sets. RCA2 can be seamlessly incorporated into existing algorithmic pipelines. It incorporates various new reference panels for human and mouse, supports generation of custom panels and uses efficient graph-based clustering and sparse data structures to ensure scalability. We demonstrate the applicability of RCA2 on SC data from human bone marrow, healthy PBMCs and PBMCs from COVID-19 patients. Importantly, RCA2 facilitates cell-type-specific QC, which we show is essential for accurate clustering of SC data from heterogeneous tissues. In the era of cohort-scale SC analysis, supervised clustering methods such as RCA2 will facilitate unified analysis of diverse SC datasets. Availability: RCA2 is implemented in R and is available at github.com/prabhakarlab/RCAv2

Evaluating single-cell cluster stability using the Jaccard similarity index

Bioinformatics ◽

10.1093/bioinformatics/btaa956 ◽

2020 ◽

Author(s):

Ming Tang ◽

Yasin Kaymaz ◽

Brandon L Logeman ◽

Stephen Eichhorn ◽

Zhengzheng S Liang ◽

...

Keyword(s):

Single Cell ◽

Nearest Neighbor ◽

Clustering Algorithms ◽

Similarity Index ◽

Cell Types ◽

R Package ◽

Clustering Methods ◽

K Nearest Neighbor ◽

Jaccard Similarity ◽

Cluster Stability

Motivation: One major goal of single-cell RNA sequencing (scRNAseq) experiments is to identify novel cell types. With increasingly large scRNAseq datasets, unsupervised clustering methods can now produce detailed catalogues of transcriptionally distinct groups of cells in a sample. However, the interpretation of these clusters is challenging for both technical and biological reasons. Popular clustering algorithms are sensitive to parameter choices and can produce different clustering solutions with even small changes in the number of principal components used, the number of nearest neighbors (k) and the resolution parameter, among others. Results: Here, we present a set of tools to evaluate cluster stability by subsampling, which can guide parameter choice and aid in biological interpretation. The R package scclusteval and the accompanying Snakemake workflow implement all steps of the pipeline: subsampling the cells, repeating the clustering with Seurat, estimating cluster stability using the Jaccard similarity index and providing rich visualizations. Availability and implementation: R package scclusteval: https://github.com/crazyhottommy/scclusteval; Snakemake workflow: https://github.com/crazyhottommy/pyflow_seuratv3_parameter; Tutorial: https://crazyhottommy.github.io/EvaluateSingleCellClustering/.
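The subsampling idea in this abstract can be illustrated with a minimal Python sketch. It is not the scclusteval API (that package is written in R); the function names `jaccard` and `cluster_stability`, the dictionary-based input format, and the "best-matching cluster" scoring rule are assumptions made for illustration only.

```python
from typing import Dict, Hashable, List, Set


def jaccard(a: Set[Hashable], b: Set[Hashable]) -> float:
    """Jaccard similarity |a & b| / |a | b|; returns 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_stability(full: Dict[str, int],
                      resampled: List[Dict[str, int]]) -> Dict[int, float]:
    """Mean best-match Jaccard index of each original cluster across subsamples.

    `full` maps cell id -> cluster label on the complete data set.
    Each entry of `resampled` maps cell id -> label for one non-empty subsample
    that was re-clustered independently (labels need not match those in `full`).
    """
    scores: Dict[int, float] = {}
    for label in sorted(set(full.values())):
        per_run = []
        for run in resampled:
            # Original cluster, restricted to the cells present in this subsample.
            original = {cell for cell, lab in full.items() if lab == label and cell in run}
            # Group the subsample's cells by their new cluster label.
            groups: Dict[int, Set[str]] = {}
            for cell, lab in run.items():
                groups.setdefault(lab, set()).add(cell)
            # Score the original cluster against its most similar new cluster.
            per_run.append(max(jaccard(original, g) for g in groups.values()))
        scores[label] = sum(per_run) / len(per_run)
    return scores


full_labels = {"a": 0, "b": 0, "c": 1, "d": 1}
subsamples = [{"a": 5, "b": 5, "c": 7}, {"a": 1, "c": 1, "d": 2}]
stability = cluster_stability(full_labels, subsamples)
```

Under this scoring rule, a cluster whose cells keep landing together in the re-clustered subsamples scores close to 1, while a cluster that is repeatedly split or merged scores lower.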

A completely parameter-free method for graph-based single cell RNA-seq clustering

10.1101/2021.07.15.452521 ◽

2021 ◽

Author(s):

Maryam Zand ◽

Jianhua Ruan

Keyword(s):

Single Cell ◽

Cell Population ◽

Nearest Neighbor ◽

Expression Profiles ◽

Clustering Algorithms ◽

Cell Types ◽

Clustering Methods ◽

K Nearest Neighbor ◽

Synthetic Datasets ◽

Almost All

Single-cell RNA sequencing (scRNAseq) offers an unprecedented potential for scrutinizing complex biological systems at single cell resolution. One of the most important applications of scRNAseq is to cluster cells into groups of similar expression profiles, which allows unsupervised identification of novel cell subtypes. While many clustering algorithms have been tested towards this goal, graph-based algorithms appear to be the most effective, due to their ability to accommodate the sparsity of the data, as well as the complex topology of the cell population. An integral part of almost all such clustering methods is the construction of a k-nearest-neighbor (KNN) network, and the choice of k, implicitly or explicitly, can have a profound impact on the density distribution of the graph and the structure of the resulting clusters, as well as the resolution of clusters that one can successfully identify from the data. In this work, we propose a fairly simple but robust approach to estimate the best k for constructing the KNN graph while simultaneously identifying the optimal clustering structure from the graph. Our method, named scQcut, employs a topology-based criterion to guide the construction of the KNN graph, and then applies an efficient modularity-based community discovery algorithm to predict robust cell clusters. The results obtained from applying scQcut on a large number of real and synthetic datasets demonstrated that scQcut, which does not require any user-tuned parameters, outperformed several popular state-of-the-art clustering methods in terms of clustering accuracy and the ability to correctly identify rare cell types. The promising results indicate that an accurate approximation of the parameter k, which determines the topology of the network, is a crucial element of a successful graph-based clustering method to recover the final community structure of the cell population.
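To make the two ingredients named in this abstract concrete, the sketch below builds a KNN graph for a user-supplied k and scores a given partition with Newman's modularity. It is not scQcut: the topology-based criterion that scQcut uses to choose k is not reproduced here, and the helper names `knn_edges` and `modularity` are hypothetical.

```python
import math
from typing import Dict, Sequence, Set, Tuple


def knn_edges(points: Sequence[Sequence[float]], k: int) -> Set[Tuple[int, int]]:
    """Undirected edges linking every point to its k nearest neighbours (Euclidean)."""
    edges: Set[Tuple[int, int]] = set()
    for i, p in enumerate(points):
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: math.dist(p, points[j]))
        for j in others[:k]:
            edges.add((min(i, j), max(i, j)))  # store each edge once
    return edges


def modularity(edges: Set[Tuple[int, int]], labels: Sequence[int]) -> float:
    """Newman modularity Q = sum_c [ e_c / m - (d_c / 2m)^2 ] for an unweighted graph."""
    m = len(edges)
    intra: Dict[int, int] = {}    # e_c: edges with both ends in community c
    degree: Dict[int, int] = {}   # d_c: summed node degree of community c
    for u, v in edges:
        degree[labels[u]] = degree.get(labels[u], 0) + 1
        degree[labels[v]] = degree.get(labels[v], 0) + 1
        if labels[u] == labels[v]:
            intra[labels[u]] = intra.get(labels[u], 0) + 1
    return sum(intra.get(c, 0) / m - (d / (2 * m)) ** 2 for c, d in degree.items())


# Two well-separated toy groups of three points each.
cells = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
graph = knn_edges(cells, k=2)
q_split = modularity(graph, [0, 0, 0, 1, 1, 1])
q_merged = modularity(graph, [0, 0, 0, 0, 0, 0])
```

Changing k changes which edges exist, and therefore which partitions score well, which is the sensitivity the abstract describes.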

Single-cell transcriptomics following ischemic injury identifies a role for B2M in cardiac repair

Communications Biology ◽

10.1038/s42003-020-01636-3 ◽

2021 ◽

Vol 4 (1) ◽

Author(s):

Bas Molenaar ◽

Louk T. Timmer ◽

Marjolein Droog ◽

Ilaria Perini ◽

Danielle Versteeg ◽

...

Keyword(s):

Single Cell ◽

Communication Networks ◽

Cardiac Remodeling ◽

Cardiac Injury ◽

Ischemic Injury ◽

Cell Types ◽

Repair Process ◽

Cardiac Repair ◽

Cellular Heterogeneity ◽

Intercellular Signaling

The efficiency of the repair process following ischemic cardiac injury is a crucial determinant for the progression into heart failure and is controlled by both intra- and intercellular signaling within the heart. An enhanced understanding of this complex interplay will enable better exploitation of these mechanisms for therapeutic use. We used single-cell transcriptomics to collect gene expression data of all main cardiac cell types at different time-points after ischemic injury. These data unveiled cellular and transcriptional heterogeneity and changes in cellular function during cardiac remodeling. Furthermore, we established potential intercellular communication networks after ischemic injury. Follow-up experiments confirmed that cardiomyocytes express and secrete elevated levels of beta-2 microglobulin in response to ischemic damage, which can activate fibroblasts in a paracrine manner. Collectively, our data indicate phase-specific changes in cellular heterogeneity during different stages of cardiac remodeling and allow for the identification of therapeutic targets relevant for cardiac repair.

Single-cell data clustering based on sparse optimization and low-rank matrix factorization

G3 Genes|Genome|Genetics ◽

10.1093/g3journal/jkab098 ◽

2021 ◽

Author(s):

Yinlei Hu ◽

Bin Li ◽

Falai Chen ◽

Kun Qu

Keyword(s):

Single Cell ◽

Rna Sequencing ◽

Matrix Factorization ◽

Data Clustering ◽

Cell Types ◽

Low Rank ◽

Sequencing Data ◽

Rank Matrix ◽

Single Cell Rna Sequencing ◽

Low Rank Matrix

Unsupervised clustering is a fundamental step of single-cell RNA sequencing data analysis. This need has inspired several clustering methods to classify cells in single-cell RNA sequencing data. However, accurate prediction of the cell clusters remains a substantial challenge. In this study, we propose a new algorithm for single-cell RNA sequencing data clustering based on Sparse Optimization and low-rank matrix factorization (scSO). We applied our scSO algorithm to analyze multiple benchmark datasets and showed that the cluster number predicted by scSO was close to the number of reference cell types and that most cells were correctly classified. Our scSO algorithm is available at https://github.com/QuKunLab/scSO. Overall, this study demonstrates a potent cell clustering approach that can help researchers distinguish cell types in single-cell RNA sequencing data.

Using single-cell cytometry to illustrate integrated multi-perspective evaluation of clustering algorithms using Pareto fronts

Bioinformatics ◽

10.1093/bioinformatics/btab038 ◽

2021 ◽

Author(s):

Givanna H Putri ◽

Irena Koprinska ◽

Thomas M Ashhurst ◽

Nicholas J C King ◽

Mark N Read

Keyword(s):

Single Cell ◽

Performance Metrics ◽

Clustering Algorithms ◽

Latin Hypercube Sampling ◽

Supplementary Information ◽

Sequencing Data ◽

Evaluation Protocol ◽

Benchmark Datasets ◽

Pareto Fronts ◽

Parameter Values

Motivation: Many ‘automated gating’ algorithms now exist to cluster cytometry and single-cell sequencing data into discrete populations. Comparative algorithm evaluations on benchmark datasets rely either on a single performance metric, or a few metrics considered independently of one another. However, single metrics emphasize different aspects of clustering performance and do not rank clustering solutions in the same order. This underlies the lack of consensus between comparative studies regarding optimal clustering algorithms and undermines the translatability of results onto other non-benchmark datasets. Results: We propose the Pareto fronts framework as an integrative evaluation protocol, wherein individual metrics are instead leveraged as complementary perspectives. Judged superior are algorithms that provide the best trade-off between the multiple metrics considered simultaneously. This yields a more comprehensive and complete view of clustering performance. Moreover, by broadly and systematically sampling algorithm parameter values using the Latin Hypercube sampling method, our evaluation protocol minimizes (un)fortunate parameter value selections as confounding factors. Furthermore, it reveals how meticulously each algorithm must be tuned in order to obtain good results, vital knowledge for users with novel data. We exemplify the protocol by conducting a comparative study between three clustering algorithms (ChronoClust, FlowSOM and Phenograph) using four common performance metrics applied across four cytometry benchmark datasets. To our knowledge, this is the first time Pareto fronts have been used to evaluate the performance of clustering algorithms in any application domain. Availability and implementation: Implementation of our Pareto front methodology and all scripts and datasets to reproduce this article are available at https://github.com/ghar1821/ParetoBench. Supplementary information: Supplementary data are available at Bioinformatics online.
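The notion of a Pareto front used in this abstract can be sketched in a few lines of Python. This is a generic illustration, not code from the ParetoBench repository; the function names, the solution names, and the metric values are made up, and every metric is assumed to be "higher is better".

```python
from typing import Dict, List, Tuple


def dominates(a: Tuple[float, ...], b: Tuple[float, ...]) -> bool:
    """True if `a` is at least as good as `b` on every metric and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(solutions: Dict[str, Tuple[float, ...]]) -> List[str]:
    """Names of the solutions that no other solution dominates."""
    return [
        name
        for name, scores in solutions.items()
        if not any(dominates(other, scores)
                   for other_name, other in solutions.items()
                   if other_name != name)
    ]


# Hypothetical (metric_1, metric_2) scores for three clustering runs.
runs = {
    "run_A": (0.90, 0.50),
    "run_B": (0.60, 0.80),
    "run_C": (0.50, 0.40),
}
front = pareto_front(runs)
```

Here `run_A` and `run_B` each win on one metric, so both stay on the front, while `run_C` is worse than `run_A` on both metrics and is dropped. Ranking by either metric alone would hide this trade-off.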

Selecting single cell clustering parameter values using subsampling-based robustness metrics

BMC Bioinformatics ◽

10.1186/s12859-021-03957-4 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Ryan B. Patterson-Cross ◽

Ariel J. Levine ◽

Vilas Menon

Keyword(s):

Single Cell ◽

Optimal Parameter ◽

Clustering Algorithms ◽

Cell Types ◽

Parameter Selection ◽

Data Set ◽

Biologically Relevant ◽

Cell Clustering ◽

Parameter Values ◽

Robustness Metrics

Background: Generating and analysing single-cell data has become a widespread approach to examine tissue heterogeneity, and numerous algorithms exist for clustering these datasets to identify putative cell types with shared transcriptomic signatures. However, many of these clustering workflows rely on user-tuned parameter values, tailored to each dataset, to identify a set of biologically relevant clusters. Whereas users often develop their own intuition as to the optimal range of parameters for clustering on each data set, the lack of systematic approaches to identify this range can be daunting to new users of any given workflow. In addition, an optimal parameter set does not guarantee that all clusters are equally well-resolved, given the heterogeneity in transcriptomic signatures in most biological systems. Results: Here, we illustrate a subsampling-based approach (chooseR) that simultaneously guides parameter selection and characterizes cluster robustness. Through bootstrapped iterative clustering across a range of parameters, chooseR was used to select parameter values for two distinct clustering workflows (Seurat and scVI). In each case, chooseR identified parameters that produced biologically relevant clusters from both well-characterized (human PBMC) and complex (mouse spinal cord) datasets. Moreover, it provided a simple “robustness score” for each of these clusters, facilitating the assessment of cluster quality. Conclusion: chooseR is a simple, conceptually understandable tool that can be used flexibly across clustering algorithms, workflows, and datasets to guide clustering parameter selection and characterize cluster robustness.
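One simple quantity that repeated clustering of subsamples makes available is how often two cells end up in the same cluster. The Python sketch below computes that co-clustering frequency. It is a generic illustration and makes no claim to reproduce chooseR's actual robustness score; the function name and input format are assumptions.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def coclustering_frequency(runs: List[Dict[str, int]]) -> Dict[Tuple[str, str], float]:
    """For each pair of cells, the fraction of runs containing both cells
    in which the two cells received the same cluster label."""
    sampled: Dict[Tuple[str, str], int] = {}   # runs in which both cells were drawn
    together: Dict[Tuple[str, str], int] = {}  # of those, runs where labels matched
    for run in runs:
        for a, b in combinations(sorted(run), 2):
            sampled[(a, b)] = sampled.get((a, b), 0) + 1
            if run[a] == run[b]:
                together[(a, b)] = together.get((a, b), 0) + 1
    return {pair: together.get(pair, 0) / n for pair, n in sampled.items()}


# Three hypothetical re-clusterings of subsamples (cell id -> cluster label).
bootstraps = [
    {"a": 0, "b": 0, "c": 1},
    {"a": 1, "b": 1},
    {"a": 0, "b": 1, "c": 1},
]
freq = coclustering_frequency(bootstraps)
```

Pairs with frequencies near 1 or 0 are assigned consistently across runs; pairs near 0.5 indicate an unstable boundary at the parameter setting being tested.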

Dissecting Cellular Heterogeneity Based on Network Denoising of scRNA-seq Using Local Scaling Self-Diffusion

Frontiers in Genetics ◽

10.3389/fgene.2021.811043 ◽

2022 ◽

Vol 12 ◽

Author(s):

Xin Duan ◽

Wei Wang ◽

Minghui Tang ◽

Feng Gao ◽

Xudong Lin

Keyword(s):

Metric Learning ◽

Cell Types ◽

Primary Objective ◽

The Self ◽

Cellular Heterogeneity ◽

Clustering Methods ◽

Local Scaling ◽

Self Diffusion ◽

Cell Clustering ◽

High Level

Identifying the phenotypes and interactions of various cells is the primary objective in cellular heterogeneity dissection. A key step of this methodology is to perform unsupervised clustering, which, however, often suffers from high levels of noise and redundant information. To overcome these limitations, we proposed self-diffusion on local scaling affinity (LSSD) to enhance the metric learning of cell similarities for dissecting cellular heterogeneity. Local scaling provides self-tuning of the cell-to-cell distances that are used to construct the cell affinity. Our approach implements the self-diffusion process by propagating the affinity matrices to further improve the cell similarities for the downstream clustering analysis. To demonstrate its effectiveness and usefulness, we applied LSSD to two simulated and four real scRNA-seq datasets. Compared with other single-cell clustering methods, our approach demonstrates much better clustering performance, and the cell types identified in colorectal tumors show strong biological interpretability.
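A common form of locally scaled affinity is the self-tuning kernel A_ij = exp(-d_ij^2 / (sigma_i * sigma_j)), where sigma_i is the distance from point i to its k-th nearest neighbour. The Python sketch below implements that generic kernel. Whether LSSD uses exactly this form is an assumption, the self-diffusion step is not shown, and the function name is hypothetical. The sketch also assumes no duplicate points, since a zero sigma would cause a division by zero.

```python
import math
from typing import List, Sequence


def local_scaling_affinity(points: Sequence[Sequence[float]], k: int) -> List[List[float]]:
    """Affinity matrix with a per-point scale set by the k-th nearest neighbour."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # sigma[i]: distance from point i to its k-th nearest neighbour, excluding itself.
    sigma = [sorted(row[:i] + row[i + 1:])[k - 1] for i, row in enumerate(dist)]
    return [
        [0.0 if i == j else math.exp(-dist[i][j] ** 2 / (sigma[i] * sigma[j]))
         for j in range(n)]
        for i in range(n)
    ]


# Three points on a line; the third sits in a sparser region and gets a larger scale.
affinity = local_scaling_affinity([(0.0,), (1.0,), (3.0,)], k=1)
```

Because each distance is divided by the product of the two local scales, points in sparse regions are not penalised as heavily as they would be under a single global bandwidth.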

Single-cell analysis reveals cellular heterogeneity and molecular determinants of hypothalamic leptin-receptor cells

10.1101/2020.07.23.217729 ◽

2020 ◽

Author(s):

N. Kakava-Georgiadou ◽

J.F. Severens ◽

A.M. Jørgensen ◽

K.M. Garner ◽

M.C.M Luijendijk ◽

...

Keyword(s):

Single Cell ◽

Leptin Receptor ◽

Single Cell Analysis ◽

Cell Types ◽

Cellular Heterogeneity ◽

Molecular Signature ◽

Neuronal Populations ◽

Hypothalamic Nuclei ◽

Satiety Hormone ◽

Multiple Cell

Hypothalamic nuclei which regulate homeostatic functions express leptin receptor (LepR), the primary target of the satiety hormone leptin. Single-cell RNA sequencing (scRNA-seq) has facilitated the discovery of a variety of hypothalamic cell types. However, low abundance of LepR transcripts prevented further characterization of LepR cells. Therefore, we perform scRNA-seq on isolated LepR cells and identify eight neuronal clusters, including three uncharacterized Trh-expressing populations as well as 17 non-neuronal populations including tanycytes, oligodendrocytes and endothelial cells. Food restriction had a major impact on Agrp neurons and changed the expression of obesity-associated genes. Multiple cell clusters were enriched for GWAS signals of obesity. We further explored changes in the gene regulatory landscape of LepR cell types. We thus reveal the molecular signature of distinct populations with diverse neurochemical profiles, which will aid efforts to illuminate the multi-functional nature of leptin’s action in the hypothalamus.
