Clustering Deviation Index (CDI): A robust and accurate unsupervised measure for evaluating scRNA-seq data clustering

2022 ◽  
Author(s):  
Jiyuan Fang ◽  
Cliburn Chan ◽  
Kouros Owzar ◽  
Liuyang Wang ◽  
Diyuan Qin ◽  
...  

Single-cell RNA-sequencing (scRNA-seq) technology allows us to explore cellular heterogeneity in the transcriptome. Because most scRNA-seq data analyses begin with cell clustering, its accuracy considerably impacts the validity of downstream analyses. Although many clustering methods have been developed, few tools are available to evaluate the clustering "goodness-of-fit" to the scRNA-seq data. In this paper, we propose a new Clustering Deviation Index (CDI) that measures the deviation of any clustering label set from the observed single-cell data. We conduct in silico and experimental scRNA-seq studies to show that CDI can select the optimal clustering label set. Particularly, CDI also informs the optimal tuning parameters for any given clustering method and the correct number of cluster components.

2020 ◽  
Vol 18 (04) ◽  
pp. 2040005
Author(s):  
Ruiyi Li ◽  
Jihong Guan ◽  
Shuigeng Zhou

Clustering analysis has been widely applied to single-cell RNA-sequencing (scRNA-seq) data to discover cell types and cell states. Algorithms developed in recent years have greatly helped the understanding of cellular heterogeneity and the underlying mechanisms of biological processes. However, these algorithms often use different techniques, were evaluated on different datasets, and were usually compared with only some of their counterparts using different performance metrics. Consequently, an accurate and complete picture of their merits and demerits is lacking, which makes it difficult for users to select proper algorithms for analyzing their data. To fill this gap, we first review the major existing scRNA-seq data clustering methods, and then conduct a comprehensive performance comparison among them from multiple perspectives. We consider 13 state-of-the-art scRNA-seq data clustering algorithms, and collect 12 publicly available real scRNA-seq datasets from existing works to evaluate and compare these algorithms. Our comparative study shows that the existing methods are very diverse in performance. Even the top-performing algorithms do not perform well on all datasets, especially those with complex structures. This suggests that further research is required to explore more stable, accurate, and efficient clustering algorithms for scRNA-seq data.


2021 ◽  
Vol 17 (1) ◽  
pp. e1008625
Author(s):  
Stephanie C. Hicks ◽  
Ruoxi Liu ◽  
Yuwei Ni ◽  
Elizabeth Purdom ◽  
Davide Risso

Single-cell RNA-Sequencing (scRNA-seq) is the most widely used high-throughput technology to measure genome-wide gene expression at the single-cell level. One of the most common analyses of scRNA-seq data detects distinct subpopulations of cells through the use of unsupervised clustering algorithms. However, recent advances in scRNA-seq technologies result in current datasets ranging from thousands to millions of cells. Popular clustering algorithms, such as k-means, typically require the data to be loaded entirely into memory and therefore can be slow or impossible to run with large datasets. To address this problem, we developed the mbkmeans R/Bioconductor package, an open-source implementation of the mini-batch k-means algorithm. Our package allows for on-disk data representations, such as the common HDF5 file format widely used for single-cell data, that do not require all the data to be loaded into memory at one time. We demonstrate the performance of the mbkmeans package using large datasets, including one with 1.3 million cells. We also highlight and compare the computing performance of mbkmeans against the standard implementation of k-means and other popular single-cell clustering methods. Our software package is available in Bioconductor at https://bioconductor.org/packages/mbkmeans.
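The mini-batch idea described above can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the mbkmeans package's API: the function names (`mini_batch_kmeans`, `chunked`), the toy two-dimensional "cells", and the fixed initial centers are all assumptions made for the example. It also streams the data in sequential chunks, whereas mini-batch k-means is usually described with randomly sampled batches. The point it illustrates is that only one chunk needs to be in memory at a time, because each center is updated as a running mean.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(centers, point):
    """Index of the center closest to the given point."""
    return min(range(len(centers)), key=lambda j: squared_distance(centers[j], point))


def mini_batch_kmeans(batches, init_centers):
    """Update centers one batch at a time (hypothetical helper, not mbkmeans itself).

    `batches` is any iterable of point lists, so each batch could be read
    from disk separately instead of holding the full matrix in memory.
    """
    centers = [list(c) for c in init_centers]
    counts = [0] * len(centers)
    for batch in batches:
        # Assign every point in the batch using the centers as they stand now.
        assignments = [nearest(centers, p) for p in batch]
        for point, j in zip(batch, assignments):
            counts[j] += 1
            eta = 1.0 / counts[j]  # per-center learning rate
            centers[j] = [(1 - eta) * c + eta * x for c, x in zip(centers[j], point)]
    return centers


def chunked(points, size):
    """Yield consecutive chunks, standing in for on-disk block reads."""
    for start in range(0, len(points), size):
        yield points[start:start + size]


# Toy data: two well-separated groups of simulated "cells".
rng = random.Random(0)
cells = [(rng.gauss(0, 0.5), rng.gauss(0, 0.5)) for _ in range(200)]
cells += [(rng.gauss(5, 0.5), rng.gauss(5, 0.5)) for _ in range(200)]
rng.shuffle(cells)

centers = mini_batch_kmeans(chunked(cells, 50), init_centers=[(1.0, 1.0), (4.0, 4.0)])
labels = [nearest(centers, c) for c in cells]
```

With the learning rate set to one over the number of points a center has absorbed, each center is the running mean of its assigned points, so no batch has to be revisited.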


2021 ◽  
Vol 11 ◽  
Author(s):  
Jujuan Zhuang ◽  
Changjing Ren ◽  
Dan Ren ◽  
Yu’ang Li ◽  
Danyang Liu ◽  
...  

Critical in revealing cell heterogeneity and identifying new cell subtypes, cell clustering based on single-cell RNA sequencing (scRNA-seq) is challenging. Due to the high noise, sparsity, and poor annotation of scRNA-seq data, existing state-of-the-art cell clustering methods usually ignore gene functions and gene interactions. In this study, we propose a feature extraction method, named FEGFS, to analyze scRNA-seq data, taking advantage of known gene functions. Specifically, we first derive the functional gene sets based on Gene Ontology (GO) terms and reduce their redundancy by semantic similarity analysis and gene repetitive rate reduction. Then, we apply kernel principal component analysis to select features on each non-redundant functional gene set, and we combine the selected features (for each functional gene set) for subsequent clustering analysis. To test the performance of FEGFS, we apply agglomerative hierarchical clustering based on FEGFS and compare it with seven state-of-the-art clustering methods on six real scRNA-seq datasets. For small datasets like Pollen and Goolam, FEGFS outperforms all methods on all four evaluation metrics including adjusted Rand index (ARI), normalized mutual information (NMI), homogeneity score (HOM), and completeness score (COM). For example, the ARIs of FEGFS are 0.955 and 0.910, respectively, on Pollen and Goolam; and those of the second-best method are only 0.938 and 0.910, respectively. For large datasets, FEGFS also outperforms most methods. For example, the ARIs of FEGFS are 0.781 on both Klein and Zeisel, which are higher than those of all other methods except SC3, whose ARIs are slightly higher (0.798 and 0.807, respectively). Moreover, we demonstrate that CMF-Impute is powerful in reconstructing cell-to-cell and gene-to-gene correlation and in inferring cell lineage trajectories. As for application, taking glioma as an example, we demonstrated that our clustering methods could identify important cell clusters related to glioma and also inferred key marker genes related to these cell clusters.
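Because the comparison above is reported mainly through the adjusted Rand index, a compact reference computation may help readers interpret the numbers. The sketch below follows the standard pair-counting definition of ARI. The function name and the toy label vectors are illustrative and are not taken from the FEGFS paper.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index from the contingency table of two labelings."""
    n = len(labels_true)
    if n != len(labels_pred):
        raise ValueError("label vectors must have the same length")
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two cells: the labelings trivially agree

    # Pairs of cells placed together in both labelings, in the reference only,
    # and in the prediction only.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = sum_true * sum_pred / total_pairs  # agreement expected by chance
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings are a single cluster
    return (sum_cells - expected) / (max_index - expected)


# Identical partitions with permuted label names score 1.
perfect = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# Merging two reference clusters into one lowers the score.
partial = adjusted_rand_index([0, 0, 1, 2], [0, 0, 1, 1])
```

ARI is invariant to how clusters are named and is corrected for chance, which is why it can fall below zero when two labelings agree less than random assignment would.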


2020 ◽  
Author(s):  
Jia Song ◽  
Yao Liu ◽  
Xuebing Zhang ◽  
Qiuyue Wu ◽  
Juan Gao ◽  
...  

Abstract Single-cell RNA sequencing enables us to characterize cellular heterogeneity at single-cell resolution with the help of cell type identification algorithms. However, the noise inherent in single-cell RNA-sequencing data severely reduces the accuracy of cell clustering, marker identification and visualization. We propose that clustering based on feature density profiles can distinguish informative features from noise. We named this strategy ‘entropy subspace’ separation and designed a cell clustering algorithm called ENtropy subspace separation-based Clustering for nOise REduction (ENCORE) by integrating the ‘entropy subspace’ separation strategy with a consensus clustering method. We demonstrate that ENCORE achieves superior cell clustering and generates high-resolution visualization across 12 standard datasets. More importantly, ENCORE enables identification of group markers with biological significance from a hard-to-separate dataset. With the advantages of effective feature selection, improved clustering, accurate marker identification and high-resolution visualization, we present ENCORE to the community as an important tool for scRNA-seq data analysis to study cellular heterogeneity and discover group markers.


2018 ◽  
Author(s):  
Meghan C. Ferrall-Fairbanks ◽  
Markus Ball ◽  
Eric Padron ◽  
Philipp M. Altrock

ABSTRACT PURPOSE: Many cancers can be treated with targeted therapy. Almost inevitably, tumors develop resistance to targeted therapy, either from preexistence or by evolving new genotypes and traits. Intra-tumor heterogeneity serves as a reservoir for resistance, which often occurs due to selection of minor cellular sub-clones. On the level of gene expression, the ‘clonal’ heterogeneity can only be revealed by high-dimensional single cell methods. We propose to use a general diversity index (GDI) to quantify heterogeneity on multiple scales and relate it to disease evolution. METHODS: We focused on individual patient samples probed with single cell RNA sequencing to describe heterogeneity. We developed a pipeline to analyze single cell data, via sample normalization, clustering and mathematical interpretation using a generalized diversity measure, and exemplify the utility of this platform using single cell data. RESULTS: We focused on three sources of RNA sequencing data: two healthy bone marrow (BM) samples, two acute myeloid leukemia (AML) patients, each sampled before and after BM transplant (BMT), four samples of pre-sorted lineages, and six lung carcinoma patients with multi-region sampling. While healthy/normal samples scored low in diversity overall, GDI further quantified in which respect these samples differed. While a widely used Shannon diversity index sometimes reveals fewer differences, GDI exhibits differences in the number of potential key drivers or clonal richness. Comparing pre- and post-BMT AML samples did not reveal differences in heterogeneity, although they can be very different biologically. CONCLUSION: GDI can quantify cellular heterogeneity changes across a wide spectrum, even when standard measures, such as the Shannon index, do not. Our approach offers wide applications to quantify heterogeneity across samples and conditions.
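The abstract above does not state the formula behind the general diversity index. As a minimal sketch, assume it takes the form of a Hill number of order q, a common generalized diversity measure that contains richness (q = 0), the exponential of Shannon entropy (q → 1), and the inverse Simpson index (q = 2) as special cases. The function name and the toy cluster sizes below are hypothetical.

```python
from math import exp, log


def hill_diversity(cluster_sizes, q):
    """Hill number of order q for a list of cluster (clone) sizes.

    Assumed form: (sum_i p_i^q)^(1 / (1 - q)), with the q -> 1 limit
    given by exp(Shannon entropy).
    """
    total = sum(cluster_sizes)
    props = [s / total for s in cluster_sizes if s > 0]
    if abs(q - 1.0) < 1e-12:
        return exp(-sum(p * log(p) for p in props))
    return sum(p ** q for p in props) ** (1.0 / (1.0 - q))


# Four equally sized clusters: every order reports four effective clusters.
even = [hill_diversity([10, 10, 10, 10], q) for q in (0, 1, 2)]

# One dominant clone with three rare ones: low orders still count four
# clusters, high orders report close to one effective cluster.
skewed_richness = hill_diversity([97, 1, 1, 1], 0)
skewed_dominance = hill_diversity([97, 1, 1, 1], 2)
```

Scanning across q in this way shows how a single Shannon value (q = 1) can hide differences in rare-clone richness or in dominance that appear at other orders.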


2021 ◽  
Vol 13 (1) ◽  
Author(s):  
Sunny Z. Wu ◽  
Daniel L. Roden ◽  
Ghamdan Al-Eryani ◽  
Nenad Bartonicek ◽  
Kate Harvey ◽  
...  

Abstract Background High throughput single-cell RNA sequencing (scRNA-Seq) has emerged as a powerful tool for exploring cellular heterogeneity among complex human cancers. scRNA-Seq studies using fresh human surgical tissue are logistically difficult, preclude histopathological triage of samples, and limit the ability to perform batch processing. This hindrance can often introduce technical biases when integrating patient datasets and increase experimental costs. Although tissue preservation methods have been previously explored to address such issues, they have yet to be examined on complex human tissues, such as solid cancers, and on high throughput scRNA-Seq platforms. Methods Using the Chromium 10X platform, we sequenced a total of ~ 120,000 cells from fresh and cryopreserved replicates across three primary breast cancers, two primary prostate cancers and a cutaneous melanoma. We performed detailed analyses between cells from each condition to assess the effects of cryopreservation on cellular heterogeneity, cell quality, clustering and the identification of gene ontologies. In addition, we performed single-cell immunophenotyping using CITE-Seq on a single breast cancer sample cryopreserved as solid tissue fragments. Results Tumour heterogeneity identified from fresh tissues was largely conserved in cryopreserved replicates. We show that sequencing of single cells prepared from cryopreserved tissue fragments or from cryopreserved cell suspensions is comparable to sequenced cells prepared from fresh tissue, with cryopreserved cell suspensions displaying higher correlations with fresh tissue in gene expression. We showed that cryopreservation had minimal impacts on the results of downstream analyses such as biological pathway enrichment. For some tumours, cryopreservation modestly increased cell stress signatures compared to freshly analysed tissue. Further, we demonstrate the advantage of cryopreserving whole cells for detecting cell-surface proteins using CITE-Seq, which is impossible using other preservation methods such as single-nuclei sequencing. Conclusions We show that the viable cryopreservation of human cancers provides high-quality single cells for multi-omics analysis. Our study guides new experimental designs for tissue biobanking for future clinical single-cell RNA sequencing studies.


2020 ◽  
Vol 22 (Supplement_3) ◽  
pp. iii406-iii406
Author(s):  
Andrew Donson ◽  
Kent Riemondy ◽  
Sujatha Venkataraman ◽  
Ahmed Gilani ◽  
Bridget Sanford ◽  
...  

Abstract We explored cellular heterogeneity in medulloblastoma using single-cell RNA sequencing (scRNAseq), immunohistochemistry and deconvolution of bulk transcriptomic data. Over 45,000 cells from 31 patients from all main subgroups of medulloblastoma (2 WNT, 10 SHH, 9 GP3, 11 GP4 and 1 GP3/4) were clustered using Harmony alignment to identify conserved subpopulations. Each subgroup contained subpopulations exhibiting mitotic, undifferentiated and neuronal differentiated transcript profiles, corroborating other recent medulloblastoma scRNAseq studies. The magnitude of our present study builds on the findings of existing studies, providing further characterization of conserved neoplastic subpopulations, including identification of a photoreceptor-differentiated subpopulation that was predominantly, but not exclusively, found in GP3 medulloblastoma. Deconvolution of MAGIC transcriptomic cohort data showed that neoplastic subpopulations are associated with major and minor subgroup subdivisions; for example, photoreceptor subpopulation cells are more abundant in GP3-alpha. In both GP3 and GP4, higher proportions of undifferentiated subpopulations are associated with shorter survival and, conversely, the differentiated subpopulation is associated with longer survival. This scRNAseq dataset also afforded unique insights into the immune landscape of medulloblastoma, and revealed an M2-polarized myeloid subpopulation that was restricted to SHH medulloblastoma. Additionally, we performed scRNAseq on 16,000 cells from genetically engineered mouse (GEM) models of GP3 and SHH medulloblastoma. These models showed a level of fidelity with corresponding human subgroup-specific neoplastic and immune subpopulations. Collectively, our findings advance our understanding of the neoplastic and immune landscape of the main medulloblastoma subgroups in both humans and GEM models.


Author(s):  
Yinlei Hu ◽  
Bin Li ◽  
Falai Chen ◽  
Kun Qu

Abstract Unsupervised clustering is a fundamental step of single-cell RNA sequencing data analysis. This issue has inspired several clustering methods to classify cells in single-cell RNA sequencing data. However, accurate prediction of the cell clusters remains a substantial challenge. In this study, we propose a new algorithm for single-cell RNA sequencing data clustering based on Sparse Optimization and low-rank matrix factorization (scSO). We applied our scSO algorithm to analyze multiple benchmark datasets and showed that the cluster number predicted by scSO was close to the number of reference cell types and that most cells were correctly classified. Our scSO algorithm is available at https://github.com/QuKunLab/scSO. Overall, this study demonstrates a potent cell clustering approach that can help researchers distinguish cell types in single-cell RNA sequencing data.


2021 ◽  
Vol 12 (1) ◽  
Author(s):  
Dandan Cao ◽  
Rachel W. S. Chan ◽  
Ernest H. Y. Ng ◽  
Kristina Gemzell-Danielsson ◽  
William S. B. Yeung

Abstract Background Endometrial mesenchymal-like stromal/stem cells (eMSCs) have been proposed as adult stem cells contributing to endometrial regeneration. One set of perivascular markers (CD140b&CD146) has been widely used to enrich eMSCs. Although eMSCs are easily accessible for regenerative medicine and have long been studied, their cellular heterogeneity and relationship to their primary counterparts remain largely unclear. Methods In this study, we applied 10X Genomics single-cell RNA sequencing (scRNA-seq) to cultured human CD140b+CD146+ endometrial perivascular cells (ePCs) from menstrual and secretory endometrium. We also analyzed publicly available scRNA-seq data of primary endometrium and performed transcriptome comparison between cultured ePCs and primary ePCs at the single-cell level. Results Transcriptomic expression-based clustering revealed limited heterogeneity within cultured menstrual and secretory ePCs. A main subpopulation and a small stress-induced subpopulation were identified in secretory and menstrual ePCs. Cell identity analysis demonstrated similar cellular composition in secretory and menstrual ePCs. Marker gene expression analysis showed that the main subpopulations identified from cultured secretory and menstrual ePCs simultaneously expressed genes marking mesenchymal stem cells (MSCs), perivascular cells, smooth muscle cells, and stromal fibroblasts. GO enrichment analysis revealed that genes upregulated in the main subpopulation were enriched in actin filament organization, cellular division, etc., while genes upregulated in the small subpopulation were enriched in extracellular matrix disassembly, stress response, etc. Comparison of the subpopulations of cultured ePCs with publicly available primary endometrial cells showed that the main subpopulation identified from cultured ePCs was culture-unique, unlike primary ePCs or primary endometrial stromal fibroblast cells. Conclusion In summary, these data provide, for the first time, a single-cell atlas of cultured human CD140b+CD146+ ePCs. The identification of a culture-unique, relatively homogeneous cell population of CD140b+CD146+ ePCs underscores the importance of the in vivo microenvironment in maintaining cellular identity.


2022 ◽  
Vol 12 ◽  
Author(s):  
Xin Duan ◽  
Wei Wang ◽  
Minghui Tang ◽  
Feng Gao ◽  
Xudong Lin

Identifying the phenotypes and interactions of various cells is the primary objective in cellular heterogeneity dissection. A key step of this methodology is unsupervised clustering, which, however, often suffers from high levels of noise and redundant information. To overcome these limitations, we propose self-diffusion on local scaling affinity (LSSD) to enhance metric learning of cell similarities for dissecting cellular heterogeneity. Local scaling provides self-tuning of the cell-to-cell distances that are used to construct the cell affinity. Our approach implements the self-diffusion process by propagating the affinity matrices to further improve the cell similarities for the downstream clustering analysis. To demonstrate its effectiveness and usefulness, we applied LSSD to two simulated and four real scRNA-seq datasets. Compared with other single-cell clustering methods, our approach demonstrates much better clustering performance, and the cell types identified in colorectal tumors show strong biological interpretability.
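The abstract above does not give the affinity formula. As a hedged illustration of what "local scaling" commonly means, the sketch below uses the self-tuning form from the spectral clustering literature, A_ij = exp(-d_ij² / (σ_i σ_j)), where σ_i is the distance from cell i to its k-th nearest neighbour. The function names, the choice k = 2, and the toy coordinates are assumptions, and the self-diffusion step of LSSD is not reproduced here.

```python
from math import exp, sqrt


def euclidean(a, b):
    """Euclidean distance between two points."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def local_scaling_affinity(points, k=2):
    """Affinity matrix with a per-cell bandwidth (assumed self-tuning form)."""
    n = len(points)
    dist = [[euclidean(p, q) for q in points] for p in points]
    # sigma[i]: distance from cell i to its k-th nearest neighbour, excluding itself.
    sigma = [sorted(row[:i] + row[i + 1:])[k - 1] for i, row in enumerate(dist)]
    affinity = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # keep a zero diagonal
            scale = sigma[i] * sigma[j]
            if scale > 0:
                affinity[i][j] = exp(-dist[i][j] ** 2 / scale)
            else:
                # Duplicate points give a zero bandwidth; treat identical cells
                # as fully similar and all others as dissimilar.
                affinity[i][j] = 1.0 if dist[i][j] == 0 else 0.0
    return affinity


# Two tight groups of three "cells" each, far apart from one another.
cells = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
A = local_scaling_affinity(cells, k=2)
within = A[0][1]   # same group
between = A[0][3]  # different groups
```

Scaling each distance by the local neighbourhood sizes lets dense and sparse regions share one kernel without a global bandwidth, which is the motivation usually given for self-tuning affinities.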

