A completely parameter-free method for graph-based single cell RNA-seq clustering

2021 ◽  
Author(s):  
Maryam Zand ◽  
Jianhua Ruan

Single-cell RNA sequencing (scRNAseq) offers an unprecedented potential for scrutinizing complex biological systems at single-cell resolution. One of the most important applications of scRNAseq is to cluster cells into groups of similar expression profiles, which allows unsupervised identification of novel cell subtypes. While many clustering algorithms have been tested towards this goal, graph-based algorithms appear to be the most effective, due to their ability to accommodate the sparsity of the data, as well as the complex topology of the cell population. An integral part of almost all such clustering methods is the construction of a k-nearest-neighbor (KNN) network, and the choice of k, implicitly or explicitly, can have a profound impact on the density distribution of the graph and the structure of the resulting clusters, as well as the resolution of clusters that one can successfully identify from the data. In this work, we propose a fairly simple but robust approach to estimate the best k for constructing the KNN graph while simultaneously identifying the optimal clustering structure from the graph. Our method, named scQcut, employs a topology-based criterion to guide the construction of the KNN graph, and then applies an efficient modularity-based community discovery algorithm to predict robust cell clusters. The results obtained from applying scQcut on a large number of real and synthetic datasets demonstrated that scQcut, which does not require any user-tuned parameters, outperformed several popular state-of-the-art clustering methods in terms of clustering accuracy and the ability to correctly identify rare cell types. The promising results indicate that an accurate approximation of the parameter k, which determines the topology of the network, is a crucial element of a successful graph-based clustering method to recover the final community structure of the cell population.
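To make the role of k concrete, the following is a minimal sketch, not scQcut itself: it builds an unweighted KNN graph from toy 2-D points and scores a fixed partition with Newman modularity. The function names `knn_graph` and `modularity`, the toy data, and the use of Euclidean distance are assumptions for illustration; the abstract does not specify scQcut's topology-based criterion or its community discovery algorithm.

```python
import math


def knn_graph(points, k):
    """Return an undirected, unweighted KNN graph as a set of frozenset edges."""
    edges = set()
    for i, p in enumerate(points):
        # Rank every other point by Euclidean distance to point i.
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        # Connect point i to its k closest points.
        for j in others[:k]:
            edges.add(frozenset((i, j)))
    return edges


def modularity(edges, labels):
    """Newman modularity Q of a partition of an unweighted, undirected graph."""
    m = len(edges)
    if m == 0:
        return 0.0
    degree = {}
    internal = {}  # number of edges with both endpoints in the same cluster
    for edge in edges:
        u, v = tuple(edge)
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
        if labels[u] == labels[v]:
            internal[labels[u]] = internal.get(labels[u], 0) + 1
    degree_sum = {}  # total degree per cluster
    for node, d in degree.items():
        degree_sum[labels[node]] = degree_sum.get(labels[node], 0) + d
    return sum(
        internal.get(c, 0) / m - (degree_sum[c] / (2 * m)) ** 2
        for c in degree_sum
    )


# Toy data (hypothetical): two well-separated groups of three "cells".
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
labels = [0, 0, 0, 1, 1, 1]

for k in (1, 2, 3):
    q = modularity(knn_graph(points, k), labels)
    print(f"k={k}: edges={len(knn_graph(points, k))}, modularity={q:.3f}")
```

Scanning k and watching how the edge count and modularity change is only a crude stand-in for a principled criterion, but it shows why the choice of k shapes both the graph density and the clusters that can be recovered from it.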

Author(s):  
Ming Tang ◽  
Yasin Kaymaz ◽  
Brandon L Logeman ◽  
Stephen Eichhorn ◽  
Zhengzheng S Liang ◽  
...  

Motivation: One major goal of single-cell RNA sequencing (scRNAseq) experiments is to identify novel cell types. With increasingly large scRNAseq datasets, unsupervised clustering methods can now produce detailed catalogues of transcriptionally distinct groups of cells in a sample. However, the interpretation of these clusters is challenging for both technical and biological reasons. Popular clustering algorithms are sensitive to parameter choices, and can produce different clustering solutions with even small changes in the number of principal components used, the k nearest neighbor and the resolution parameters, among others. Results: Here, we present a set of tools to evaluate cluster stability by subsampling, which can guide parameter choice and aid in biological interpretation. The R package scclusteval and the accompanying Snakemake workflow implement all steps of the pipeline: subsampling the cells, repeating the clustering with Seurat, estimating cluster stability using the Jaccard similarity index, and providing rich visualizations. Availability and implementation: R package scclusteval: https://github.com/crazyhottommy/scclusteval Snakemake workflow: https://github.com/crazyhottommy/pyflow_seuratv3_parameter Tutorial: https://crazyhottommy.github.io/EvaluateSingleCellClustering/.
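The stability measure described above can be sketched in a few lines. This is an illustrative Python version with hypothetical names (`jaccard`, `best_match_jaccard`) and toy cell IDs; it is not the scclusteval API, which is an R package. Restricting each original cluster to the subsampled cells before comparison is an assumption made here so that a perfectly recovered cluster scores 1.

```python
def jaccard(a, b):
    """Jaccard similarity index |A & B| / |A | B| of two sets of cell IDs."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_match_jaccard(original, resampled, sampled_cells):
    """For each original cluster, return its highest Jaccard index against
    any cluster found in the subsample (comparing subsampled cells only)."""
    sampled = set(sampled_cells)
    scores = {}
    for name, cells in original.items():
        kept = set(cells) & sampled  # assumption: ignore cells not subsampled
        scores[name] = max(
            (jaccard(kept, other) for other in resampled.values()),
            default=0.0,
        )
    return scores


# Toy example (hypothetical): cluster A is recovered intact in the
# subsample, while cluster B is split into two smaller clusters.
original = {"A": {"c1", "c2", "c3", "c4"}, "B": {"c5", "c6", "c7", "c8"}}
sampled_cells = {"c1", "c2", "c3", "c5", "c6", "c7"}
resampled = {"0": {"c1", "c2", "c3"}, "1": {"c5", "c6"}, "2": {"c7"}}

scores = best_match_jaccard(original, resampled, sampled_cells)
for name, score in sorted(scores.items()):
    print(f"cluster {name}: best Jaccard = {score:.2f}")
```

Repeating this over many subsamples and summarizing the per-cluster scores gives a stability profile; a cluster that keeps splitting or merging scores consistently low.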


Genes ◽  
2019 ◽  
Vol 10 (2) ◽  
pp. 98 ◽  
Author(s):  
Xiaoshu Zhu ◽  
Hong-Dong Li ◽  
Yunpei Xu ◽  
Lilu Guo ◽  
Fang-Xiang Wu ◽  
...  

Single-cell RNA sequencing (scRNA-seq) has recently brought new insight into cell differentiation processes and functional variation in cell subtypes from homogeneous cell populations. A lack of prior knowledge makes unsupervised machine learning methods, such as clustering, suitable for analyzing scRNA-seq. However, there are several limitations to overcome, including high dimensionality, clustering result instability, and parameter adjustment complexity. In this study, we propose a method that combines structure entropy and k nearest neighbor to identify cell subpopulations in scRNA-seq data. In contrast to existing clustering methods for identifying cell subtypes, minimized structure entropy results in natural communities without specifying the number of clusters. To investigate the performance of our model, we applied it to eight scRNA-seq datasets and compared our method with three existing methods (nonnegative matrix factorization, single-cell interpretation via multikernel learning, and structural entropy minimization principle). The experimental results showed that our approach achieves, on average, better performance in these datasets compared to the benchmark methods.


2020 ◽  
Vol 18 (04) ◽  
pp. 2040005
Author(s):  
Ruiyi Li ◽  
Jihong Guan ◽  
Shuigeng Zhou

Clustering analysis has been widely applied to single-cell RNA-sequencing (scRNA-seq) data to discover cell types and cell states. Algorithms developed in recent years have greatly helped the understanding of cellular heterogeneity and the underlying mechanisms of biological processes. However, these algorithms often use different techniques, were evaluated on different datasets, and were usually compared with only some of their counterparts using different performance metrics. Consequently, an accurate and complete picture of their merits and demerits is lacking, which makes it difficult for users to select proper algorithms for analyzing their data. To fill this gap, we first review the major existing scRNA-seq data clustering methods, and then conduct a comprehensive performance comparison among them from multiple perspectives. We consider 13 state-of-the-art scRNA-seq data clustering algorithms, and collect 12 publicly available real scRNA-seq datasets from the existing works to evaluate and compare these algorithms. Our comparative study shows that the existing methods are very diverse in performance. Even the top-performing algorithms do not perform well on all datasets, especially those with complex structures. This suggests that further research is required to explore more stable, accurate, and efficient clustering algorithms for scRNA-seq data.


Author(s):  
Amine M. Bensaid ◽  
James C. Bezdek

This paper describes a class of models we call semi-supervised clustering. Algorithms in this category are clustering methods that use information possessed by labeled training data X_d ⊂ ℜ^p as well as structural information that resides in the unlabeled data X_u ⊂ ℜ^p. The labels are used in conjunction with the unlabeled data to help clustering algorithms partition X_u ⊂ ℜ^p, which then terminate without the capability to label other points in ℜ^p. This is very different from supervised learning, wherein the training data subsequently endow a classifier with the ability to label every point in ℜ^p. The methodology is applicable in domains such as image segmentation, where users may have a small set of labeled data, and can use it to semi-supervise classification of the remaining pixels in a single image. The model can be used with many different point prototype clustering algorithms. We illustrate how to attach it to a particular algorithm (fuzzy c-means). Then we give two numerical examples to show that it overcomes the failure of many point prototype clustering schemes when confronted with data that possess overlapping and/or non-uniformly distributed clusters. Finally, the new method compares favorably to the fully supervised k nearest neighbor rule when applied to the Iris data.


2017 ◽  
Author(s):  
Florian Wagner ◽  
Yun Yan ◽  
Itai Yanai

High-throughput single-cell RNA-Seq (scRNA-Seq) is a powerful approach for studying heterogeneous tissues and dynamic cellular processes. However, compared to bulk RNA-Seq, single-cell expression profiles are extremely noisy, as they only capture a fraction of the transcripts present in the cell. Here, we propose the k-nearest neighbor smoothing (kNN-smoothing) algorithm, designed to reduce noise by aggregating information from similar cells (neighbors) in a computationally efficient and statistically tractable manner. The algorithm is based on the observation that across protocols, the technical noise exhibited by UMI-filtered scRNA-Seq data closely follows Poisson statistics. Smoothing is performed by first identifying the nearest neighbors of each cell in a step-wise fashion, based on partially smoothed and variance-stabilized expression profiles, and then aggregating their transcript counts. We show that kNN-smoothing greatly improves the detection of clusters of cells and co-expressed genes, and clearly outperforms other smoothing methods on simulated data. To accurately perform smoothing for datasets containing highly similar cell populations, we propose the kNN-smoothing 2 algorithm, in which neighbors are determined after projecting the partially smoothed data onto the first few principal components. We show that unlike its predecessor, kNN-smoothing 2 can accurately distinguish between cells from different T cell subsets, and enables their identification in peripheral blood using unsupervised methods. Our work facilitates the analysis of scRNA-Seq data across a broad range of applications, including the identification of cell populations in heterogeneous tissues and the characterization of dynamic processes such as cellular differentiation. Reference implementations of our algorithms can be found at https://github.com/yanailab/knn-smoothing.
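The core idea of aggregating counts from similar cells can be illustrated with a deliberately simplified, single-pass sketch. It is not the reference implementation linked above: the published algorithm identifies neighbors step-wise on partially smoothed profiles, whereas this version does one pass. The function name `knn_smooth_once`, the toy count matrix, and the use of a square-root transform on library-size-scaled counts as a stand-in for variance stabilization are all assumptions.

```python
import math


def knn_smooth_once(counts, k):
    """Single-pass smoothing sketch.

    counts: list of cells, each a list of per-gene UMI counts.
    Returns a matrix in which each cell's profile is the sum of the raw
    counts of the cell itself and its k - 1 nearest neighbors.
    """
    # Scale each cell to unit total and take square roots
    # (assumed stand-in for a variance-stabilizing transform).
    profiles = []
    for cell in counts:
        total = sum(cell) or 1
        profiles.append([math.sqrt(c / total) for c in cell])

    smoothed = []
    n_genes = len(counts[0])
    for i, p in enumerate(profiles):
        # Rank all other cells by Euclidean distance in the transformed space.
        others = sorted(
            (j for j in range(len(profiles)) if j != i),
            key=lambda j: math.dist(p, profiles[j]),
        )
        neighbors = [i] + others[: k - 1]
        # Aggregate the raw (untransformed) counts over the neighborhood.
        smoothed.append(
            [sum(counts[j][g] for j in neighbors) for g in range(n_genes)]
        )
    return smoothed


# Toy matrix (hypothetical): cells 0-1 express gene 0, cells 2-3 express gene 2.
counts = [
    [9, 1, 0],
    [8, 2, 0],
    [0, 1, 9],
    [0, 2, 8],
]
smoothed = knn_smooth_once(counts, k=2)
for row in smoothed:
    print(row)
```

Summing raw counts, rather than averaging transformed values, keeps the smoothed profiles on the count scale, which matches the abstract's description of aggregating transcript counts.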


2021 ◽  
Vol 11 ◽  
Author(s):  
Xiaoteng Cui ◽  
Qixue Wang ◽  
Junhu Zhou ◽  
Yunfei Wang ◽  
Can Xu ◽  
...  

Background: The main immune cells in GBM are tumor-associated macrophages (TAMs). Thus far, the studies investigating the activation status of TAM in GBM are mainly limited to bulk RNA analyses of individual tumor biopsies. The activation states and transcriptional signatures of TAMs in GBM remain poorly characterized. Methods: We comprehensively analyzed single-cell RNA-sequencing data, covering a total of 16,201 cells, to clarify the relative proportions of the immune cells infiltrating GBMs. The origin and TAM states in GBM were characterized using the expression profiles of differential marker genes. The vital transcription factors were examined by SCENIC analysis. By comparing the variable gene expression patterns in different clusters and cell types, we identified components and characteristics of TAMs unique to each GBM subtype. Meanwhile, we interrogated the correlation between SPI1 expression and macrophage infiltration in the TCGA-GBM dataset. Results: The expression patterns of TMEM119 and MHC-II can be utilized to distinguish the origin and activation states of TAMs. In TCGA-Mixed tumors, almost all TAMs were bone marrow-derived macrophages. The TAMs in TCGA-proneural tumors were characterized by primed microglia. A different composition was observed in TCGA-classical tumors, which were infiltrated by repressed microglia. Our results further identified SPI1 as a crucial regulon and potential immunotherapeutic target important for TAM maturation and polarization in GBM. Conclusions: We describe the immune landscape of human GBM at a single-cell level and define a novel categorization scheme for TAMs in GBM. The immunotherapy against SPI1 would reprogram the immune environment of GBM and enhance the treatment effect of conventional chemotherapy drugs.


2021 ◽  
Author(s):  
Florian Schmidt ◽  
Bobby Ranjan ◽  
Quy Xiao Xuan Lin ◽  
Vaidehi Krishnan ◽  
Ignasius Joanito ◽  
...  

Motivation: The transcriptomic diversity of the hundreds of cell types in the human body can be analysed in unprecedented detail using single cell (SC) technologies. Though clustering of cellular transcriptomes is the default technique for defining cell types and subtypes, single cell clustering can be strongly influenced by technical variation. In fact, the prevalent unsupervised clustering algorithms can cluster cells by technical, rather than biological, variation. Results: Compared to de novo (unsupervised) clustering methods, we demonstrate using multiple benchmarks that supervised clustering, which uses reference transcriptomes as a guide, is robust to batch effects. To leverage the advantages of supervised clustering, we present RCA2, a new, scalable, and broadly applicable version of our RCA algorithm. RCA2 provides a user-friendly framework for supervised clustering and downstream analysis of large scRNA-seq data sets. RCA2 can be seamlessly incorporated into existing algorithmic pipelines. It incorporates various new reference panels for human and mouse, supports generation of custom panels and uses efficient graph-based clustering and sparse data structures to ensure scalability. We demonstrate the applicability of RCA2 on SC data from human bone marrow, healthy PBMCs and PBMCs from COVID-19 patients. Importantly, RCA2 facilitates cell-type-specific QC, which we show is essential for accurate clustering of SC data from heterogeneous tissues. In the era of cohort-scale SC analysis, supervised clustering methods such as RCA2 will facilitate unified analysis of diverse SC datasets. Availability: RCA2 is implemented in R and is available at github.com/prabhakarlab/RCAv2
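As a rough illustration of what using reference transcriptomes as a guide can mean, the sketch below scores one cell against a small reference panel by Pearson correlation and reports the best match. It is an assumption-laden toy, not RCA2's procedure or API (RCA2 is an R package): the names `pearson` and `project`, the reference labels, and the four-gene expression vectors are all hypothetical, and the abstract does not specify how RCA2 computes or clusters its projection.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length expression vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    # A constant vector has no defined correlation; report 0.0 in that case.
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0


def project(cell, references):
    """Correlate one cell's profile with every reference profile."""
    return {name: pearson(cell, ref) for name, ref in references.items()}


# Hypothetical reference panel over the same four genes, in the same order.
references = {
    "T cell": [5.0, 0.1, 0.2, 4.0],
    "Monocyte": [0.1, 6.0, 5.0, 0.3],
}
cell = [3.0, 0.0, 1.0, 2.0]

scores = project(cell, references)
best = max(scores, key=scores.get)
print(scores)
print("best-matching reference:", best)
```

Because each cell is described by its similarity to fixed references rather than by raw expression, the resulting features depend less on dataset-specific technical shifts, which is one intuition for the batch-effect robustness the abstract reports.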


2021 ◽  
Author(s):  
Guoli Ji ◽  
Wujing Xuan ◽  
Yibo Zhuang ◽  
Lishan Ye ◽  
Sheng Zhu ◽  
...  

Single-cell RNA-sequencing (scRNA-seq) has enabled transcriptome-wide profiling of gene expression in individual cells. A myriad of computational methods have been proposed to learn cell-cell similarities and/or cluster cells; however, the high variability and dropout rate inherent in scRNA-seq confound reliable quantification of cell-cell associations based on the gene expression profile alone. Recently, bioinformatics studies have emerged that capture key transcriptome information on alternative polyadenylation (APA) from standard scRNA-seq and reveal APA dynamics among cell types, suggesting the possibility of discerning cell identities with the APA profile. Complementary information at both layers of APA isoforms and genes creates great potential to develop cost-efficient approaches to dissect cell types based on multiple modalities derived from existing scRNA-seq data without changing experimental technologies. We proposed a toolkit called scLAPA for learning association for single-cell transcriptomics by combining single-cell profiling of gene expression and alternative polyadenylation derived from the same scRNA-seq data. We compared scLAPA with seven similarity metrics and five clustering methods using diverse scRNA-seq datasets. Comparative results showed that scLAPA is more effective and robust for learning cell-cell similarities and clustering cell types than competing methods. Moreover, with scLAPA we found two hidden subpopulations of peripheral blood mononuclear cells that were undetectable using the gene expression data alone. As a comprehensive toolkit, scLAPA provides a unique strategy to learn cell-cell associations, improve cell type clustering and discover novel cell types by augmentation of gene expression profiles with polyadenylation information, which can be incorporated in most existing scRNA-seq pipelines. scLAPA is available at https://github.com/BMILAB/scLAPA.


Author(s):  
Ming Tang ◽  
Yasin Kaymaz ◽  
Brandon Logeman ◽  
Stephen Eichhorn ◽  
ZhengZheng S. Liang ◽  
...  

Motivation: One major goal of single-cell RNA sequencing (scRNAseq) experiments is to identify novel cell types. With increasingly large scRNAseq datasets, unsupervised clustering methods can now produce detailed catalogues of transcriptionally distinct groups of cells in a sample. However, the interpretation of these clusters is challenging for both technical and biological reasons. Popular clustering algorithms are sensitive to parameter choices, and can produce different clustering solutions with even small changes in the number of principal components used, the k nearest neighbor, and the resolution parameters, among others. Results: Here, we present a set of tools to evaluate cluster stability by subsampling, which can guide parameter choice and aid in biological interpretation. The R package scclusteval and the accompanying Snakemake workflow implement all steps of the pipeline: subsampling the cells, repeating the clustering with Seurat, and estimation of cluster stability using the Jaccard similarity index. The Snakemake workflow takes advantage of high-performance computing clusters and dispatches jobs in parallel to available CPUs to speed up the analysis. The scclusteval package provides functions to facilitate the analysis of the output, including a series of rich visualizations. Availability: R package scclusteval: https://github.com/crazyhottommy/scclusteval Snakemake workflow: https://github.com/crazyhottommy/[email protected], [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.


2017 ◽  
Author(s):  
Yuan Cao ◽  
Junjie Zhu ◽  
Guangchun Han ◽  
Peilin Jia ◽  
Zhongming Zhao

Summary: Single-cell RNA sequencing (scRNA-Seq) is quickly becoming a powerful tool for high-throughput transcriptomic analysis of cell states and dynamics. Both the number and quality of scRNA-Seq datasets have dramatically increased recently. So far, there is no database that comprehensively collects and curates scRNA-Seq data in humans. Here, we present scRNASeqDB, a database that includes almost all the currently available human single cell transcriptome datasets (n = 36) covering 71 human cell lines or types and 8910 samples. Our online web interface allows users to query and visualize expression profiles of the gene(s) of interest, search for genes that are expressed in different cell types or groups, or retrieve differentially expressed genes between cell types or groups. The scRNASeqDB is a valuable resource for single cell transcriptional studies. Availability: The database is available at https://bioinfo.uth.edu/scrnaseqdb/. Contact: [email protected]

