Robust clustering and interpretation of scRNA-seq data using reference component analysis

Mapping Intimacies ◽

10.1101/2021.02.16.431527 ◽

2021 ◽

Author(s):

Florian Schmidt ◽

Bobby Ranjan ◽

Quy Xiao Xuan Lin ◽

Vaidehi Krishnan ◽

Ignasius Joanito ◽

...

Keyword(s):

Single Cell ◽

De Novo ◽

Clustering Algorithms ◽

Cell Types ◽

Unsupervised Clustering ◽

Data Sets ◽

Clustering Methods ◽

Robust Clustering ◽

Supervised Clustering ◽

Downstream Analysis

MotivationThe transcriptomic diversity of the hundreds of cell types in the human body can be analysed in unprecedented detail using single cell (SC) technologies. Though clustering of cellular transcriptomes is the default technique for defining cell types and subtypes, single cell clustering can be strongly influenced by technical variation. In fact, the prevalent unsupervised clustering algorithms can cluster cells by technical, rather than biological, variation.ResultsCompared to de novo (unsupervised) clustering methods, we demonstrate using multiple benchmarks that supervised clustering, which uses reference transcriptomes as a guide, is robust to batch effects. To leverage the advantages of supervised clustering, we present RCA2, a new, scalable, and broadly applicable version of our RCA algorithm. RCA2 provides a user-friendly framework for supervised clustering and downstream analysis of large scRNA-seq data sets. RCA2 can be seamlessly incorporated into existing algorithmic pipelines. It incorporates various new reference panels for human and mouse, supports generation of custom panels and uses efficient graph-based clustering and sparse data structures to ensure scalability. We demonstrate the applicability of RCA2 on SC data from human bone marrow, healthy PBMCs and PBMCs from COVID-19 patients. Importantly, RCA2 facilitates cell-type-specific QC, which we show is essential for accurate clustering of SC data from heterogeneous tissues. In the era of cohort-scale SC analysis, supervised clustering methods such as RCA2 will facilitate unified analysis of diverse SC datasets.AvailabilityRCA2 is implemented in R and is available at github.com/prabhakarlab/RCAv2

Download Full-text

Enhancing droplet-based single-nucleus RNA-seq resolution using the semi-supervised machine learning classifier DIEM

10.1101/786285 ◽

2019 ◽

Cited By ~ 4

Author(s):

Marcus Alvarez ◽

Elior Rahmani ◽

Brandon Jew ◽

Kristina M. Garske ◽

Zong Miao ◽

...

Keyword(s):

Gene Expression ◽

Single Cell ◽

Cell Types ◽

Supervised Machine Learning ◽

Data Sets ◽

Rna Seq ◽

Novel Approach ◽

Single Nucleus ◽

Downstream Analysis

AbstractSingle-nucleus RNA sequencing (snRNA-seq) measures gene expression in individual nuclei instead of cells, allowing for unbiased cell type characterization in solid tissues. Contrary to single-cell RNA seq (scRNA-seq), we observe that snRNA-seq is commonly subject to contamination by high amounts of extranuclear background RNA, which can lead to identification of spurious cell types in downstream clustering analyses if overlooked. We present a novel approach to remove debris-contaminated droplets in snRNA-seq experiments, called Debris Identification using Expectation Maximization (DIEM). Our likelihood-based approach models the gene expression distribution of debris and cell types, which are estimated using EM. We evaluated DIEM using three snRNA-seq data sets: 1) human differentiating preadipocytes in vitro, 2) fresh mouse brain tissue, and 3) human frozen adipose tissue (AT) from six individuals. All three data sets showed various degrees of extranuclear RNA contamination. We observed that existing methods fail to account for contaminated droplets and led to spurious cell types. When compared to filtering using these state of the art methods, DIEM better removed droplets containing high levels of extranuclear RNA and led to higher quality clusters. Although DIEM was designed for snRNA-seq data, we also successfully applied DIEM to single-cell data. To conclude, our novel method DIEM removes debris-contaminated droplets from single-cell-based data fast and effectively, leading to cleaner downstream analysis. Our code is freely available for use at https://github.com/marcalva/diem.

Download Full-text

Single-cell RNA-seq data clustering: A survey with performance comparison study

Journal of Bioinformatics and Computational Biology ◽

10.1142/s0219720020400053 ◽

2020 ◽

Vol 18 (04) ◽

pp. 2040005

Author(s):

Ruiyi Li ◽

Jihong Guan ◽

Shuigeng Zhou

Keyword(s):

Single Cell ◽

Data Clustering ◽

Performance Metrics ◽

Clustering Algorithms ◽

Cell Types ◽

Performance Comparison ◽

Cellular Heterogeneity ◽

Clustering Methods ◽

Multiple Perspectives ◽

Underlying Mechanisms

Clustering analysis has been widely applied to single-cell RNA-sequencing (scRNA-seq) data to discover cell types and cell states. Algorithms developed in recent years have greatly helped the understanding of cellular heterogeneity and the underlying mechanisms of biological processes. However, these algorithms often use different techniques, were evaluated on different datasets and compared with some of their counterparts usually using different performance metrics. Consequently, there lacks an accurate and complete picture of their merits and demerits, which makes it difficult for users to select proper algorithms for analyzing their data. To fill this gap, we first do a review on the major existing scRNA-seq data clustering methods, and then conduct a comprehensive performance comparison among them from multiple perspectives. We consider 13 state of the art scRNA-seq data clustering algorithms, and collect 12 publicly available real scRNA-seq datasets from the existing works to evaluate and compare these algorithms. Our comparative study shows that the existing methods are very diverse in performance. Even the top-performance algorithms do not perform well on all datasets, especially those with complex structures. This suggests that further research is required to explore more stable, accurate, and efficient clustering algorithms for scRNA-seq data.

Download Full-text

A systematic performance evaluation of clustering methods for single-cell RNA-seq data

F1000Research ◽

10.12688/f1000research.15666.1 ◽

2018 ◽

Vol 7 ◽

pp. 1141 ◽

Cited By ~ 50

Author(s):

Angelo Duò ◽

Mark D. Robinson ◽

Charlotte Soneson

Keyword(s):

Performance Evaluation ◽

Single Cell ◽

Clustering Algorithms ◽

General Purpose ◽

Data Sets ◽

Consensus Clustering ◽

Rna Seq ◽

Clustering Methods ◽

Run Time ◽

The Stability

Subpopulation identification, usually via some form of unsupervised clustering, is a fundamental step in the analysis of many single-cell RNA-seq data sets. This has motivated the development and application of a broad range of clustering methods, based on various underlying algorithms. Here, we provide a systematic and extensible performance evaluation of 12 clustering algorithms, including both methods developed explicitly for scRNA-seq data and more general-purpose methods. The methods were evaluated using 9 publicly available scRNA-seq data sets as well as three simulations with varying degree of cluster separability. The same feature selection approaches were used for all methods, allowing us to focus on the investigation of the performance of the clustering algorithms themselves. We evaluated the ability of recovering known subpopulations, the stability and the run time of the methods. Additionally, we investigated whether the performance could be improved by generating consensus partitions from multiple individual clustering methods. We found substantial differences in the performance, run time and stability between the methods, with SC3 and Seurat showing the most favorable results. Additionally, we found that consensus clustering typically did not improve the performance compared to the best of the combined methods, but that several of the top-performing methods already perform some type of consensus clustering. The R scripts providing an extensible framework for the evaluation of new methods and data sets are available on GitHub (https://github.com/markrobinsonuzh/scRNAseq_clustering_comparison).

Download Full-text

A Review article on Semi- Supervised Clustering Framework for High Dimensional Data

International Journal of Scientific Research in Computer Science Engineering and Information Technology ◽

10.32628/cseit195410 ◽

2019 ◽

pp. 102-108

Author(s):

M. Pavithra ◽

R. M. S. Parvathi

Keyword(s):

Expert Knowledge ◽

Cluster Formation ◽

Clustering Algorithms ◽

Ensemble Member ◽

Unsupervised Clustering ◽

Outcome Variable ◽

High Dimensional ◽

Clustering Methods ◽

Data Set ◽

Supervised Clustering

Cluster analysis methods seek to partition a data set into homogeneous subgroups. It is useful in a wide variety of applications, including document processing and modern genetics. Conventional clustering methods are unsupervised, meaning that there is no outcome variable nor is anything known about the relationship between the observations in the data set. In many situations, however, information about the clusters is available in addition to the values of the features [2]. For example, the cluster labels of some observations may be known, or certain observations may be known to belong to the same cluster. In other cases, one may wish to identify clusters that are associated with a particular outcome variable. This review describes several clustering algorithms (known as �semi-supervised clustering� methods) that can be applied in these situations [3]. The majority of these methods are modifications of the popular k-means clustering method, and several of them will be described in detail. A brief description of some other semi-supervised clustering algorithms is also provided. Cluster formation has three types as supervised clustering, unsupervised clustering and semi supervised. This paper reviews traditional and state-of-the-art methods of clustering [1]. Clustering algorithms are based on active learning, with ensemble clustering-means algorithm, data streams with flock, fuzzy clustering for shape annotations, Incremental semi supervised clustering, Weakly supervised clustering, with minimum labeled data, self-organizing based on neural networks. Incremental semi-supervised clustering ensemble framework (ISSCE) which makes utilization of the advantage of the arbitrary subspace method, the limitation spread approach, the proposed incremental ensemble member choice process, and the normalized cut algorithm to perform high dimensional information clustering [4]. Semi-supervised clustering employs limited supervision in the form of labeled instances or pairwise instance constraints to aid unsupervised clustering and often significantly improves the clustering performance. Despite the vast amount of expert knowledge spent on this problem, most existing work is not designed for handling high-dimensional sparse data.

Download Full-text

Evaluating single-cell cluster stability using the Jaccard similarity index

Bioinformatics ◽

10.1093/bioinformatics/btaa956 ◽

2020 ◽

Author(s):

Ming Tang ◽

Yasin Kaymaz ◽

Brandon L Logeman ◽

Stephen Eichhorn ◽

Zhengzheng S Liang ◽

...

Keyword(s):

Single Cell ◽

Nearest Neighbor ◽

Clustering Algorithms ◽

Similarity Index ◽

Cell Types ◽

R Package ◽

Clustering Methods ◽

K Nearest Neighbor ◽

Jaccard Similarity ◽

Cluster Stability

Abstract Motivation One major goal of single-cell RNA sequencing (scRNAseq) experiments is to identify novel cell types. With increasingly large scRNAseq datasets, unsupervised clustering methods can now produce detailed catalogues of transcriptionally distinct groups of cells in a sample. However, the interpretation of these clusters is challenging for both technical and biological reasons. Popular clustering algorithms are sensitive to parameter choices, and can produce different clustering solutions with even small changes in the number of principal components used, the k nearest neighbor and the resolution parameters, among others. Results Here, we present a set of tools to evaluate cluster stability by subsampling, which can guide parameter choice and aid in biological interpretation. The R package scclusteval and the accompanying Snakemake workflow implement all steps of the pipeline: subsampling the cells, repeating the clustering with Seurat and estimation of cluster stability using the Jaccard similarity index and providing rich visualizations. Availabilityand implementation R package scclusteval: https://github.com/crazyhottommy/scclusteval Snakemake workflow: https://github.com/crazyhottommy/pyflow_seuratv3_parameter Tutorial: https://crazyhottommy.github.io/EvaluateSingleCellClustering/.

Download Full-text

A systematic performance evaluation of clustering methods for single-cell RNA-seq data

F1000Research ◽

10.12688/f1000research.15666.3 ◽

2020 ◽

Vol 7 ◽

pp. 1141 ◽

Cited By ~ 1

Author(s):

Angelo Duò ◽

Mark D. Robinson ◽

Charlotte Soneson

Keyword(s):

Performance Evaluation ◽

Single Cell ◽

Clustering Algorithms ◽

General Purpose ◽

Data Sets ◽

Consensus Clustering ◽

Rna Seq ◽

Clustering Methods ◽

Link Type ◽

Run Time

Subpopulation identification, usually via some form of unsupervised clustering, is a fundamental step in the analysis of many single-cell RNA-seq data sets. This has motivated the development and application of a broad range of clustering methods, based on various underlying algorithms. Here, we provide a systematic and extensible performance evaluation of 14 clustering algorithms implemented in R, including both methods developed explicitly for scRNA-seq data and more general-purpose methods. The methods were evaluated using nine publicly available scRNA-seq data sets as well as three simulations with varying degree of cluster separability. The same feature selection approaches were used for all methods, allowing us to focus on the investigation of the performance of the clustering algorithms themselves. We evaluated the ability of recovering known subpopulations, the stability and the run time and scalability of the methods. Additionally, we investigated whether the performance could be improved by generating consensus partitions from multiple individual clustering methods. We found substantial differences in the performance, run time and stability between the methods, with SC3 and Seurat showing the most favorable results. Additionally, we found that consensus clustering typically did not improve the performance compared to the best of the combined methods, but that several of the top-performing methods already perform some type of consensus clustering. All the code used for the evaluation is available on GitHub (https://github.com/markrobinsonuzh/scRNAseq_clustering_comparison). In addition, an R package providing access to data and clustering results, thereby facilitating inclusion of new methods and data sets, is available from Bioconductor (https://bioconductor.org/packages/DuoClustering2018).

Download Full-text

A systematic performance evaluation of clustering methods for single-cell RNA-seq data

F1000Research ◽

10.12688/f1000research.15666.2 ◽

2018 ◽

Vol 7 ◽

pp. 1141 ◽

Cited By ~ 62

Author(s):

Angelo Duò ◽

Mark D. Robinson ◽

Charlotte Soneson

Keyword(s):

Performance Evaluation ◽

Single Cell ◽

Clustering Algorithms ◽

General Purpose ◽

Data Sets ◽

Consensus Clustering ◽

Rna Seq ◽

Clustering Methods ◽

Link Type ◽

Run Time

Subpopulation identification, usually via some form of unsupervised clustering, is a fundamental step in the analysis of many single-cell RNA-seq data sets. This has motivated the development and application of a broad range of clustering methods, based on various underlying algorithms. Here, we provide a systematic and extensible performance evaluation of 14 clustering algorithms implemented in R, including both methods developed explicitly for scRNA-seq data and more general-purpose methods. The methods were evaluated using nine publicly available scRNA-seq data sets as well as three simulations with varying degree of cluster separability. The same feature selection approaches were used for all methods, allowing us to focus on the investigation of the performance of the clustering algorithms themselves. We evaluated the ability of recovering known subpopulations, the stability and the run time and scalability of the methods. Additionally, we investigated whether the performance could be improved by generating consensus partitions from multiple individual clustering methods. We found substantial differences in the performance, run time and stability between the methods, with SC3 and Seurat showing the most favorable results. Additionally, we found that consensus clustering typically did not improve the performance compared to the best of the combined methods, but that several of the top-performing methods already perform some type of consensus clustering. All the code used for the evaluation is available on GitHub (https://github.com/markrobinsonuzh/scRNAseq_clustering_comparison). In addition, an R package providing access to data and clustering results, thereby facilitating inclusion of new methods and data sets, is available from Bioconductor (https://bioconductor.org/packages/DuoClustering2018).

Download Full-text

Matrix factorization and transfer learning uncover regulatory biology across multiple single-cell ATAC-seq data sets

10.1101/2020.01.30.927129 ◽

2020 ◽

Author(s):

Rossin Erbe ◽

Michael D. Kessler ◽

Alexander V. Favorov ◽

Hariharan Easwaran ◽

Daria A. Gaykalova ◽

...

Keyword(s):

Single Cell ◽

Transfer Learning ◽

Matrix Factorization ◽

Cell Types ◽

Data Sets ◽

Analysis Framework ◽

Learning Program ◽

Robust Clustering ◽

Factorization Algorithm ◽

Analysis Methods

AbstractWhile single-cell ATAC-seq analysis methods allow for robust clustering of cell types, the question of how to integrate multiple scATAC-seq data sets and/or sequencing modalities is still open. We present an analysis framework that enables such integration by applying the CoGAPS Matrix Factorization algorithm and the projectR transfer learning program to identify common regulatory patterns across scATAC-seq data sets. Using publicly available scATAC-seq data, we find patterns that accurately characterize cell types both within and across data sets. Furthermore, we demonstrate that these patterns are both consistent with current biological understanding and reflective of novel regulatory biology.

Download Full-text

A New semi-supervised clustering for incomplete data

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-189744 ◽

2021 ◽

pp. 1-13

Author(s):

Sonia Goel ◽

Meena Tushir

Keyword(s):

Incomplete Data ◽

Missing Values ◽

Clustering Algorithms ◽

Complete Data ◽

Unlabeled Data ◽

Misclassification Rate ◽

Data Sets ◽

Clustering Methods ◽

Data Set ◽

Supervised Clustering

Semi-supervised clustering technique partitions the unlabeled data based on prior knowledge of labeled data. Most of the semi-supervised clustering algorithms exist only for the clustering of complete data, i.e., the data sets with no missing features. In this paper, an effort has been made to check the effectiveness of semi-supervised clustering when applied to incomplete data sets. The novelty of this approach is that it considers the missing features along with available knowledge (labels) of the data set. The linear interpolation imputation technique initially imputes the missing features of the data set, thus completing the data set. A semi-supervised clustering is now employed on this complete data set, and missing features are regularly updated within the clustering process. In the proposed work, the labeled percentage range used is 30, 40, 50, and 60% of the total data. Data is further altered by arbitrarily eliminating certain features of its components, which makes the data incomplete with partial labeling. The proposed algorithm utilizes both labeled and unlabeled data, along with certain missing values in the data. The proposed algorithm is evaluated using three performance indices, namely the misclassification rate, random index metric, and error rate. Despite the additional missing features, the proposed algorithm has been successfully implemented on real data sets and showed better/competing results than well-known standard semi-supervised clustering methods.

Download Full-text

A completely parameter-free method for graph-based single cell RNA-seq clustering

10.1101/2021.07.15.452521 ◽

2021 ◽

Author(s):

Maryam Zand ◽

Jianhua Ruan

Keyword(s):

Single Cell ◽

Cell Population ◽

Nearest Neighbor ◽

Expression Profiles ◽

Clustering Algorithms ◽

Cell Types ◽

Clustering Methods ◽

K Nearest Neighbor ◽

Synthetic Datasets ◽

Almost All

Single-cell RNA sequencing (scRNAseq) offers an unprecedented potential for scrutinizing complex biological systems at single cell resolution. One of the most important applications of scRNAseq is to cluster cells into groups of similar expression profiles, which allows unsupervised identification of novel cell subtypes. While many clustering algorithms have been tested towards this goal, graph-based algorithms appear to be the most effective, due to their ability to accommodate the sparsity of the data, as well as the complex topology of the cell population. An integral part of almost all such clustering methods is the construction of a k-nearest-neighbor (KNN) network, and the choice of k, implicitly or explicitly, can have a profound impact on the density distribution of the graph and the structure of the resulting clusters, as well as the resolution of clusters that one can successfully identify from the data. In this work, we propose a fairly simple but robust approach to estimate the best k for constructing the KNN graph while simultaneously identifying the optimal clustering structure from the graph. Our method, named scQcut, employs a topology-based criterion to guide the construction of KNN graph, and then applies an efficient modularity-based community discovery algorithm to predict robust cell clusters. The results obtained from applying scQcut on a large number of real and synthetic datasets demonstrated that scQcut-which does not require any user-tuned parameters-outperformed several popular state-of-the-art clustering methods in terms of clustering accuracy and the ability to correctly identify rare cell types. The promising results indicate that an accurate approximation of the parameter k, which determines the topology of the network, is a crucial element of a successful graph-based clustering method to recover the final community structure of the cell population.

Download Full-text