HGC: fast hierarchical clustering for large-scale single-cell data

10.1101/2021.02.07.430106 ◽

2021 ◽

Author(s):

Ziheng Zou ◽

Kui Hua ◽

Xuegong Zhang

Keyword(s):

Single Cell ◽

Hierarchical Clustering ◽

Large Scale ◽

Nearest Neighbor ◽

Linear Time ◽

Fixed Number ◽

Large Datasets ◽

Clustering Methods ◽

Shared Nearest Neighbor ◽

Cell Data

Abstract: Clustering is a key step in revealing heterogeneities in single-cell data. Cell heterogeneity can be explored at different resolutions, and the resulting cell states are inherently nested. However, most existing single-cell clustering methods output a fixed number of clusters without hierarchical information. Classical hierarchical clustering provides a dendrogram of cells but cannot scale to large datasets because of its high computational complexity. We present HGC, a fast Hierarchical Graph-based Clustering method that addresses both problems. It combines the advantages of graph-based clustering and hierarchical clustering. On the shared nearest neighbor graph of cells, HGC constructs the hierarchical tree with linear time complexity. Experiments showed that HGC enables multiresolution exploration of the biological hierarchy underlying the data, achieves state-of-the-art accuracy on benchmark data, and can scale to large datasets. HGC is freely available for academic use at https://www.github.com/XuegongLab/[email protected], [email protected]
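The abstract builds its tree on a shared nearest neighbor (SNN) graph of cells. As a rough illustration of that input structure only, the sketch below constructs a small SNN graph in plain Python. It is not HGC's implementation: the function names, the brute-force neighbor search, and the choice of edge weight (the number of neighbors two cells have in common) are all assumptions made for illustration, and the quadratic loops here do not reproduce the linear-time behavior the abstract reports.

```python
from math import dist


def knn_indices(points, k):
    """Return, for each point, the set of indices of its k nearest neighbors.

    Brute-force search; the point itself is excluded from its own neighbor set.
    """
    neighbors = []
    for i, p in enumerate(points):
        order = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: dist(p, points[j]),
        )
        neighbors.append(set(order[:k]))
    return neighbors


def snn_graph(points, k):
    """Build an SNN graph as a dict mapping (i, j) with i < j to an edge weight.

    Assumed weighting: the number of k-nearest neighbors shared by i and j.
    Pairs with no shared neighbors get no edge.
    """
    nbrs = knn_indices(points, k)
    edges = {}
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            shared = len(nbrs[i] & nbrs[j])
            if shared > 0:
                edges[(i, j)] = shared
    return edges


# Toy "cells": two well-separated groups of three points each.
cells = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
graph = snn_graph(cells, k=2)
```

In this toy example every edge connects two cells from the same group, which is the property a graph-based clustering step would then exploit.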

mbkmeans: Fast clustering for single cell data using mini-batch k-means

PLoS Computational Biology ◽

10.1371/journal.pcbi.1008625 ◽

2021 ◽

Vol 17 (1) ◽

pp. e1008625

Author(s):

Stephanie C. Hicks ◽

Ruoxi Liu ◽

Yuwei Ni ◽

Elizabeth Purdom ◽

Davide Risso

Keyword(s):

Single Cell ◽

Clustering Algorithms ◽

Large Datasets ◽

Clustering Methods ◽

Cell Clustering ◽

Genome Wide ◽

Data Representations ◽

Computing Performance ◽

Cell Data ◽

Genome Wide Gene Expression

Single-cell RNA-Sequencing (scRNA-seq) is the most widely used high-throughput technology to measure genome-wide gene expression at the single-cell level. One of the most common analyses of scRNA-seq data detects distinct subpopulations of cells through the use of unsupervised clustering algorithms. However, recent advances in scRNA-seq technologies result in current datasets ranging from thousands to millions of cells. Popular clustering algorithms, such as k-means, typically require the data to be loaded entirely into memory and therefore can be slow or impossible to run with large datasets. To address this problem, we developed the mbkmeans R/Bioconductor package, an open-source implementation of the mini-batch k-means algorithm. Our package allows for on-disk data representations, such as the common HDF5 file format widely used for single-cell data, that do not require all the data to be loaded into memory at one time. We demonstrate the performance of the mbkmeans package using large datasets, including one with 1.3 million cells. We also highlight and compare the computing performance of mbkmeans against the standard implementation of k-means and other popular single-cell clustering methods. Our software package is available in Bioconductor at https://bioconductor.org/packages/mbkmeans.
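The mini-batch k-means idea mentioned in the abstract can be sketched in a few lines. The code below is a generic illustration in plain Python, not the mbkmeans package (which is an R/Bioconductor package): the function name, parameters, and toy data are assumptions. Each iteration touches only a small random batch, which is what makes the approach compatible with on-disk storage; here the data sit in an in-memory list for simplicity, whereas an on-disk version would read each batch from a file instead.

```python
import random
from math import dist


def mini_batch_kmeans(data, centers, batch_size, n_iter, seed=0):
    """Generic mini-batch k-means with per-center learning rates.

    data: list of points (tuples); centers: initial centers (hypothetical inputs).
    Only `batch_size` points are examined per iteration.
    """
    rng = random.Random(seed)
    centers = [list(c) for c in centers]
    counts = [0] * len(centers)  # number of points each center has absorbed
    for _ in range(n_iter):
        batch = rng.sample(data, batch_size)
        # Assign every batch point to its nearest center first ...
        assigned = [
            min(range(len(centers)), key=lambda c: dist(x, centers[c]))
            for x in batch
        ]
        # ... then move each center toward its points with a shrinking step size.
        for x, c in zip(batch, assigned):
            counts[c] += 1
            eta = 1.0 / counts[c]
            centers[c] = [(1 - eta) * m + eta * v for m, v in zip(centers[c], x)]
    return centers


# Toy data: two Gaussian blobs around (0, 0) and (10, 10).
data_rng = random.Random(1)
blob_a = [(data_rng.gauss(0, 0.5), data_rng.gauss(0, 0.5)) for _ in range(200)]
blob_b = [(data_rng.gauss(10, 0.5), data_rng.gauss(10, 0.5)) for _ in range(200)]
fitted = mini_batch_kmeans(
    blob_a + blob_b, centers=[(1.0, 1.0), (9.0, 9.0)], batch_size=20, n_iter=50
)
```

With the step size 1/count, each center ends up as the running mean of the batch points assigned to it, so memory use depends on the batch size rather than on the number of cells.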

dropClust: Efficient clustering of ultra-large scRNA-seq data

10.1101/170308 ◽

2017 ◽

Cited By ~ 2

Author(s):

Debajyoti Sinha ◽

Akhilesh Kumar ◽

Himanshu Kumar ◽

Sanghamitra Bandyopadhyay ◽

Debarka Sengupta

Keyword(s):

Single Cell ◽

Large Scale ◽

Best Practice ◽

Clustering Algorithm ◽

Nearest Neighbor ◽

De Novo ◽

Single Cells ◽

Nearest Neighbor Search ◽

Locality Sensitive Hashing ◽

Clustering Methods

Abstract: Droplet-based single-cell transcriptomics has recently enabled parallel screening of tens of thousands of single cells. Clustering methods that scale to such high-dimensional data without compromising accuracy are scarce. We exploit Locality Sensitive Hashing, an approximate nearest neighbor search technique, to develop a de novo clustering algorithm for large-scale single-cell data. On a number of real datasets, dropClust outperformed the existing best-practice methods in terms of execution time, clustering accuracy, and detectability of minor cell sub-types.
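To make the Locality Sensitive Hashing idea concrete, here is a minimal sketch of one common LSH family, random-hyperplane hashing for cosine similarity. The abstract does not say which LSH scheme dropClust uses, so this is a swapped-in textbook variant rather than the authors' method; all function names and parameters are hypothetical. Vectors pointing in similar directions tend to receive the same bit signature, so a nearest neighbor query only needs to inspect one bucket instead of the whole dataset.

```python
import random
from collections import defaultdict


def make_hyperplanes(n_planes, dim, seed=0):
    """Draw random hyperplane normals with Gaussian entries."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_planes)]


def signature(vec, planes):
    """One bit per hyperplane: 1 if vec lies on its non-negative side, else 0."""
    return tuple(
        int(sum(p_i * v_i for p_i, v_i in zip(plane, vec)) >= 0)
        for plane in planes
    )


def build_index(vectors, planes):
    """Group vector indices into buckets keyed by their signature."""
    buckets = defaultdict(list)
    for idx, vec in enumerate(vectors):
        buckets[signature(vec, planes)].append(idx)
    return buckets


def candidates(query, planes, buckets):
    """Return indices that share the query's bucket (approximate neighbors)."""
    return buckets.get(signature(query, planes), [])


# Toy vectors: 0 and 1 point the same way, 2 points the opposite way.
vectors = [(1.0, 2.0, 3.0), (2.0, 4.0, 6.0), (-1.0, -2.0, -3.0)]
planes = make_hyperplanes(n_planes=8, dim=3)
index = build_index(vectors, planes)
hits = candidates(vectors[0], planes, index)
```

Using more hyperplanes makes buckets smaller and more selective; practical systems usually combine several such tables to recover neighbors that a single table misses.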

Differential abundance testing on single-cell data using k-nearest neighbor graphs

Nature Biotechnology ◽

10.1038/s41587-021-01033-z ◽

2021 ◽

Author(s):

Emma Dann ◽

Neil C. Henderson ◽

Sarah A. Teichmann ◽

Michael D. Morgan ◽

John C. Marioni

Keyword(s):

Single Cell ◽

Nearest Neighbor ◽

K Nearest Neighbor ◽

Differential Abundance ◽

Cell Data

EpiScanpy: integrated single-cell epigenomic analysis

10.1101/648097 ◽

2019 ◽

Cited By ~ 4

Author(s):

Anna Danese ◽

Maria L. Richter ◽

David S. Fischer ◽

Fabian J. Theis ◽

Maria Colomé-Tatché

Keyword(s):

Dna Methylation ◽

Single Cell ◽

Large Scale ◽

Feature Space ◽

Rna Seq ◽

Computational Framework ◽

Learning Techniques ◽

Multiple Feature ◽

The Many ◽

Cell Data

Abstract: Epigenetic single-cell measurements reveal a layer of regulatory information not accessible to single-cell transcriptomics; however, single-cell omics analysis tools mainly focus on gene expression data. To address this issue, we present epiScanpy, a computational framework for the analysis of single-cell DNA methylation and single-cell ATAC-seq data. EpiScanpy makes the many existing RNA-seq workflows from scanpy available to large-scale single-cell data from other omics modalities. We introduce and compare multiple feature space constructions for epigenetic data and show the feasibility of common clustering, dimension reduction, and trajectory learning techniques. We benchmark epiScanpy by interrogating different single-cell mouse brain atlases of DNA methylation, ATAC-seq, and transcriptomics. We find that differentially methylated and differentially open markers between cell clusters enrich transcriptome-based cell type labels with orthogonal epigenetic information.

gpps: an ILP-based approach for inferring cancer progression with mutation losses from single cell data

BMC Bioinformatics ◽

10.1186/s12859-020-03736-7 ◽

2020 ◽

Vol 21 (S1) ◽

Author(s):

Simone Ciccolella ◽

Mauricio Soto Gomez ◽

Murray D. Patterson ◽

Gianluca Della Vedova ◽

Iman Hajirasouliha ◽

...

Keyword(s):

Single Cell ◽

Cancer Progression ◽

False Negative ◽

Fixed Number ◽

Phylogeny Reconstruction ◽

Sequencing Data ◽

Progression Model ◽

Great Specificity ◽

Cell Data ◽

Tumor Phylogeny

Abstract: Background: Cancer progression reconstruction is an important development stemming from the phylogenetics field. In this context, the reconstruction of the phylogeny representing the evolutionary history presents some peculiar aspects that depend on the technology used to obtain the data to analyze: Single Cell DNA Sequencing data have great specificity, but are affected by moderate false negative and missing value rates. Moreover, there has been some recent evidence of back mutations in cancer; this phenomenon is currently widely ignored. Results: We present a new tool, gpps, that reconstructs a tumor phylogeny from Single Cell Sequencing data, allowing each mutation to be lost at most a fixed number of times. The General Parsimony Phylogeny from Single cell (gpps) tool is open source and available at https://github.com/AlgoLab/gpps. Conclusions: gpps provides new insights into the analysis of intra-tumor heterogeneity by proposing a new progression model to the field of cancer phylogeny reconstruction on Single Cell data.

A Multi-Relational Hierarchical Clustering Algorithm Based on Shared Nearest Neighbor Similarity

2007 International Conference on Machine Learning and Cybernetics ◽

10.1109/icmlc.2007.4370836 ◽

2007 ◽

Cited By ~ 1

Author(s):

Jing-Feng Guo ◽

Yu-Yan Zhao ◽

Jing Li

Keyword(s):

Hierarchical Clustering ◽

Clustering Algorithm ◽

Nearest Neighbor ◽

Hierarchical Clustering Algorithm ◽

Shared Nearest Neighbor

Large-Scale Multi-View Subspace Clustering in Linear Time

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i04.5867 ◽

2020 ◽

Vol 34 (04) ◽

pp. 4412-4419 ◽

Cited By ~ 3

Author(s):

Zhao Kang ◽

Wangtao Zhou ◽

Zhitong Zhao ◽

Junming Shao ◽

Meng Han ◽

...

Keyword(s):

Large Scale ◽

State Of The Art ◽

Linear Time ◽

Subspace Clustering ◽

Data Sets ◽

Clustering Methods ◽

Single View ◽

Novel Approach ◽

Points Of View ◽

Effectiveness And Efficiency

A plethora of multi-view subspace clustering (MVSC) methods have been proposed over the past few years. Researchers have managed to boost clustering accuracy from different points of view. However, many state-of-the-art MVSC algorithms, which typically have quadratic or even cubic complexity, are inefficient and inherently difficult to apply at large scales. In the era of big data, the computational issue becomes critical. To fill this gap, we propose a large-scale MVSC (LMVSC) algorithm with linear order complexity. Inspired by the idea of the anchor graph, we first learn a smaller graph for each view. Then, a novel approach is designed to integrate those graphs so that we can implement spectral clustering on a smaller graph. Interestingly, it turns out that our model also applies to the single-view scenario. Extensive experiments on various large-scale benchmark data sets validate the effectiveness and efficiency of our approach with respect to state-of-the-art clustering methods.

destiny: diffusion maps for large-scale single-cell data in R

Bioinformatics ◽

10.1093/bioinformatics/btv715 ◽

2015 ◽

Vol 32 (8) ◽

pp. 1241-1243 ◽

Cited By ~ 225

Author(s):

Philipp Angerer ◽

Laleh Haghverdi ◽

Maren Büttner ◽

Fabian J. Theis ◽

Carsten Marr ◽

...

Keyword(s):

Single Cell ◽

Large Scale ◽

Diffusion Maps ◽

Cell Data

A Joint Deep Learning Model for Simultaneous Batch Effect Correction, Denoising and Clustering in Single-Cell Transcriptomics

10.1101/2020.09.23.310003 ◽

2020 ◽

Cited By ~ 1

Author(s):

Justin Lakkis ◽

David Wang ◽

Yuanchao Zhang ◽

Gang Hu ◽

Kui Wang ◽

...

Keyword(s):

Gene Expression ◽

Deep Learning ◽

Single Cell ◽

Large Scale ◽

Nearest Neighbor ◽

Learning Model ◽

Batch Effect ◽

Marker Genes ◽

Deep Learning Model ◽

Variable Genes

Abstract: Recent developments in single-cell RNA-seq (scRNA-seq) technologies have led to enormous biological discoveries. As the scale of scRNA-seq studies increases, a major challenge in analysis is batch effect, which is inevitable in studies involving human tissues. Most existing methods remove batch effect in a low-dimensional embedding space. Although this is useful for clustering, batch effect is still present in the gene expression space, leaving downstream gene-level analysis susceptible to batch effect. Recent studies have shown that batch effect correction in the gene expression space is much harder than in the embedding space. Popular methods such as Seurat 3.0 rely on the mutual nearest neighbor (MNN) approach to remove batch effect in the gene expression space, but MNN can only analyze two batches at a time and becomes computationally infeasible when the number of batches is large. Here we present CarDEC, a joint deep learning model that simultaneously clusters and denoises scRNA-seq data while correcting batch effect in both the embedding and the gene expression space. Comprehensive evaluations spanning different species and tissues showed that CarDEC consistently outperforms scVI, DCA, and MNN. With CarDEC denoising, non-highly variable genes offer as much signal for clustering as highly variable genes, suggesting that CarDEC substantially boosts the information content of scRNA-seq data. We also showed that trajectory analysis using CarDEC’s denoised and batch-corrected expression as input revealed marker genes and transcription factors that are otherwise obscured in the presence of batch effect. CarDEC is computationally fast, making it a desirable tool for large-scale scRNA-seq studies.

Continuous visualization of differences between biological conditions in single-cell data

10.1101/337485 ◽

2018 ◽

Cited By ~ 1

Author(s):

Tyler J. Burns ◽

Garry P. Nolan ◽

Nikolay Samusik

Keyword(s):

Single Cell ◽

Nearest Neighbor ◽

Developmental Trajectory ◽

Functional Markers ◽

Mass Cytometry ◽

K Nearest Neighbor ◽

Cell Frequency ◽

Low Dimensional ◽

Marker Shift ◽

Cell Data

In high-dimensional single-cell data, comparing changes in functional markers between conditions is typically done across manual or algorithm-derived partitions based on population-defining markers. Visualization of these partitions is commonly done on low-dimensional embeddings (e.g., t-SNE), colored by per-partition changes. Here, we provide an analysis and visualization tool that performs these comparisons across overlapping k-nearest neighbor (KNN) groupings. This allows one to color low-dimensional embeddings by marker changes without hard boundaries imposed by partitioning. We devised an objective optimization of k based on minimizing functional marker KNN imputation error. Proof-of-concept work visualized the exact location of an IL-7 responsive subset in a B cell developmental trajectory on a t-SNE map independent of clustering. Per-condition cell frequency analysis revealed that KNN is sensitive to detecting artifacts due to marker shift, and therefore can also be valuable in a quality control pipeline. Overall, we found that KNN groupings lead to useful multiple condition visualizations and efficiently extract a large amount of information from mass cytometry data. Our software is publicly available through the Bioconductor package Sconify.
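The per-cell comparison across overlapping KNN groupings can be illustrated with a short sketch. This is a plain-Python approximation of the idea as the abstract describes it, not the Sconify package (which is an R/Bioconductor package): the function name, the "treated"/"control" labels, and the toy data are assumptions. For each cell, the code gathers its k nearest neighbors from the pooled conditions and reports the difference in mean functional-marker value between the two conditions inside that neighborhood.

```python
from math import dist
from statistics import mean


def knn_marker_shift(coords, marker, condition, k):
    """Per-cell difference in mean marker value (treated minus control).

    coords: positions in the space of population-defining markers.
    marker: one functional-marker value per cell.
    condition: "treated" or "control" per cell (hypothetical labels).
    Each neighborhood is the cell's k nearest cells, including itself.
    Returns None for a cell whose neighborhood lacks one of the conditions.
    """
    shifts = []
    for p in coords:
        nearest = sorted(range(len(coords)), key=lambda j: dist(p, coords[j]))[:k]
        treated = [marker[j] for j in nearest if condition[j] == "treated"]
        control = [marker[j] for j in nearest if condition[j] == "control"]
        if treated and control:
            shifts.append(mean(treated) - mean(control))
        else:
            shifts.append(None)
    return shifts


# Toy data: the marker responds to treatment in the first region only.
coords = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1)]
condition = ["control", "treated"] * 4
marker = [1.0, 3.0, 1.0, 3.0, 2.0, 2.0, 2.0, 2.0]
shifts = knn_marker_shift(coords, marker, condition, k=4)
```

Because every cell gets its own value, the result can be used directly to color a low-dimensional embedding without first drawing partition boundaries.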
