Scalable Clustering with Supervised Linkage Methods

Abstract. Motivation: New single-cell technologies continue to fuel the explosive growth in the scale of heterogeneous single-cell data. However, existing computational methods do not scale adequately to large datasets and therefore cannot uncover the complex cellular heterogeneity. Results: We introduce a highly scalable graph-based clustering algorithm, PARC (Phenotyping by Accelerated Refined Community-partitioning), for large-scale, high-dimensional single-cell data (>1 million cells). Using large single-cell flow and mass cytometry, RNA-seq and imaging-based biophysical data, we demonstrate that PARC, without subsampling of cells, consistently outperforms state-of-the-art clustering algorithms, including Phenograph, FlowSOM and Flock, in terms of both speed and ability to robustly detect rare cell populations. For example, PARC can cluster a single-cell dataset of 1.1 million cells within 13 min, compared with >2 h for the next fastest graph-clustering algorithm. Our work presents a scalable algorithm to cope with increasingly large-scale single-cell analysis. Availability and implementation: https://github.com/ShobiStassen/PARC. Supplementary information: Supplementary data are available at Bioinformatics online.

PARC: ultrafast and accurate clustering of phenotypic data of millions of single cells

10.1101/765628 ◽

2019 ◽

Author(s):

Shobana V. Stassen ◽

Dickson M. D. Siu ◽

Kelvin C. M. Lee ◽

Joshua W. K. Ho ◽

Hayden K. H. So ◽

...

Keyword(s):

Single Cell ◽

Large Scale ◽

Clustering Algorithm ◽

Single Cells ◽

Clustering Algorithms ◽

Cell Mass ◽

Cellular Heterogeneity ◽

Phenotypic Data ◽

Data Set ◽

Cell Data

Abstract. Motivation: New single-cell technologies continue to fuel the explosive growth in the scale of heterogeneous single-cell data. However, existing computational methods do not scale adequately to large datasets and therefore cannot uncover the complex cellular heterogeneity. Results: We introduce a highly scalable graph-based clustering algorithm, PARC (phenotyping by accelerated refined community-partitioning), for ultralarge-scale, high-dimensional single-cell data (>1 million cells). Using large single-cell mass cytometry, RNA-seq and imaging-based biophysical data, we demonstrate that PARC, without sub-sampling of cells, consistently outperforms state-of-the-art clustering algorithms, including Phenograph, FlowSOM, and Flock, in terms of both speed and ability to robustly detect rare cell populations. For example, PARC can cluster a single-cell data set of 1.1M cells within 13 minutes, compared with >2 hours for the next fastest graph-clustering algorithm, Phenograph. Our work presents a scalable algorithm to cope with increasingly large-scale single-cell analysis. Availability and implementation: https://github.com/ShobiStassen/PARC
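
The abstract states only that PARC is graph-based and partitions communities; it does not describe the individual steps. As a rough illustration of the general idea behind graph-based clustering, and not of PARC's own implementation, the sketch below builds a brute-force k-nearest-neighbour graph and labels its connected components. The names `knn_graph` and `graph_clusters` are hypothetical, the brute-force search is an assumption that would not scale to millions of cells, and connected components stand in for a real community-detection step.

```python
import math


def knn_graph(points, k):
    """Return, for each point, the indices of its k nearest neighbours.

    Brute force (quadratic time); a large-scale method would use an
    approximate nearest-neighbour index instead. Illustrative only.
    """
    neighbours = []
    for i, p in enumerate(points):
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        neighbours.append(others[:k])
    return neighbours


def graph_clusters(points, k):
    """Label points by connected component of the kNN graph.

    Connected components are a simple stand-in for community detection.
    """
    parent = list(range(len(points)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, nbrs in enumerate(knn_graph(points, k)):
        for j in nbrs:
            parent[find(i)] = find(j)

    # Map each component root to a compact label in order of first appearance.
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(len(points))]
```

On two small, well-separated groups of points with `k=2`, every neighbour list stays inside its own group, so the graph splits into two components and each group receives its own label.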

Impact of data preprocessing on cell-type clustering based on single-cell RNA-seq data

BMC Bioinformatics ◽

10.1186/s12859-020-03797-8 ◽

2020 ◽

Vol 21 (1) ◽

Author(s):

Chunxiang Wang ◽

Xin Gao ◽

Juntao Liu

Keyword(s):

Single Cell ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Data Preprocessing ◽

Cell Types ◽

RNA Seq ◽

Cell Type ◽

Preprocessing Method ◽

Cell Clustering ◽

Cell Gene Expression

Abstract. Background: Advances in single-cell RNA-seq technology have led to great opportunities for the quantitative characterization of cell types, and many clustering algorithms have been developed based on single-cell gene expression. However, we found that different data preprocessing methods affect clustering algorithms quite differently. Moreover, no single preprocessing method is applicable to all clustering algorithms, and even for the same clustering algorithm, the best preprocessing method depends on the input data. Results: We designed a graph-based algorithm, SC3-e, specifically for identifying the best data preprocessing method for SC3, which is currently the most widely used algorithm for single-cell clustering. When tested on eight frequently used single-cell RNA-seq data sets, SC3-e always accurately selects the best data preprocessing method for SC3 and therefore greatly enhances the clustering performance of SC3. Conclusion: The SC3-e algorithm is effective in practice at identifying the best data preprocessing method, and therefore substantially improves the cell-type clustering performance of SC3. It is expected to play a crucial role in studies that rely on single-cell clustering, such as studies of complex human diseases and the discovery of new cell types.

Genetic Programming for Evolving Similarity Functions for Clustering: Representations and Analysis

Evolutionary Computation ◽

10.1162/evco_a_00264 ◽

2020 ◽

Vol 28 (4) ◽

pp. 531-561 ◽

Cited By ~ 1

Author(s):

Andrew Lensen ◽

Bing Xue ◽

Mengjie Zhang

Keyword(s):

Genetic Programming ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Similarity Measures ◽

Small Subset ◽

Similarity Functions ◽

New Approach ◽

Performance Improvements ◽

Consistent Performance ◽

High Dimensional Datasets

Clustering is a difficult and widely studied data mining task, with many varieties of clustering algorithms proposed in the literature. Nearly all algorithms use a similarity measure such as a distance metric (e.g., Euclidean distance) to decide which instances to assign to the same cluster. These similarity measures are generally predefined and cannot be easily tailored to the properties of a particular dataset, which leads to limitations in the quality and the interpretability of the clusters produced. In this article, we propose a new approach to automatically evolving similarity functions for a given clustering algorithm by using genetic programming. We introduce a new genetic programming-based method which automatically selects a small subset of features (feature selection) and then combines them using a variety of functions (feature construction) to produce dynamic and flexible similarity functions that are specifically designed for a given dataset. We demonstrate how the evolved similarity functions can be used to perform clustering using a graph-based representation. The results of a variety of experiments across a range of large, high-dimensional datasets show that the proposed approach can achieve higher and more consistent performance than the benchmark methods. We further extend the proposed approach to automatically produce multiple complementary similarity functions by using a multi-tree approach, which gives further performance improvements. We also analyse the interpretability and structure of the automatically evolved similarity functions to provide insight into how and why they are superior to standard distance metrics.

A case study on the detailed reproducibility of a human cell atlas project

10.1101/467993 ◽

2018 ◽

Author(s):

Kui Hua ◽

Xuegong Zhang

Keyword(s):

Single Cell ◽

Human Cell ◽

Scientific Discovery ◽

Cell Types ◽

The Future ◽

Reproduction Study ◽

High Flexibility ◽

Cell Data ◽

Different Levels

Abstract. Reproducibility is a defining feature of a scientific discovery. Reproducibility can be at different levels for different types of study. The purpose of the Human Cell Atlas (HCA) project is to build maps of molecular signatures of all human cell types and states to serve as references for future discoveries. Constructing such a complex reference atlas must involve the assembly and aggregation of data from multiple labs, probably generated with different technologies. It therefore has much higher requirements on reproducibility than individual research projects. To add another layer of complexity, the bioinformatics procedures involved for single-cell data have high flexibility and diversity. There are many factors in the processing and analysis of single-cell RNA-seq data that can shape the final results in different ways. To study what levels of reproducibility can be reached in current practices, we conducted a detailed reproduction study of a well-documented recent publication on the atlas of human blood dendritic cells, as an example, to break down the bioinformatics steps and factors that are crucial for reproducibility at different levels. We found that the major scientific discovery can be well reproduced after some effort, but there are also differences in some details that may cause uncertainty in future reference. This study provides a detailed case observation for the ongoing discussions of the type of standards the HCA community should adopt when releasing data and publications to guarantee the reproducibility and reliability of the future atlas.

Genetic Programming for Evolving Similarity Functions for Clustering: Representations and Analysis

10.26686/wgtn.13058777 ◽

2020 ◽

Author(s):

Andrew Lensen ◽

Bing Xue ◽

Mengjie Zhang

Keyword(s):

Genetic Programming ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Similarity Measures ◽

Small Subset ◽

Similarity Functions ◽

New Approach ◽

Performance Improvements ◽

Consistent Performance ◽

High Dimensional Datasets

Clustering is a difficult and widely studied data mining task, with many varieties of clustering algorithms proposed in the literature. Nearly all algorithms use a similarity measure such as a distance metric (e.g., Euclidean distance) to decide which instances to assign to the same cluster. These similarity measures are generally predefined and cannot be easily tailored to the properties of a particular dataset, which leads to limitations in the quality and the interpretability of the clusters produced. In this article, we propose a new approach to automatically evolving similarity functions for a given clustering algorithm by using genetic programming. We introduce a new genetic programming-based method which automatically selects a small subset of features (feature selection) and then combines them using a variety of functions (feature construction) to produce dynamic and flexible similarity functions that are specifically designed for a given dataset. We demonstrate how the evolved similarity functions can be used to perform clustering using a graph-based representation. The results of a variety of experiments across a range of large, high-dimensional datasets show that the proposed approach can achieve higher and more consistent performance than the benchmark methods. We further extend the proposed approach to automatically produce multiple complementary similarity functions by using a multi-tree approach, which gives further performance improvements. We also analyse the interpretability and structure of the automatically evolved similarity functions to provide insight into how and why they are superior to standard distance metrics.

Comparison Between UMAP and t-SNE for Multiplex-Immunofluorescence Derived Single-Cell Data from Tissue Sections

10.1101/549659 ◽

2019 ◽

Cited By ~ 1

Author(s):

Duoduo Wu ◽

Joe Yeong Poh Sheng ◽

Grace Tan Su-En ◽

Marion Chevrier ◽

Josh Loh Jie Hua ◽

...

Keyword(s):

Single Cell ◽

Clustering Algorithm ◽

Cell Types ◽

Immune Markers ◽

Tissue Samples ◽

Tissue Sections ◽

Reduced Dimensions ◽

Dimensionality Reduction Technique ◽

Cell Data ◽

Worse Prognosis

Abstract. Using human hepatocellular carcinoma (HCC) tissue samples stained with seven immune markers including one nuclear counterstain, we compared and evaluated the use of a new dimensionality reduction technique called Uniform Manifold Approximation and Projection (UMAP) as an alternative to t-Distributed Stochastic Neighbor Embedding (t-SNE) in analysing multiplex-immunofluorescence (mIF) derived single-cell data. We adopted an unsupervised clustering algorithm called FlowSOM to identify eight major cell types present in human HCC tissues. UMAP and t-SNE were run independently on the dataset to qualitatively compare the distribution of clustered cell types in both reduced dimensions. Our comparison shows that UMAP is superior in runtime. Both techniques provide similar arrangements of cell clusters, with the key difference being UMAP's extensive characteristic branching. Most interestingly, UMAP's branching was able to highlight biological lineages, especially in identifying potential hybrid tumour cells (HTC). Survival analysis shows that patients with a higher proportion of HTC have a worse prognosis (p-value = 0.019). We conclude that both techniques are similar in their visualisation capabilities, but UMAP has a clear advantage over t-SNE in runtime, making it highly plausible to employ UMAP as an alternative to t-SNE in mIF data analysis.

Metagenome sequence clustering with hash-based canopies

Journal of Bioinformatics and Computational Biology ◽

10.1142/s0219720017400066 ◽

2017 ◽

Vol 15 (06) ◽

pp. 1740006 ◽

Cited By ~ 6

Author(s):

Mohammad Arifur Rahman ◽

Nathan LaPierre ◽

Huzefa Rangwala ◽

Daniel Barbara

Keyword(s):

Large Scale ◽

Clustering Algorithm ◽

State Of The Art ◽

Sequence Data ◽

Clustering Algorithms ◽

Clustering Methods ◽

Operational Taxonomic Units ◽

Sequence Clustering ◽

Scalable Clustering ◽

Metagenome Sequence

Metagenomics is the collective sequencing of co-existing microbial communities, which are ubiquitous across various clinical and ecological environments. Due to the large volume of random short sequences (reads) obtained from community sequencing, analyzing the diversity, abundance and functions of the different organisms within these communities is challenging. We present a fast and scalable clustering algorithm for analyzing large-scale metagenome sequence data. Our approach achieves efficiency by partitioning the large number of sequence reads into groups (called canopies) using hashing. These canopies are then refined by using state-of-the-art sequence clustering algorithms. This canopy-clustering (CC) algorithm can be used as a pre-processing phase for computationally expensive clustering algorithms. We use and compare three hashing schemes for canopy construction with five popular and state-of-the-art sequence clustering methods. We evaluate our clustering algorithm on synthetic and real-world 16S and whole metagenome benchmarks. We demonstrate the ability of our proposed approach to determine meaningful Operational Taxonomic Units (OTU) and observe significant speedup with regard to run time when compared to different clustering algorithms. We also make our source code publicly available on GitHub.
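
The abstract does not name its three hashing schemes, so the following is only a minimal sketch of the canopy idea under stated assumptions: each read is reduced to its set of k-mers, a small min-hash signature is computed with salted CRC32 (an illustrative choice, not the paper's), and reads sharing a signature fall into the same canopy. The function names `kmers`, `canopy_key` and `build_canopies` and the defaults `k=4`, `n_hashes=2` are hypothetical.

```python
import zlib
from collections import defaultdict


def kmers(seq, k):
    """Return the set of k-mers in a read (the read itself if shorter than k)."""
    if len(seq) < k:
        return {seq}
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def canopy_key(seq, k=4, n_hashes=2):
    """Min-hash signature of a read: the smallest salted CRC32 per hash function.

    Salted CRC32 is an assumption made for this sketch; any family of
    hash functions could be substituted.
    """
    return tuple(
        min(zlib.crc32(f"{salt}:{km}".encode()) for km in kmers(seq, k))
        for salt in range(n_hashes)
    )


def build_canopies(seqs, k=4, n_hashes=2):
    """Group read indices by signature; each group is one canopy."""
    canopies = defaultdict(list)
    for idx, seq in enumerate(seqs):
        canopies[canopy_key(seq, k, n_hashes)].append(idx)
    return list(canopies.values())
```

Each canopy could then be handed separately to a more expensive sequence clustering method, which is the pre-processing role the abstract describes. Reads with identical k-mer sets always share a canopy; reads with overlapping k-mer sets share one only with some probability, which is the trade-off that `n_hashes` controls in this sketch.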

mbkmeans: Fast clustering for single cell data using mini-batch k-means

PLoS Computational Biology ◽

10.1371/journal.pcbi.1008625 ◽

2021 ◽

Vol 17 (1) ◽

pp. e1008625

Author(s):

Stephanie C. Hicks ◽

Ruoxi Liu ◽

Yuwei Ni ◽

Elizabeth Purdom ◽

Davide Risso

Keyword(s):

Single Cell ◽

Clustering Algorithms ◽

Large Datasets ◽

Clustering Methods ◽

Cell Clustering ◽

Genome Wide ◽

Data Representations ◽

Computing Performance ◽

Cell Data ◽

Genome Wide Gene Expression

Single-cell RNA-Sequencing (scRNA-seq) is the most widely used high-throughput technology to measure genome-wide gene expression at the single-cell level. One of the most common analyses of scRNA-seq data detects distinct subpopulations of cells through the use of unsupervised clustering algorithms. However, recent advances in scRNA-seq technologies result in current datasets ranging from thousands to millions of cells. Popular clustering algorithms, such as k-means, typically require the data to be loaded entirely into memory and therefore can be slow or impossible to run with large datasets. To address this problem, we developed the mbkmeans R/Bioconductor package, an open-source implementation of the mini-batch k-means algorithm. Our package allows for on-disk data representations, such as the common HDF5 file format widely used for single-cell data, that do not require all the data to be loaded into memory at one time. We demonstrate the performance of the mbkmeans package using large datasets, including one with 1.3 million cells. We also highlight and compare the computing performance of mbkmeans against the standard implementation of k-means and other popular single-cell clustering methods. Our software package is available in Bioconductor at https://bioconductor.org/packages/mbkmeans.
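
To make the mini-batch idea concrete, here is a minimal pure-Python sketch. It is not the mbkmeans API (mbkmeans is an R/Bioconductor package); the function names, the parameters `batch_size`, `n_iter` and `init`, and the per-centre running-mean update are assumptions based on the commonly used mini-batch k-means formulation. Only one small batch is touched per iteration, which is what allows the full data to stay on disk in a real implementation.

```python
import random


def nearest(centers, p):
    """Index of the centre closest to point p (squared Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda j: sum((c - x) ** 2 for c, x in zip(centers[j], p)),
    )


def mini_batch_kmeans(points, k, batch_size=32, n_iter=100, seed=0, init=None):
    """Illustrative mini-batch k-means over an in-memory list of points.

    `init` optionally supplies starting centres; otherwise k points are
    sampled at random.
    """
    rng = random.Random(seed)
    if init is not None:
        centers = [list(c) for c in init]
    else:
        centers = [list(p) for p in rng.sample(points, k)]
    counts = [0] * k  # number of points each centre has absorbed so far

    for _ in range(n_iter):
        # Draw a small random batch instead of scanning the whole data set.
        batch = [points[rng.randrange(len(points))] for _ in range(batch_size)]
        # Assign the whole batch first, then update the centres.
        assigned = [nearest(centers, p) for p in batch]
        for p, j in zip(batch, assigned):
            counts[j] += 1
            eta = 1.0 / counts[j]  # per-centre learning rate
            centers[j] = [(1 - eta) * c + eta * x for c, x in zip(centers[j], p)]
    return centers
```

With this learning rate each centre is the running mean of every point it has been assigned, so early batches move the centres a lot and later batches only refine them.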

Model-based branching point detection in single-cell data by K-Branches clustering

10.1101/094532 ◽

2016 ◽

Author(s):

Nikolaos K. Chlis ◽

F. Alexander Wolf ◽

Fabian J. Theis

Keyword(s):

Single Cell ◽

Clustering Algorithm ◽

RNA Seq ◽

Mouse Blastocyst ◽

Blastocyst Development ◽

Gap Statistic ◽

Branching Point ◽

Point Detection ◽

Lineage Trees ◽

Cell Data

Motivation: Identifying heterogeneities in cell populations with single-cell technologies such as single-cell RNA-Seq enables inference of cellular development and lineage trees. Several methods have been proposed for such inference from high-dimensional single-cell data. They typically assign each cell to a branch in a differentiation trajectory. However, they commonly assume specific geometries such as tree-like developmental hierarchies and lack statistically sound methods to decide on the number of branching events. Results: We present K-Branches, a solution to the above problem that locally fits half-lines to single-cell data, introducing a clustering algorithm similar to K-Means. These half-lines are proxies for branches in the differentiation trajectory of cells. We propose a modified version of the GAP statistic for model selection, in order to decide on the number of lines that best describe the data locally. In this manner, we identify the location and number of subgroups of cells that are associated with branching events and full differentiation, respectively. We evaluate the performance of our method on single-cell RNA-Seq data describing the differentiation of myeloid progenitors during hematopoiesis, single-cell qPCR data of mouse blastocyst development and artificial data. Availability: An R implementation of K-Branches is freely available at https://github.com/theislab/[email protected]
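
Clustering with half-lines instead of centroids needs a point-to-half-line distance in the assignment step. The sketch below shows one plausible version of that step and is an assumption, not the authors' R implementation: all half-lines share a common origin, the projection of a point onto a half-line is clamped so it never falls behind the origin, and each point is assigned to its closest half-line. The names `dist_to_half_line` and `assign_to_half_lines` are hypothetical.

```python
import math


def dist_to_half_line(p, origin, direction):
    """Euclidean distance from point p to the half-line origin + t * direction, t >= 0."""
    norm = math.sqrt(sum(d * d for d in direction))
    unit = [d / norm for d in direction]
    offset = [a - b for a, b in zip(p, origin)]
    # Projection length along the half-line, clamped at the origin.
    t = max(0.0, sum(a * b for a, b in zip(offset, unit)))
    closest = [o + t * u for o, u in zip(origin, unit)]
    return math.dist(p, closest)


def assign_to_half_lines(points, origin, directions):
    """K-Means-style assignment step: index of the nearest half-line for each point."""
    return [
        min(
            range(len(directions)),
            key=lambda j: dist_to_half_line(p, origin, directions[j]),
        )
        for p in points
    ]
```

A full algorithm of this kind would alternate the assignment step with re-estimating the origin and directions; the abstract does not specify how that update is done, so it is left out here.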