PARC: ultrafast and accurate clustering of phenotypic data of millions of single cells

Shobana V Stassen; Dickson M D Siu; Kelvin C M Lee; Joshua W K Ho; Hayden K H So; Kevin K Tsia

doi:10.1093/bioinformatics/btaa042

PARC: ultrafast and accurate clustering of phenotypic data of millions of single cells

Bioinformatics ◽ 10.1093/bioinformatics/btaa042 ◽ 2020 ◽ Vol 36 (9) ◽ pp. 2778-2786 ◽ Cited By ~ 5

Author(s): Shobana V Stassen ◽ Dickson M D Siu ◽ Kelvin C M Lee ◽ Joshua W K Ho ◽ Hayden K H So ◽ ...

Keyword(s): Single Cell ◽ Large Scale ◽ Clustering Algorithm ◽ Single Cells ◽ Clustering Algorithms ◽ Cellular Heterogeneity ◽ Supplementary Information ◽ Phenotypic Data ◽ Scalable Algorithm ◽ Cell Data

Abstract. Motivation: New single-cell technologies continue to fuel the explosive growth in the scale of heterogeneous single-cell data. However, existing computational methods are inadequately scalable to large datasets and therefore cannot uncover the complex cellular heterogeneity. Results: We introduce a highly scalable graph-based clustering algorithm, PARC (Phenotyping by Accelerated Refined Community-partitioning), for large-scale, high-dimensional single-cell data (>1 million cells). Using large single-cell flow and mass cytometry, RNA-seq and imaging-based biophysical data, we demonstrate that PARC consistently outperforms state-of-the-art clustering algorithms without subsampling of cells, including Phenograph, FlowSOM and Flock, in terms of both speed and ability to robustly detect rare cell populations. For example, PARC can cluster a single-cell dataset of 1.1 million cells within 13 min, compared with >2 h for the next fastest graph-clustering algorithm. Our work presents a scalable algorithm to cope with increasingly large-scale single-cell analysis. Availability and implementation: https://github.com/ShobiStassen/PARC. Supplementary information: Supplementary data are available at Bioinformatics online.

PARC: ultrafast and accurate clustering of phenotypic data of millions of single cells

10.1101/765628 ◽ 2019

Author(s): Shobana V. Stassen ◽ Dickson M. D. Siu ◽ Kelvin C. M. Lee ◽ Joshua W. K. Ho ◽ Hayden K. H. So ◽ ...

Keyword(s): Single Cell ◽ Large Scale ◽ Clustering Algorithm ◽ Single Cells ◽ Clustering Algorithms ◽ Cell Mass ◽ Cellular Heterogeneity ◽ Phenotypic Data ◽ Data Set ◽ Cell Data

Abstract. Motivation: New single-cell technologies continue to fuel the explosive growth in the scale of heterogeneous single-cell data. However, existing computational methods are inadequately scalable to large datasets and therefore cannot uncover the complex cellular heterogeneity. Results: We introduce a highly scalable graph-based clustering algorithm, PARC (phenotyping by accelerated refined community-partitioning), for ultralarge-scale, high-dimensional single-cell data (>1 million cells). Using large single-cell mass cytometry, RNA-seq and imaging-based biophysical data, we demonstrate that PARC consistently outperforms state-of-the-art clustering algorithms without sub-sampling of cells, including Phenograph, FlowSOM and Flock, in terms of both speed and ability to robustly detect rare cell populations. For example, PARC can cluster a single-cell dataset of 1.1M cells within 13 minutes, compared with >2 hours for the next fastest graph-clustering algorithm, Phenograph. Our work presents a scalable algorithm to cope with increasingly large-scale single-cell analysis. Availability and implementation: https://github.com/ShobiStassen/PARC

Single cell network analysis with a mixture of Nested Effects Models

10.1101/258202 ◽ 2018

Author(s): Martin Pirkl ◽ Niko Beerenwinkel

Keyword(s): Single Cell ◽ New Technologies ◽ Single Cells ◽ R Package ◽ Supplementary Information ◽ Data Sets ◽ Cell Network ◽ A Cell ◽ Supplementary Material ◽ Cell Data

Abstract. Motivation: New technologies allow for the elaborate measurement of different traits of single cells. These data promise to elucidate intra-cellular networks in unprecedented detail and further help to improve treatment of diseases like cancer. However, cell populations can be very heterogeneous. Results: We developed a mixture of Nested Effects Models (M&NEM) for single-cell data to simultaneously identify different cellular sub-populations and their corresponding causal networks to explain the heterogeneity in a cell population. For inference, we assign each cell to a network with a certain probability and iteratively update the optimal networks and cell probabilities in an Expectation Maximization scheme. We validate our method in the controlled setting of a simulation study and apply it to three data sets of pooled CRISPR screens generated previously by two novel experimental techniques, namely Crop-Seq and Perturb-Seq. Availability: The mixture Nested Effects Model (M&NEM) is available as the R-package mnem at https://github.com/cbgethz/mnem/. Contact: [email protected], [email protected]. Supplementary information: Supplementary data are available online.
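The expectation-maximization loop described in the abstract above (soft assignment of each cell to a component, then re-estimation of the components) can be illustrated with a generic toy example. This is a minimal sketch, not the mnem implementation: the components here are one-dimensional Gaussians with a fixed, shared standard deviation rather than Nested Effects Models, and the function name `em_mixture_1d` and the toy data are hypothetical.

```python
import math


def em_mixture_1d(xs, mu_init, sigma=1.0, n_iter=50):
    """Toy EM for a 1D Gaussian mixture with a shared, fixed sigma.

    Illustrative stand-in only: components are Gaussians, not networks.
    """
    mus = list(mu_init)
    k = len(mus)
    weights = [1.0 / k] * k
    resp = []
    for _ in range(n_iter):
        # E-step: probability that each point belongs to each component.
        resp = []
        for x in xs:
            dens = [w * math.exp(-0.5 * ((x - m) / sigma) ** 2)
                    for w, m in zip(weights, mus)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate each component from its weighted points.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj > 0:
                mus[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            weights[j] = nj / len(xs)
    return mus, weights, resp


# Hypothetical measurements from two well-separated sub-populations.
cells = [0.0, 0.2, -0.1, 5.0, 5.1, 4.9]
mus, weights, resp = em_mixture_1d(cells, mu_init=[1.0, 4.0])
```

Each row of `resp` holds one cell's membership probabilities, which play the role of the "cell probabilities" mentioned in the abstract; in the real method the M-step would fit a network per component instead of a mean.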

Scalable Clustering with Supervised Linkage Methods

10.1101/2021.08.01.454697 ◽ 2021

Author(s): James Anibal ◽ Alexandre Day ◽ Erol Bahadiroglu ◽ Liam O'Neill ◽ Long Phan ◽ ...

Keyword(s): Single Cell ◽ Clustering Algorithm ◽ Clustering Algorithms ◽ Biomedical Sciences ◽ New Approach ◽ Scalable Clustering ◽ Linkage Methods ◽ Density Clustering ◽ Cell Data ◽ Different Levels

Data clustering plays a significant role in biomedical sciences, particularly in single-cell data analysis. Researchers use clustering algorithms to group individual cells into populations that can be evaluated across different levels of disease progression, drug response, and other clinical statuses. In many cases, multiple sets of clusters must be generated to assess varying levels of cluster specificity. For example, there are many subtypes of leukocytes (e.g. T cells), whose individual preponderance and phenotype must be assessed for statistical/functional significance. In this report, we introduce a novel hierarchical density clustering algorithm (HAL-x) that uses supervised linkage methods to build a cluster hierarchy on raw single-cell data. With this new approach, HAL-x can quickly predict multiple sets of labels for immense datasets, achieving a considerable improvement in computational efficiency on large datasets compared to existing methods. We also show that cell clusters generated by HAL-x yield near-perfect F1-scores when classifying different clinical statuses based on single-cell profiles. Our hierarchical density clustering algorithm achieves high accuracy in single cell classification in a scalable, tunable and rapid manner. We make HAL-x publicly available at: https://pypi.org/project/hal-x/

Scirpy: a Scanpy extension for analyzing single-cell T-cell receptor-sequencing data

Bioinformatics ◽ 10.1093/bioinformatics/btaa611 ◽ 2020 ◽ Vol 36 (18) ◽ pp. 4817-4818 ◽ Cited By ~ 2

Author(s): Gregor Sturm ◽ Tamas Szabo ◽ Georgios Fotakis ◽ Marlene Haider ◽ Dietmar Rieder ◽ ...

Keyword(s): T Cell ◽ Single Cell ◽ Large Scale ◽ Single Cells ◽ Cell Receptor ◽ Supplementary Information ◽ Sequencing Data ◽ Seamless Integration ◽ T Cell Phenotypes ◽ Cell Phenotypes

Abstract. Summary: Advances in single-cell technologies have enabled the investigation of T-cell phenotypes and repertoires at unprecedented resolution and scale. Bioinformatic methods for the efficient analysis of these large-scale datasets are instrumental for advancing our understanding of adaptive immune responses. However, while well-established solutions are accessible for the processing of single-cell transcriptomes, no streamlined pipelines are available for the comprehensive characterization of T-cell receptors. Here, we propose single-cell immune repertoires in Python (Scirpy), a scalable Python toolkit that provides simplified access to the analysis and visualization of immune repertoires from single cells and seamless integration with transcriptomic data. Availability and implementation: Scirpy source code and documentation are available at https://github.com/icbi-lab/scirpy. Supplementary information: Supplementary data are available at Bioinformatics online.

Deep soft K-means clustering with self-training for single-cell RNA sequence data

NAR Genomics and Bioinformatics ◽ 10.1093/nargab/lqaa039 ◽ 2020 ◽ Vol 2 (2) ◽ Cited By ~ 2

Author(s): Liang Chen ◽ Weinan Wang ◽ Yuyao Zhai ◽ Minghua Deng

Keyword(s): Deep Learning ◽ Single Cell ◽ Large Scale ◽ Sequence Data ◽ Dimensional Space ◽ Expression Profiles ◽ Single Cells ◽ Clustering Algorithms ◽ Training Procedure ◽ Latent Space

Abstract. Single-cell RNA sequencing (scRNA-seq) allows researchers to study cell heterogeneity at the cellular level. A crucial step in analyzing scRNA-seq data is to cluster cells into subpopulations to facilitate subsequent downstream analysis. However, frequent dropout events and the increasing size of scRNA-seq data make clustering such high-dimensional, sparse and massive transcriptional expression profiles challenging. Although some existing deep learning-based clustering algorithms for single cells combine dimensionality reduction with clustering, they either ignore the distance and affinity constraints between similar cells or make additional latent space assumptions, such as a Gaussian mixture distribution, and so fail to learn a cluster-friendly low-dimensional space. Therefore, in this paper, we use a denoising autoencoder to characterize scRNA-seq data and propose a soft self-training K-means algorithm to cluster the cell population in the learned latent space. The self-training procedure can effectively aggregate similar cells and pursue a more cluster-friendly latent space. Our method, called ‘scziDesk’, alternately performs data compression, data reconstruction and soft clustering, and the results exhibit excellent compatibility and robustness in both simulated and real data. Moreover, our proposed method scales well with the number of cells on large-scale datasets.
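The "soft" clustering step named in the abstract above can be sketched in isolation. The following is a generic soft K-means on plain vectors, assuming a softmax over negative squared distances as the assignment rule; it omits the denoising autoencoder and self-training entirely and is not the scziDesk code. The names `soft_assign`, `soft_kmeans` and the stiffness parameter `beta` are hypothetical.

```python
import math


def soft_assign(point, centroids, beta=1.0):
    """Return membership weights: softmax of -beta * squared distance."""
    d2 = [sum((p - c) ** 2 for p, c in zip(point, cen)) for cen in centroids]
    d_min = min(d2)  # subtract the minimum for numerical stability
    ws = [math.exp(-beta * (d - d_min)) for d in d2]
    total = sum(ws)
    return [w / total for w in ws]


def soft_kmeans(points, centroids, beta=1.0, n_iter=20):
    """Alternate soft assignment and weighted-mean centroid updates."""
    centroids = [list(c) for c in centroids]
    dim = len(points[0])
    for _ in range(n_iter):
        resp = [soft_assign(p, centroids, beta) for p in points]
        for j in range(len(centroids)):
            total = sum(r[j] for r in resp)
            if total == 0:
                continue  # leave an empty cluster where it is
            centroids[j] = [
                sum(r[j] * p[d] for r, p in zip(resp, points)) / total
                for d in range(dim)
            ]
    memberships = [soft_assign(p, centroids, beta) for p in points]
    return centroids, memberships


# Hypothetical 2D latent coordinates for four cells.
latent = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers, memberships = soft_kmeans(latent, [(1.0, 1.0), (9.0, 9.0)])
```

A larger `beta` makes the assignments sharper and closer to ordinary K-means; a smaller one lets each cell contribute to several centroids.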

dropClust: Efficient clustering of ultra-large scRNA-seq data

10.1101/170308 ◽ 2017 ◽ Cited By ~ 2

Author(s): Debajyoti Sinha ◽ Akhilesh Kumar ◽ Himanshu Kumar ◽ Sanghamitra Bandyopadhyay ◽ Debarka Sengupta

Keyword(s): Single Cell ◽ Large Scale ◽ Best Practice ◽ Clustering Algorithm ◽ Nearest Neighbor ◽ De Novo ◽ Single Cells ◽ Nearest Neighbor Search ◽ Locality Sensitive Hashing ◽ Clustering Methods

Abstract. Droplet-based single-cell transcriptomics has recently enabled parallel screening of tens of thousands of single cells. Clustering methods that scale for such high-dimensional data without compromising accuracy are scarce. We exploit Locality Sensitive Hashing, an approximate nearest neighbor search technique, to develop a de novo clustering algorithm for large-scale single-cell data. On a number of real datasets, dropClust outperformed the existing best-practice methods in terms of execution time, clustering accuracy and detectability of minor cell sub-types.
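To make the idea of Locality Sensitive Hashing in the abstract above concrete, here is a minimal sketch of one common LSH family, random-hyperplane (sign) hashing, used to collect candidate neighbors without comparing every pair of cells. It is an assumption that this family is a fair illustration; the abstract does not say which LSH variant dropClust uses, and all function names below are hypothetical.

```python
import random
from collections import defaultdict


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random hyperplane normals with Gaussian entries."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def lsh_signature(vec, planes):
    """One bit per hyperplane: which side of the plane the vector lies on."""
    return tuple(
        1 if sum(v * w for v, w in zip(vec, plane)) >= 0 else 0
        for plane in planes
    )


def build_buckets(vectors, planes):
    """Group vector indices by signature; similar directions tend to collide."""
    buckets = defaultdict(list)
    for idx, vec in enumerate(vectors):
        buckets[lsh_signature(vec, planes)].append(idx)
    return buckets


def candidate_neighbors(query, planes, buckets):
    """Return indices sharing the query's bucket (approximate neighbors)."""
    return buckets.get(lsh_signature(query, planes), [])


# Hypothetical expression vectors: cell 1 is a scaled copy of cell 0,
# cell 2 points in the opposite direction.
cells = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [-1.0, -2.0, -3.0]]
planes = make_hyperplanes(dim=3, n_bits=8)
buckets = build_buckets(cells, planes)
hits = candidate_neighbors([4.0, 8.0, 12.0], planes, buckets)
```

Only the cells in the query's bucket need an exact distance computation afterwards, which is where the speed-up over exhaustive nearest neighbor search comes from. In practice several hash tables are usually combined to reduce missed neighbors.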

Benchmarking PSM identification tools for single cell proteomics

10.1101/2021.08.17.456676 ◽ 2021

Author(s): Daisha Van Der Watt ◽ Hannah Boekweg ◽ Thy Truong ◽ Amanda J Guise ◽ Edward D Plowey ◽ ...

Keyword(s): Machine Learning ◽ Single Cell ◽ Single Cells ◽ Peptide Identification ◽ Machine Learning Algorithms ◽ Cellular Heterogeneity ◽ Proteomics Data ◽ Improve Performance ◽ False Discovery ◽ Cell Data

Abstract. Single-cell proteomics is an emerging sub-field within proteomics with the potential to revolutionize our understanding of cellular heterogeneity and interactions. Recent efforts have largely focused on technological advancements in sample preparation, chromatography and instrumentation to enable measuring proteins present in these ultra-limited samples. Although advancements in data acquisition have rapidly improved our ability to analyze single cells, the software pipelines used in data analysis were originally written for traditional bulk samples and their performance on single-cell data has not been investigated. We benchmarked five popular peptide identification tools on single-cell proteomics data. We found that MetaMorpheus achieved the greatest number of peptide spectrum matches at a 1% false discovery rate. Depending on the tool, we also found that post-processing machine learning can improve spectrum identification results by up to ∼40%. Although rescoring leads to a greater number of peptide spectrum matches, these new results are typically generated by third-party tools and cannot be used by the primary pipeline for quantification. Exploration of novel metrics for machine learning algorithms will continue to improve performance.

rCASC: reproducible Classification Analysis of Single Cell sequencing data

10.1101/430967 ◽ 2018 ◽ Cited By ~ 1

Author(s): Luca Alessandrì ◽ Marco Beccuti ◽ Maddalena Arigoni ◽ Martina Olivero ◽ Greta Romano ◽ ...

Keyword(s): Single Cell ◽ Single Cells ◽ R Package ◽ Cellular Heterogeneity ◽ Supplementary Information ◽ Sequencing Data ◽ Single Cell Sequencing ◽ Analysis Workflow ◽ User Friendly ◽ Bioinformatics Workflows

Abstract. Summary: Single-cell RNA sequencing has emerged as an essential tool to investigate cellular heterogeneity and to highlight cell sub-population specific signatures. Dedicated and user-friendly bioinformatics workflows are now required to exploit the deconvolution of single-cell transcriptomes. Furthermore, there is a growing need for bioinformatics workflows granting both functional reproducibility, i.e. saving information about data and analysis parameters, and computational reproducibility, i.e. storing the real image of the computation environment. Here, we present rCASC, a modular RNAseq analysis workflow allowing data analysis from counts generation to identification of cell sub-population signatures, granting both functional and computational reproducibility. Availability and implementation: rCASC is part of the reproducible bioinformatics project. rCASC is a docker-based application controlled by an R package available at https://github.com/kendomaniac/rCASC. Supplementary information: Supplementary data are available at rCASC github

Single-Cell Transcriptomics Unveils Gene Regulatory Network Plasticity

10.1101/446104 ◽ 2018 ◽ Cited By ~ 1

Author(s): Giovanni Iacono ◽ Ramon Massoni-Badosa ◽ Holger Heyn

Keyword(s): Single Cell ◽ Regulatory Network ◽ Regulatory Networks ◽ Large Scale ◽ Differential Expression Analysis ◽ Cellular Heterogeneity ◽ Computational Framework ◽ Holistic View ◽ Regulatory Changes ◽ Cell Data

Summary. Single-cell RNA sequencing (scRNA-seq) plays a pivotal role in our understanding of cellular heterogeneity. Current analytical workflows are driven by categorizing principles that consider cells as individual entities and classify them into complex taxonomies. We have devised a conceptually different computational framework based on a holistic view, where single-cell datasets are used to infer global, large-scale regulatory networks. We developed correlation metrics that are specifically tailored to single-cell data, and then generated, validated and interpreted single-cell-derived regulatory networks from organs and perturbed systems, such as diabetes and Alzheimer’s disease. Using advanced tools from graph theory, we computed an unbiased quantification of a gene’s biological relevance, and accurately pinpointed key players in organ function and drivers of diseases. Our approach detected multiple latent regulatory changes that are invisible to single-cell workflows based on clustering or differential expression analysis. In summary, we have established the feasibility and value of regulatory network analysis using scRNA-seq datasets, which significantly broadens the biological insights that can be obtained with this leading technology.

scMatch: a single-cell gene expression profile annotation tool using reference datasets

Bioinformatics ◽ 10.1093/bioinformatics/btz292 ◽ 2019 ◽ Vol 35 (22) ◽ pp. 4688-4695 ◽ Cited By ~ 22

Author(s): Rui Hou ◽ Elena Denisenko ◽ Alistair R R Forrest

Keyword(s): Gene Expression ◽ Single Cell ◽ Large Scale ◽ Expression Profiles ◽ Single Cells ◽ Gene Expression Profiles ◽ Supplementary Information ◽ Annotation Tool ◽ Sequencing Data ◽ Multiple Sources

Abstract. Motivation: Single-cell RNA sequencing (scRNA-seq) measures gene expression at the resolution of individual cells. Massively multiplexed single-cell profiling has enabled large-scale transcriptional analyses of thousands of cells in complex tissues. In most cases, the true identity of individual cells is unknown and needs to be inferred from the transcriptomic data. Existing methods typically cluster (group) cells based on similarities of their gene expression profiles and assign the same identity to all cells within each cluster using the averaged expression levels. However, scRNA-seq experiments typically produce low-coverage sequencing data for each cell, which hinders the clustering process. Results: We introduce scMatch, which directly annotates single cells by identifying their closest match in large reference datasets. We used this strategy to annotate various single-cell datasets and evaluated the impacts of sequencing depth, similarity metric and reference datasets. We found that scMatch can rapidly and robustly annotate single cells with comparable accuracy to another recent cell annotation tool (SingleR), but that it is quicker and can handle larger reference datasets. We demonstrate how scMatch can handle large customized reference gene expression profiles that combine data from multiple sources, thus empowering researchers to identify cell populations in any complex tissue with the desired precision. Availability and implementation: scMatch (Python code) and the FANTOM5 reference dataset are freely available to the research community at https://github.com/forrest-lab/scMatch. Supplementary information: Supplementary data are available at Bioinformatics online.
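The "closest match in a reference dataset" strategy described in the abstract above can be sketched as a nearest-profile lookup. The example below assumes Spearman rank correlation as the similarity metric, which is only one of the choices the abstract alludes to, and it is not the scMatch code; the helper names, the reference labels and the four-gene profiles are all hypothetical.

```python
def ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(a, b):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / (var_a * var_b) ** 0.5


def spearman(a, b):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))


def annotate(query, references):
    """Return (label, score) of the reference profile most similar to query."""
    best_label, best_score = None, float("-inf")
    for label, profile in references.items():
        score = spearman(query, profile)
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score


# Hypothetical reference profiles over the same four genes.
references = {
    "T cell": [9.0, 8.0, 0.5, 0.1],
    "B cell": [0.2, 0.4, 7.0, 9.0],
}
label, score = annotate([5.0, 4.0, 0.0, 0.3], references)
```

Because each cell is compared directly with reference profiles, no clustering step is needed before annotation; a rank-based metric is used here so that differences in sequencing depth between query and reference matter less.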
