SCHNEL: scalable clustering of high dimensional single-cell data

Tamim Abdelaal; Paul de Raadt; Boudewijn P F Lelieveldt; Marcel J T Reinders; Ahmed Mahfouz

doi:10.1093/bioinformatics/btaa816

SCHNEL: scalable clustering of high dimensional single-cell data

Bioinformatics ◽

10.1093/bioinformatics/btaa816 ◽

2020 ◽

Vol 36 (Supplement_2) ◽

pp. i849-i856

Author(s):

Tamim Abdelaal ◽

Paul de Raadt ◽

Boudewijn P F Lelieveldt ◽

Marcel J T Reinders ◽

Ahmed Mahfouz

Keyword(s):

Single Cell ◽

Graph Clustering ◽

Original Data ◽

Large Datasets ◽

Supplementary Information ◽

High Dimensional ◽

Sequencing Data ◽

Novel Approach ◽

Cellular Markers ◽

Cell Data

Abstract Motivation Single cell data measures multiple cellular markers at the single-cell level for thousands to millions of cells. Identification of distinct cell populations is a key step for further biological understanding, usually performed by clustering this data. Dimensionality reduction based clustering tools are either not scalable to large datasets containing millions of cells, or not fully automated requiring an initial manual estimation of the number of clusters. Graph clustering tools provide automated and reliable clustering for single cell data, but suffer heavily from scalability to large datasets. Results We developed SCHNEL, a scalable, reliable and automated clustering tool for high-dimensional single-cell data. SCHNEL transforms large high-dimensional data to a hierarchy of datasets containing subsets of data points following the original data manifold. The novel approach of SCHNEL combines this hierarchical representation of the data with graph clustering, making graph clustering scalable to millions of cells. Using seven different cytometry datasets, SCHNEL outperformed three popular clustering tools for cytometry data, and was able to produce meaningful clustering results for datasets of 3.5 and 17.2 million cells within workable time frames. In addition, we show that SCHNEL is a general clustering tool by applying it to single-cell RNA sequencing data, as well as a popular machine learning benchmark dataset MNIST. Availability and implementation Implementation is available on GitHub (https://github.com/biovault/SCHNELpy). All datasets used in this study are publicly available. Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

SCHNEL: Scalable clustering of high dimensional single-cell data

10.1101/2020.03.30.015925 ◽

2020 ◽

Cited By ~ 1

Author(s):

Tamim Abdelaal ◽

Paul de Raadt ◽

Boudewijn P.F. Lelieveldt ◽

Marcel J.T. Reinders ◽

Ahmed Mahfouz

Keyword(s):

Single Cell ◽

Graph Clustering ◽

Original Data ◽

Large Datasets ◽

Supplementary Information ◽

High Dimensional ◽

Sequencing Data ◽

Novel Approach ◽

Cellular Markers ◽

Cell Data

AbstractMotivationSingle cell data measures multiple cellular markers at the single-cell level for thousands to millions of cells. Identification of distinct cell populations is a key step for further biological understanding, usually performed by clustering this data. Dimensionality reduction based clustering tools are either not scalable to large datasets containing millions of cells, or not fully automated requiring an initial manual estimation of the number of clusters. Graph clustering tools provide automated and reliable clustering for single cell data, but suffer heavily from scalability to large datasets.ResultsWe developed SCHNEL, a scalable, reliable and automated clustering tool for high-dimensional single-cell data. SCHNEL transforms large high-dimensional data to a hierarchy of datasets containing subsets of data points following the original data manifold. The novel approach of SCHNEL combines this hierarchical representation of the data with graph clustering, making graph clustering scalable to millions of cells. Using seven different cytometry datasets, SCHNEL outperformed three popular clustering tools for cytometry data, and was able to produce meaningful clustering results for datasets of 3.5 and 17.2 million cells within workable timeframes. In addition, we show that SCHNEL is a general clustering tool by applying it to single-cell RNA sequencing data, as well as a popular machine learning benchmark dataset MNIST.Availability and ImplementationImplementation is available on GitHub (https://github.com/paulderaadt/HSNE-clustering)[email protected] informationSupplementary data are available at Bioinformatics online.

Download Full-text

CyTOFmerge: integrating mass cytometry data across multiple panels

Bioinformatics ◽

10.1093/bioinformatics/btz180 ◽

2019 ◽

Vol 35 (20) ◽

pp. 4063-4071 ◽

Cited By ~ 3

Author(s):

Tamim Abdelaal ◽

Thomas Höllt ◽

Vincent van Unen ◽

Boudewijn P F Lelieveldt ◽

Frits Koning ◽

...

Keyword(s):

Single Cell ◽

Biological Sample ◽

Supplementary Information ◽

High Dimensional ◽

Single Cell Level ◽

Supplementary Data ◽

Mass Cytometry ◽

Cell Level ◽

Cellular Markers

Abstract Motivation High-dimensional mass cytometry (CyTOF) allows the simultaneous measurement of multiple cellular markers at single-cell level, providing a comprehensive view of cell compositions. However, the power of CyTOF to explore the full heterogeneity of a biological sample at the single-cell level is currently limited by the number of markers measured simultaneously on a single panel. Results To extend the number of markers per cell, we propose an in silico method to integrate CyTOF datasets measured using multiple panels that share a set of markers. Additionally, we present an approach to select the most informative markers from an existing CyTOF dataset to be used as a shared marker set between panels. We demonstrate the feasibility of our methods by evaluating the quality of clustering and neighborhood preservation of the integrated dataset, on two public CyTOF datasets. We illustrate that by computationally extending the number of markers we can further untangle the heterogeneity of mass cytometry data, including rare cell-population detection. Availability and implementation Implementation is available on GitHub (https://github.com/tabdelaal/CyTOFmerge). Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

GPseudoClust: deconvolution of shared pseudo-profiles at single-cell resolution

10.1101/567115 ◽

2019 ◽

Author(s):

Magdalena E Strauss ◽

Paul DW Kirk ◽

John E Reid ◽

Lorenz Wernisch

Keyword(s):

Single Cell ◽

Time Course ◽

Gene Clusters ◽

Supplementary Information ◽

Clustering Methods ◽

Link Type ◽

Novel Approach ◽

Broad Array ◽

Recent Method ◽

Cell Data

AbstractMotivationMany methods have been developed to cluster genes on the basis of their changes in mRNA expression over time, using bulk RNA-seq or microarray data. However, single-cell data may present a particular challenge for these algorithms, since the temporal ordering of cells is not directly observed. One way to address this is to first use pseudotime methods to order the cells, and then apply clustering techniques for time course data. However, pseudotime estimates are subject to high levels of uncertainty, and failing to account for this uncertainty is liable to lead to erroneous and/or over-confident gene clusters.ResultsThe proposed method, GPseudoClust, is a novel approach that jointly infers pseudotem-poral ordering and gene clusters, and quantifies the uncertainty in both. GPseudoClust combines a recent method for pseudotime inference with nonparametric Bayesian clustering methods, efficient MCMC sampling, and novel subsampling strategies which aid computation. We consider a broad array of simulated and experimental datasets to demonstrate the effectiveness of GPseudoClust in a range of settings.AvailabilityAn implementation is available on GitHub: https://github.com/magStra/nonparametricSummaryPSM and https://github.com/magStra/[email protected] informationSupplementary materials are available.

Download Full-text

scds: computational annotation of doublets in single-cell RNA sequencing data

Bioinformatics ◽

10.1093/bioinformatics/btz698 ◽

2019 ◽

Cited By ~ 3

Author(s):

Abha S Bais ◽

Dennis Kostka

Keyword(s):

Single Cell ◽

Rna Sequencing ◽

Binary Classification ◽

Single Cells ◽

Computational Cost ◽

Original Data ◽

R Package ◽

Supplementary Information ◽

Sequencing Data ◽

Single Cell Rna Sequencing

Abstract Motivation Single-cell RNA sequencing (scRNA-seq) technologies enable the study of transcriptional heterogeneity at the resolution of individual cells and have an increasing impact on biomedical research. However, it is known that these methods sometimes wrongly consider two or more cells as single cells, and that a number of so-called doublets is present in the output of such experiments. Treating doublets as single cells in downstream analyses can severely bias a study’s conclusions, and therefore computational strategies for the identification of doublets are needed. Results With scds, we propose two new approaches for in silico doublet identification: Co-expression based doublet scoring (cxds) and binary classification based doublet scoring (bcds). The co-expression based approach, cxds, utilizes binarized (absence/presence) gene expression data and, employing a binomial model for the co-expression of pairs of genes, yields interpretable doublet annotations. bcds, on the other hand, uses a binary classification approach to discriminate artificial doublets from original data. We apply our methods and existing computational doublet identification approaches to four datasets with experimental doublet annotations and find that our methods perform at least as well as the state of the art, at comparably little computational cost. We observe appreciable differences between methods and across datasets and that no approach dominates all others. In summary, scds presents a scalable, competitive approach that allows for doublet annotation of datasets with thousands of cells in a matter of seconds. Availability and implementation scds is implemented as a Bioconductor R package (doi: 10.18129/B9.bioc.scds). Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

GPseudoClust: deconvolution of shared pseudo-profiles at single-cell resolution

Bioinformatics ◽

10.1093/bioinformatics/btz778 ◽

2019 ◽

Author(s):

Magdalena E Strauss ◽

Paul D W Kirk ◽

John E Reid ◽

Lorenz Wernisch

Keyword(s):

Single Cell ◽

Time Course ◽

Gene Clusters ◽

Supplementary Information ◽

Rna Seq ◽

Clustering Methods ◽

Novel Approach ◽

Broad Array ◽

Recent Method ◽

Cell Data

Abstract Motivation Many methods have been developed to cluster genes on the basis of their changes in mRNA expression over time, using bulk RNA-seq or microarray data. However, single-cell data may present a particular challenge for these algorithms, since the temporal ordering of cells is not directly observed. One way to address this is to first use pseudotime methods to order the cells, and then apply clustering techniques for time course data. However, pseudotime estimates are subject to high levels of uncertainty, and failing to account for this uncertainty is liable to lead to erroneous and/or over-confident gene clusters. Results The proposed method, GPseudoClust, is a novel approach that jointly infers pseudotemporal ordering and gene clusters, and quantifies the uncertainty in both. GPseudoClust combines a recent method for pseudotime inference with nonparametric Bayesian clustering methods, efficient MCMC sampling, and novel subsampling strategies which aid computation.We consider a broad array of simulated and experimental datasets to demonstrate the effectiveness of GPseudoClust in a range of settings. Availability An implementation is available on GitHub: https://github.com/magStra/nonparametricSummaryPSM and https://github.com/magStra/GPseudoClust. Supplementary Information Supplementary data are available at Bioinformatics online.

Download Full-text

Statistical test of structured continuous trees based on discordance matrix

Bioinformatics ◽

10.1093/bioinformatics/btz425 ◽

2019 ◽

Vol 35 (23) ◽

pp. 4962-4970

Author(s):

Xiangqi Bai ◽

Liang Ma ◽

Lin Wan

Keyword(s):

Single Cell ◽

Cell Fate ◽

Matrix Theory ◽

Continuous Process ◽

Cell Types ◽

Supplementary Information ◽

High Dimensional ◽

Intrinsic Structure ◽

Statistical Framework ◽

Cell Data

Abstract Motivation Cell fate determination is a continuous process in which one cell type diversifies to other cell types following a hierarchical path. Advancements in single-cell technologies provide the opportunity to reveal the continuum of cell progression which forms a structured continuous tree (SCTree). Computational algorithms, which are usually based on a priori assumptions on the hidden structures, have previously been proposed as a means of recovering pseudo trajectory along cell differentiation process. However, there still lack of statistical framework on the assessments of intrinsic structure embedded in high-dimensional gene expression profile. Inherit noise and cell-to-cell variation underlie the single-cell data, however, pose grand challenges to testing even basic structures, such as linear versus bifurcation. Results In this study, we propose an adaptive statistical framework, termed SCTree, to test the intrinsic structure of a high-dimensional single-cell dataset. SCTree test is conducted based on the tools derived from metric geometry and random matrix theory. In brief, by extending the Gromov–Farris transform and utilizing semicircular law, we formulate the continuous tree structure testing problem into a signal matrix detection problem. We show that the SCTree test is most powerful when the signal-to-noise ratio exceeds a moderate value. We also demonstrate that SCTree is able to robustly detect linear, single and multiple branching events with simulated datasets and real scRNA-seq datasets. Overall, the SCTree test provides a unified statistical assessment of the significance of the hidden structure of single-cell data. Availability and implementation SCTree software is available at https://github.com/XQBai/SCTree-test. Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

Using single-cell cytometry to illustrate integrated multi-perspective evaluation of clustering algorithms using Pareto fronts

Bioinformatics ◽

10.1093/bioinformatics/btab038 ◽

2021 ◽

Author(s):

Givanna H Putri ◽

Irena Koprinska ◽

Thomas M Ashhurst ◽

Nicholas J C King ◽

Mark N Read

Keyword(s):

Single Cell ◽

Performance Metrics ◽

Clustering Algorithms ◽

Latin Hypercube Sampling ◽

Supplementary Information ◽

Sequencing Data ◽

Evaluation Protocol ◽

Benchmark Datasets ◽

Pareto Fronts ◽

Parameter Values

Abstract Motivation Many ‘automated gating’ algorithms now exist to cluster cytometry and single-cell sequencing data into discrete populations. Comparative algorithm evaluations on benchmark datasets rely either on a single performance metric, or a few metrics considered independently of one another. However, single metrics emphasize different aspects of clustering performance and do not rank clustering solutions in the same order. This underlies the lack of consensus between comparative studies regarding optimal clustering algorithms and undermines the translatability of results onto other non-benchmark datasets. Results We propose the Pareto fronts framework as an integrative evaluation protocol, wherein individual metrics are instead leveraged as complementary perspectives. Judged superior are algorithms that provide the best trade-off between the multiple metrics considered simultaneously. This yields a more comprehensive and complete view of clustering performance. Moreover, by broadly and systematically sampling algorithm parameter values using the Latin Hypercube sampling method, our evaluation protocol minimizes (un)fortunate parameter value selections as confounding factors. Furthermore, it reveals how meticulously each algorithm must be tuned in order to obtain good results, vital knowledge for users with novel data. We exemplify the protocol by conducting a comparative study between three clustering algorithms (ChronoClust, FlowSOM and Phenograph) using four common performance metrics applied across four cytometry benchmark datasets. To our knowledge, this is the first time Pareto fronts have been used to evaluate the performance of clustering algorithms in any application domain. Availability and implementation Implementation of our Pareto front methodology and all scripts and datasets to reproduce this article are available at https://github.com/ghar1821/ParetoBench. Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

484 Bioturing browser: interactively explore public single cell sequencing data

Journal for ImmunoTherapy of Cancer ◽

10.1136/jitc-2020-sitc2020.0484 ◽

2020 ◽

Vol 8 (Suppl 3) ◽

pp. A520-A520

Author(s):

Son Pham ◽

Tri Le ◽

Tan Phan ◽

Minh Pham ◽

Huy Nguyen ◽

...

Keyword(s):

Single Cell ◽

Immune Cell ◽

Expression Profiles ◽

Meta Analysis ◽

Cell Types ◽

Sequencing Data ◽

Single Cell Sequencing ◽

Data Formats ◽

Cancer Types ◽

Cell Data

BackgroundSingle-cell sequencing technology has opened an unprecedented ability to interrogate cancer. It reveals significant insights into the intratumoral heterogeneity, metastasis, therapeutic resistance, which facilitates target discovery and validation in cancer treatment. With rapid advancements in throughput and strategies, a particular immuno-oncology study can produce multi-omics profiles for several thousands of individual cells. This overflow of single-cell data poses formidable challenges, including standardizing data formats across studies, performing reanalysis for individual datasets and meta-analysis.MethodsN/AResultsWe present BioTuring Browser, an interactive platform for accessing and reanalyzing published single-cell omics data. The platform is currently hosting a curated database of more than 10 million cells from 247 projects, covering more than 120 immune cell types and subtypes, and 15 different cancer types. All data are processed and annotated with standardized labels of cell types, diseases, therapeutic responses, etc. to be instantly accessed and explored in a uniform visualization and analytics interface. Based on this massive curated database, BioTuring Browser supports searching similar expression profiles, querying a target across datasets and automatic cell type annotation. The platform supports single-cell RNA-seq, CITE-seq and TCR-seq data. BioTuring Browser is now available for download at www.bioturing.com.ConclusionsN/A

Download Full-text

Discovering a sparse set of pairwise discriminating features in high-dimensional data

Bioinformatics ◽

10.1093/bioinformatics/btaa690 ◽

2020 ◽

Author(s):

Samuel Melton ◽

Sharad Ramanathan

Keyword(s):

Single Cell ◽

Dimensional Space ◽

Cell Types ◽

Dimensional Subspace ◽

Supplementary Information ◽

High Dimensional ◽

Technological Advances ◽

Data Points ◽

Low Dimensional ◽

Sparse Set

Abstract Motivation Recent technological advances produce a wealth of high-dimensional descriptions of biological processes, yet extracting meaningful insight and mechanistic understanding from these data remains challenging. For example, in developmental biology, the dynamics of differentiation can now be mapped quantitatively using single-cell RNA sequencing, yet it is difficult to infer molecular regulators of developmental transitions. Here, we show that discovering informative features in the data is crucial for statistical analysis as well as making experimental predictions. Results We identify features based on their ability to discriminate between clusters of the data points. We define a class of problems in which linear separability of clusters is hidden in a low-dimensional space. We propose an unsupervised method to identify the subset of features that define a low-dimensional subspace in which clustering can be conducted. This is achieved by averaging over discriminators trained on an ensemble of proposed cluster configurations. We then apply our method to single-cell RNA-seq data from mouse gastrulation, and identify 27 key transcription factors (out of 409 total), 18 of which are known to define cell states through their expression levels. In this inferred subspace, we find clear signatures of known cell types that eluded classification prior to discovery of the correct low-dimensional subspace. Availability and implementation https://github.com/smelton/SMD. Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

breakpointR: an R/Bioconductor package to localize strand state changes in Strand-seq data

Bioinformatics ◽

10.1093/bioinformatics/btz681 ◽

2019 ◽

Cited By ~ 4

Author(s):

David Porubsky ◽

Ashley D Sanders ◽

Aaron Taudt ◽

Maria Colomé-Tatché ◽

Peter M Lansdorp ◽

...

Keyword(s):

Single Cell ◽

Sister Chromatid Exchanges ◽

Supplementary Information ◽

Specific Information ◽

Bioconductor Package ◽

Sequencing Data ◽

Sequencing Technique ◽

Important Addition ◽

State Changes ◽

Chromatid Exchanges

Abstract Motivation Strand-seq is a specialized single-cell DNA sequencing technique centered around the directionality of single-stranded DNA. Computational tools for Strand-seq analyses must capture the strand-specific information embedded in these data. Results Here we introduce breakpointR, an R/Bioconductor package specifically tailored to process and interpret single-cell strand-specific sequencing data obtained from Strand-seq. We developed breakpointR to detect local changes in strand directionality of aligned Strand-seq data, to enable fine-mapping of sister chromatid exchanges, germline inversion and to support global haplotype assembly. Given the broad spectrum of Strand-seq applications we expect breakpointR to be an important addition to currently available tools and extend the accessibility of this novel sequencing technique. Availability and implementation R/Bioconductor package https://bioconductor.org/packages/breakpointR. Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text