destiny: diffusion maps for large-scale single-cell data in R

Philipp Angerer; Laleh Haghverdi; Maren Büttner; Fabian J. Theis; Carsten Marr; Florian Buettner

doi:10.1093/bioinformatics/btv715

destiny – diffusion maps for large-scale single-cell data in R

10.1101/023309 ◽

2015 ◽

Cited By ~ 6

Author(s):

Philipp Angerer ◽

Laleh Haghverdi ◽

Maren Büttner ◽

Fabian J. Theis ◽

Carsten Marr ◽

...

Keyword(s):

Single Cell ◽

Large Scale ◽

Cellular Reprogramming ◽

Noise Model ◽

Diffusion Maps ◽

Time Resolved ◽

Describing Functions ◽

Link Type ◽

Cell Expression ◽

Cell Data

Abstract. Summary: Diffusion maps are a spectral method for non-linear dimension reduction and have recently been adapted for the visualization of single-cell expression data. Here we present destiny, an efficient R implementation of the diffusion map algorithm. Our package includes a single-cell specific noise model allowing for missing and censored values. In contrast to previous implementations, we further present an efficient nearest-neighbour approximation that allows for the processing of hundreds of thousands of cells and a functionality for projecting new data on existing diffusion maps. We exemplarily apply destiny to a recent time-resolved mass cytometry dataset of cellular reprogramming. Availability and implementation: destiny is an open-source R/Bioconductor package (http://bioconductor.org/packages/destiny), also available at https://www.helmholtz-muenchen.de/icb/destiny. A detailed vignette describing functions and workflows is provided with the package. Contact: [email protected], [email protected]
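As a rough illustration of the diffusion map idea summarized in this abstract, the following Python sketch builds a dense Gaussian kernel, normalizes it into a Markov transition matrix, and returns the leading non-trivial eigenvectors as diffusion components. This is not destiny's implementation (destiny is an R package and adds a noise model and a nearest-neighbour approximation); the function name `diffusion_map` and the parameters `sigma` and `n_components` are assumptions made for this example.

```python
import numpy as np

def diffusion_map(data, sigma=1.0, n_components=2):
    """Dense diffusion map sketch: Gaussian kernel -> Markov matrix -> eigenvectors."""
    data = np.asarray(data, dtype=float)

    # Pairwise squared Euclidean distances between cells (rows of `data`).
    sq_norms = np.sum(data ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (data @ data.T)
    sq_dists = np.maximum(sq_dists, 0.0)  # clip tiny negative rounding errors

    # Gaussian kernel with a single global bandwidth (an assumption of this sketch).
    kernel = np.exp(-sq_dists / (2.0 * sigma ** 2))

    # The symmetric matrix D^{-1/2} K D^{-1/2} has the same eigenvalues as the
    # row-stochastic transition matrix D^{-1} K, but can be handled by eigh.
    degree = kernel.sum(axis=1)
    inv_sqrt = 1.0 / np.sqrt(degree)
    sym = kernel * inv_sqrt[:, None] * inv_sqrt[None, :]

    eigvals, eigvecs = np.linalg.eigh(sym)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]       # re-sort to descending
    eigvals = eigvals[order]
    eigvecs = eigvecs[:, order]

    # Convert to right eigenvectors of the transition matrix D^{-1} K.
    right = eigvecs * inv_sqrt[:, None]

    # Skip the first, trivial component (eigenvalue 1, constant eigenvector).
    return eigvals[1:n_components + 1], right[:, 1:n_components + 1]
```

For example, calling `diffusion_map(x, sigma=0.2)` on points sampled along a line yields a first diffusion component that orders the points along that line. The dense kernel needs memory quadratic in the number of cells, which is the kind of cost a nearest-neighbour approximation is meant to avoid.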


EpiScanpy: integrated single-cell epigenomic analysis

10.1101/648097 ◽

2019 ◽

Cited By ~ 4

Author(s):

Anna Danese ◽

Maria L. Richter ◽

David S. Fischer ◽

Fabian J. Theis ◽

Maria Colomé-Tatché

Keyword(s):

Dna Methylation ◽

Single Cell ◽

Large Scale ◽

Feature Space ◽

Rna Seq ◽

Computational Framework ◽

Learning Techniques ◽

Multiple Feature ◽

The Many ◽

Cell Data

Abstract: Epigenetic single-cell measurements reveal a layer of regulatory information not accessible to single-cell transcriptomics; however, single-cell omics analysis tools mainly focus on gene expression data. To address this issue, we present epiScanpy, a computational framework for the analysis of single-cell DNA methylation and single-cell ATAC-seq data. EpiScanpy makes the many existing RNA-seq workflows from scanpy available to large-scale single-cell data from other -omics modalities. We introduce and compare multiple feature space constructions for epigenetic data and show the feasibility of common clustering, dimension reduction and trajectory learning techniques. We benchmark epiScanpy by interrogating different single-cell mouse brain atlases of DNA methylation, ATAC-seq and transcriptomics. We find that differentially methylated and differentially open markers between cell clusters enrich transcriptome-based cell type labels by orthogonal epigenetic information.


scDIOR: single cell RNA-seq data IO software

BMC Bioinformatics ◽

10.1186/s12859-021-04528-3 ◽

2022 ◽

Vol 23 (1) ◽

Author(s):

Huijian Feng ◽

Lihui Lin ◽

Jiekai Chen

Keyword(s):

Single Cell ◽

Programming Languages ◽

Large Scale ◽

Developmental Trajectories ◽

Rapid Development ◽

Data Transformation ◽

Rna Seq ◽

Data Types ◽

User Friendly ◽

Cell Data

Abstract. Background: Single-cell RNA sequencing is becoming a powerful tool to identify cell states, reconstruct developmental trajectories, and deconvolute spatial expression. The rapid development of computational methods promotes insight into heterogeneous single-cell data. An increasing number of tools have been provided for biological analysts, and two programming languages, R and Python, are widely used among researchers. R and Python are complementary, as many methods are implemented specifically in R or Python. However, the different platforms immediately cause data sharing and transformation problems, especially for Scanpy, Seurat, and SingleCellExperiment. Currently, there is no efficient and user-friendly software to perform data transformation of single-cell omics between platforms, which forces users to spend excessive time on data input and output (IO), significantly reducing the efficiency of data analysis. Results: We developed scDIOR for single-cell data transformation between platforms of R and Python based on Hierarchical Data Format Version 5 (HDF5). We have created a data IO ecosystem between three R packages (Seurat, SingleCellExperiment, Monocle) and a Python package (Scanpy). Importantly, scDIOR accommodates a variety of data types across programming languages and platforms in an ultrafast way, including single-cell RNA-seq and spatially resolved transcriptomics data, using only a few lines of code in an IDE or command line interface. For large-scale datasets, users can partially load the needed information, e.g., cell annotation without the gene expression matrices. scDIOR connects the analytical tasks of different platforms, which makes it easy to compare the performance of algorithms between them. Conclusions: scDIOR contains two modules, dior in R and diopy in Python. scDIOR is a versatile and user-friendly tool that implements single-cell data transformation between R and Python rapidly and stably. The software is freely accessible at https://github.com/JiekaiLab/scDIOR.


Exploring Single-Cell Data with Deep Multitasking Neural Networks

10.1101/237065 ◽

2017 ◽

Cited By ~ 7

Author(s):

Matthew Amodio ◽

David van Dijk ◽

Krishnan Srinivasan ◽

William S Chen ◽

Hussein Mohsen ◽

...

Keyword(s):

Neural Network ◽

Neural Networks ◽

Data Analysis ◽

Experimental Design ◽

Single Cell ◽

Large Scale ◽

Dengue Infection ◽

Data Representation ◽

Data Generation ◽

Cell Data

Abstract: Biomedical researchers are generating high-throughput, high-dimensional single-cell data at a staggering rate. As costs of data generation decrease, experimental design is moving towards measurement of many different single-cell samples in the same dataset. These samples can correspond to different patients, conditions, or treatments. While scalability of methods to datasets of these sizes is a challenge on its own, dealing with large-scale experimental design presents a whole new set of problems, including batch effects and sample comparison issues. Currently, there are no computational tools that can both handle large amounts of data in a scalable manner (many cells) and at the same time deal with many samples (many patients or conditions). Moreover, data analysis currently involves the use of different tools that each operate on their own data representation, not guaranteeing a synchronized analysis pipeline. For instance, data visualization methods can be disjoint and mismatched with the clustering method. For this purpose, we present SAUCIE, a deep neural network that leverages the high degree of parallelization and scalability offered by neural networks, as well as the deep representation of data that can be learned by them, to perform many single-cell data analysis tasks, all on a unified representation. A well-known limitation of neural networks is their interpretability. Our key contributions here are newly formulated regularizations (penalties) that render features learned in hidden layers of the neural network interpretable. When large multi-patient datasets are fed into SAUCIE, the various hidden layers contain denoised and batch-corrected data, a low-dimensional visualization, unsupervised clustering, as well as other information that can be used to explore the data. We show this capability by analyzing a newly generated 180-sample dataset consisting of T cells from dengue patients in India, measured with mass cytometry. We show that SAUCIE, for the first time, can batch correct and process this 11-million-cell dataset to identify cluster-based signatures of acute dengue infection and create a patient manifold, stratifying immune response to dengue on the basis of single-cell measurements.


D-EE: Distributed software for visualizing intrinsic structure of large-scale single-cell data

GigaScience ◽

10.1093/gigascience/giaa126 ◽

2020 ◽

Vol 9 (11) ◽

Cited By ~ 1

Author(s):

Shaokun An ◽

Jizu Huang ◽

Lin Wan

Keyword(s):

Time Series ◽

Dimensionality Reduction ◽

Single Cell ◽

High Performance ◽

Large Scale ◽

Distributed Storage ◽

Distributed Computation ◽

Low Dimensional ◽

Cell Data ◽

Performance Computing

Abstract. Background: Dimensionality reduction and visualization play vital roles in single-cell RNA sequencing (scRNA-seq) data analysis. While they have been extensively studied, state-of-the-art dimensionality reduction algorithms are often unable to preserve the global structures underlying data. Elastic embedding (EE), a nonlinear dimensionality reduction method, has shown promise in revealing low-dimensional intrinsic local and global data structure. However, the current implementation of the EE algorithm lacks scalability to large-scale scRNA-seq data. Results: We present a distributed optimization implementation of the EE algorithm, termed distributed elastic embedding (D-EE). D-EE reveals the low-dimensional intrinsic structures of data with accuracy equal to that of elastic embedding, and it is scalable to large-scale scRNA-seq data. It leverages distributed storage and distributed computation, achieving memory efficiency and high-performance computing simultaneously. In addition, an extended version of D-EE, termed distributed optimization implementation of time-series elastic embedding (D-TSEE), enables the user to visualize large-scale time-series scRNA-seq data by incorporating experimental temporal information. Results with large-scale scRNA-seq data indicate that D-TSEE can uncover oscillatory gene expression patterns by using experimental temporal information. Conclusions: D-EE is a distributed dimensionality reduction and visualization tool. Its distributed storage and distributed computation technique allow us to efficiently analyze large-scale single-cell data at the cost of constant time speedup. The source code for the D-EE algorithm, based on C and MPI and tailored to a high-performance computing cluster, is available at https://github.com/ShaokunAn/D-EE.


EpiScanpy: integrated single-cell epigenomic analysis

Nature Communications ◽

10.1038/s41467-021-25131-3 ◽

2021 ◽

Vol 12 (1) ◽

Author(s):

Anna Danese ◽

Maria L. Richter ◽

Kridsadakorn Chaichoompu ◽

David S. Fischer ◽

Fabian J. Theis ◽

...

Keyword(s):

Single Cell ◽

Large Scale ◽

Feature Space ◽

Cell Types ◽

Reduction Cell ◽

Learning Techniques ◽

Multiple Feature ◽

The Many ◽

Cell Data ◽

Trajectory Learning

Abstract: EpiScanpy is a toolkit for the analysis of single-cell epigenomic data, namely single-cell DNA methylation and single-cell ATAC-seq data. To address the modality-specific challenges of epigenomics data, epiScanpy quantifies the epigenome using multiple feature space constructions and builds a nearest neighbour graph using epigenomic distance between cells. EpiScanpy makes the many existing scRNA-seq workflows from scanpy available to large-scale single-cell data from other -omics modalities, including methods for common clustering, dimension reduction, cell type identification and trajectory learning techniques, as well as an atlas integration tool for scATAC-seq datasets. The toolkit also features numerous useful downstream functions, such as differential methylation and differential openness calling, mapping epigenomic features of interest to their nearest gene, or constructing gene activity matrices using chromatin openness. We successfully benchmark epiScanpy against other scATAC-seq analysis tools and show that it outperforms them at discriminating cell types.


PARC: ultrafast and accurate clustering of phenotypic data of millions of single cells

Bioinformatics ◽

10.1093/bioinformatics/btaa042 ◽

2020 ◽

Vol 36 (9) ◽

pp. 2778-2786 ◽

Cited By ~ 5

Author(s):

Shobana V Stassen ◽

Dickson M D Siu ◽

Kelvin C M Lee ◽

Joshua W K Ho ◽

Hayden K H So ◽

...

Keyword(s):

Single Cell ◽

Large Scale ◽

Clustering Algorithm ◽

Single Cells ◽

Clustering Algorithms ◽

Cellular Heterogeneity ◽

Supplementary Information ◽

Phenotypic Data ◽

Scalable Algorithm ◽

Cell Data

Abstract. Motivation: New single-cell technologies continue to fuel the explosive growth in the scale of heterogeneous single-cell data. However, existing computational methods are inadequately scalable to large datasets and therefore cannot uncover the complex cellular heterogeneity. Results: We introduce a highly scalable graph-based clustering algorithm, PARC (Phenotyping by Accelerated Refined Community-partitioning), for large-scale, high-dimensional single-cell data (>1 million cells). Using large single-cell flow and mass cytometry, RNA-seq and imaging-based biophysical data, we demonstrate that PARC consistently outperforms state-of-the-art clustering algorithms without subsampling of cells, including Phenograph, FlowSOM and Flock, in terms of both speed and ability to robustly detect rare cell populations. For example, PARC can cluster a single-cell dataset of 1.1 million cells within 13 min, compared with >2 h for the next fastest graph-clustering algorithm. Our work presents a scalable algorithm to cope with increasingly large-scale single-cell analysis. Availability and implementation: https://github.com/ShobiStassen/PARC. Supplementary information: Supplementary data are available at Bioinformatics online.


NEBULA is a fast negative binomial mixed model for differential or co-expression analysis of large-scale multi-subject single-cell data

Communications Biology ◽

10.1038/s42003-021-02146-6 ◽

2021 ◽

Vol 4 (1) ◽

Author(s):

Liang He ◽

Jose Davila-Velderrain ◽

Tomokazu S. Sumida ◽

David A. Hafler ◽

Manolis Kellis ◽

...

Keyword(s):

Single Cell ◽

Expression Analysis ◽

Mixed Models ◽

Large Scale ◽

Mixed Model ◽

Negative Binomial ◽

Dependent Manner ◽

Cell Level ◽

Cell Data ◽

Negative Binomial Mixed Model

Abstract: The increasing availability of single-cell data revolutionizes the understanding of biological mechanisms at cellular resolution. For differential expression analysis in multi-subject single-cell data, negative binomial mixed models account for both subject-level and cell-level overdispersions, but are computationally demanding. Here, we propose an efficient NEgative Binomial mixed model Using a Large-sample Approximation (NEBULA). The speed gain is achieved by analytically solving high-dimensional integrals instead of using the Laplace approximation. We demonstrate that NEBULA is orders of magnitude faster than existing tools and controls false-positive errors in marker gene identification and co-expression analysis. Using NEBULA in Alzheimer’s disease cohort data sets, we found that the cell-level expression of APOE correlated with that of other genetic risk factors (including CLU, CST3, TREM2, C1q, and ITM2B) in a cell-type-specific pattern and an isoform-dependent manner in microglia. NEBULA opens up a new avenue for the broad application of mixed models to large-scale multi-subject single-cell data.


A Performance Evaluation Model of Single-Cell Pseudotime Trajectory Inference Algorithms

10.3233/atde210335 ◽

2021 ◽

Author(s):

Jiuru Zhu ◽

Jiaqing Chen ◽

Peizheng Li ◽

Yuanze Chen

Keyword(s):

Performance Evaluation ◽

Environmental Factors ◽

Single Cell ◽

Correlation Coefficient ◽

Large Scale ◽

Evaluation Model ◽

Performance Evaluation Model ◽

Inference Algorithms ◽

Cell Data ◽

A Performance

The study of single-cell pseudotime trajectories is of great significance for exploring the environmental factors of life and disease. The large scale and complexity of single-cell data pose great challenges for single-cell pseudotime trajectory algorithms. A performance evaluation model is proposed to measure the performance of existing pseudotime trajectory inference algorithms and to identify their problems, in order to promote their improvement. Given the original single-cell data, the model uses the Spearman correlation coefficient to evaluate the inference algorithms in terms of noise resistance and robustness. Experiments on the algorithms Monocle2 and Scout were conducted to analyze how well the model works in practice.
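To make the evaluation idea in this abstract concrete, here is a minimal Python sketch of one way a Spearman-based noise-resistance score could be computed: pseudotime is inferred once on the original data and once on a noise-perturbed copy, and the two orderings are compared by rank correlation. The abstract does not specify the model's details, so the function names (`spearman`, `noise_resistance`), the additive Gaussian noise, and the `infer_pseudotime` callable are all assumptions of this example; the rank computation also assumes there are no tied values.

```python
import numpy as np

def ranks(values):
    """Return 0-based ranks of `values` (assumes no ties)."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="stable")
    result = np.empty(len(values), dtype=float)
    result[order] = np.arange(len(values))
    return result

def spearman(a, b):
    """Spearman correlation: Pearson correlation of the ranks."""
    return float(np.corrcoef(ranks(a), ranks(b))[0, 1])

def noise_resistance(infer_pseudotime, data, noise_sd, seed=0):
    """Compare pseudotime on the original data with pseudotime on noisy data.

    `infer_pseudotime` is a hypothetical callable mapping a (cells x genes)
    array to one pseudotime value per cell.
    """
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    noisy = data + rng.normal(0.0, noise_sd, size=data.shape)
    return spearman(infer_pseudotime(data), infer_pseudotime(noisy))
```

A score close to 1 means the inferred cell ordering is largely unchanged by the added noise, while lower values indicate that the ordering is sensitive to perturbation.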


PARC: ultrafast and accurate clustering of phenotypic data of millions of single cells

10.1101/765628 ◽

2019 ◽

Author(s):

Shobana V. Stassen ◽

Dickson M. D. Siu ◽

Kelvin C. M. Lee ◽

Joshua W. K. Ho ◽

Hayden K. H. So ◽

...

Keyword(s):

Single Cell ◽

Large Scale ◽

Clustering Algorithm ◽

Single Cells ◽

Clustering Algorithms ◽

Cell Mass ◽

Cellular Heterogeneity ◽

Phenotypic Data ◽

Data Set ◽

Cell Data

Abstract. Motivation: New single-cell technologies continue to fuel the explosive growth in the scale of heterogeneous single-cell data. However, existing computational methods are inadequately scalable to large datasets and therefore cannot uncover the complex cellular heterogeneity. Results: We introduce a highly scalable graph-based clustering algorithm, PARC (phenotyping by accelerated refined community-partitioning), for ultralarge-scale, high-dimensional single-cell data (>1 million cells). Using large single-cell mass cytometry, RNA-seq and imaging-based biophysical data, we demonstrate that PARC consistently outperforms state-of-the-art clustering algorithms without sub-sampling of cells, including Phenograph, FlowSOM, and Flock, in terms of both speed and ability to robustly detect rare cell populations. For example, PARC can cluster a single-cell data set of 1.1M cells within 13 minutes, compared to >2 hours for the next fastest graph-clustering algorithm, Phenograph. Our work presents a scalable algorithm to cope with increasingly large-scale single-cell analysis. Availability and implementation: https://github.com/ShobiStassen/PARC
