Making many out of few: deep generative models for single-cell RNA-sequencing data

2020 ◽  
Author(s):  
Martin Treppner ◽  
Adrián Salas-Bastos ◽  
Moritz Hess ◽  
Stefan Lenz ◽  
Tanja Vogel ◽  
...  

ABSTRACT: Deep generative models, such as variational autoencoders (VAEs) or deep Boltzmann machines (DBMs), can generate an arbitrary number of synthetic observations after being trained on an initial set of samples. This has mainly been investigated for imaging data but could also be useful for single-cell transcriptomics (scRNA-seq). A small pilot study could be used for planning a full-scale study by investigating planned analysis strategies on synthetic data with different sample sizes. It is unclear whether synthetic observations generated based on a small scRNA-seq dataset reflect the properties relevant for subsequent data analysis steps. We specifically investigated two deep generative modeling approaches, VAEs and DBMs. First, we considered single-cell variational inference (scVI) in two variants, generating samples from the posterior distribution, the standard approach, or the prior distribution. Second, we propose single-cell deep Boltzmann machines (scDBM). When considering the similarity of clustering results on synthetic data to ground-truth clustering, we find that the scVI-posterior variant resulted in high variability, most likely due to amplifying artifacts of small data sets. All approaches showed mixed results for cell types with different abundance by overrepresenting highly abundant cell types and missing less abundant cell types. With increasing pilot dataset sizes, the proportions of the cells in each cluster became more similar to that of ground-truth data. We also showed that all approaches learn the univariate distribution of most genes, but problems occurred with bimodality. Overall, the results showed that generative deep learning approaches might be valuable for supporting the design of scRNA-seq experiments.

2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Martin Treppner ◽  
Adrián Salas-Bastos ◽  
Moritz Hess ◽  
Stefan Lenz ◽  
Tanja Vogel ◽  
...  

Abstract: Deep generative models, such as variational autoencoders (VAEs) or deep Boltzmann machines (DBMs), can generate an arbitrary number of synthetic observations after being trained on an initial set of samples. This has mainly been investigated for imaging data but could also be useful for single-cell transcriptomics (scRNA-seq). A small pilot study could be used for planning a full-scale experiment by investigating planned analysis strategies on synthetic data with different sample sizes. It is unclear whether synthetic observations generated based on a small scRNA-seq dataset reflect the properties relevant for subsequent data analysis steps. We specifically investigated two deep generative modeling approaches, VAEs and DBMs. First, we considered single-cell variational inference (scVI) in two variants, generating samples from the posterior distribution, the standard approach, or the prior distribution. Second, we propose single-cell deep Boltzmann machines (scDBMs). When considering the similarity of clustering results on synthetic data to ground-truth clustering, we find that the scVI-posterior variant resulted in high variability, most likely due to amplifying artifacts of small datasets. All approaches showed mixed results for cell types with different abundance by overrepresenting highly abundant cell types and missing less abundant cell types. With increasing pilot dataset sizes, the proportions of the cells in each cluster became more similar to that of ground-truth data. We also showed that all approaches learn the univariate distribution of most genes, but problems occurred with bimodality. Across all analyses, in comparing 10× Genomics and Smart-seq2 technologies, we could show that for 10× datasets, which have higher sparsity, it is more challenging to make inference from small to larger datasets.
Overall, the results show that generative deep learning approaches might be valuable for supporting the design of scRNA-seq experiments.
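The contrast between the two scVI sampling variants can be sketched in a few lines. This is an illustrative toy, not scVI itself: the stub linear decoder, the two-dimensional latent space, and the `pilot` encodings are all hypothetical stand-ins for scVI's neural decoder and zero-inflated negative binomial likelihood.

```python
import random

# Toy stand-ins: a hypothetical two-dimensional latent space and a stub
# linear decoder. Real scVI uses a neural decoder parameterizing a
# zero-inflated negative binomial likelihood.
DECODER_WEIGHTS = [[1.0, -0.5], [0.3, 0.8], [-0.2, 0.4]]

def decode(z):
    # Map a latent point z to mean expression of three illustrative genes.
    return [sum(w * zi for w, zi in zip(row, z)) for row in DECODER_WEIGHTS]

def sample_prior(n_cells, latent_dim=2):
    # Prior variant: z ~ N(0, I), independent of the pilot cells.
    return [decode([random.gauss(0.0, 1.0) for _ in range(latent_dim)])
            for _ in range(n_cells)]

def sample_posterior(n_cells, encoded_cells):
    # Posterior variant: z ~ N(mu_i, diag(sigma_i^2)) around the encodings
    # of pilot cells, which can amplify artifacts of a small pilot set.
    samples = []
    for _ in range(n_cells):
        mu, sigma = random.choice(encoded_cells)
        samples.append(decode([random.gauss(m, s) for m, s in zip(mu, sigma)]))
    return samples

# Two hypothetical encoded pilot cells: (mean, standard deviation) per dim.
pilot = [((-1.0, 0.5), (0.1, 0.1)), ((1.2, -0.3), (0.2, 0.2))]
```

The difference is that `sample_posterior` only ever explores latent regions near the pilot cells, which matches the high variability reported above for small pilot sets.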


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Ruizhu Huang ◽  
Charlotte Soneson ◽  
Pierre-Luc Germain ◽  
Thomas S.B. Schmidt ◽  
Christian Von Mering ◽  
...  

Abstract: treeclimbR is a method for analyzing hierarchical trees of entities, such as phylogenies or cell types, at different resolutions. It proposes multiple candidates that capture the latent signal and pinpoints branches or leaves that contain features of interest, in a data-driven way. It outperforms currently available methods on synthetic data, and we highlight the approach on various applications, including microbiome and microRNA surveys as well as single-cell cytometry and RNA-seq datasets. With the emergence of various multi-resolution genomic datasets, treeclimbR provides a thorough inspection of entities across resolutions and gives additional flexibility to uncover biological associations.


2019 ◽  
Author(s):  
Chenling Xu ◽  
Romain Lopez ◽  
Edouard Mehlman ◽  
Jeffrey Regier ◽  
Michael I. Jordan ◽  
...  

Abstract: As single-cell transcriptomics becomes a mainstream technology, the natural next step is to integrate the accumulating data in order to achieve a common ontology of cell types and states. However, owing to various nuisance factors of variation, it is not straightforward how to compare gene expression levels across data sets and how to automatically assign cell type labels in a new data set based on existing annotations. In this manuscript, we demonstrate that our previously developed method, scVI, provides an effective and fully probabilistic approach for joint representation and analysis of cohorts of single-cell RNA-seq data sets, while accounting for uncertainty caused by biological and measurement noise. We also introduce single-cell ANnotation using Variational Inference (scANVI), a semi-supervised variant of scVI designed to leverage any available cell state annotations (for instance, when only one data set in a cohort is annotated, or when only a few cells in a single data set can be labeled using marker genes). We demonstrate that scVI and scANVI compare favorably to existing methods for data integration and cell state annotation in terms of accuracy, scalability, and adaptability to challenging settings such as a hierarchical structure of cell state labels. We further show that, unlike existing methods, scVI and scANVI represent the integrated datasets with a single generative model that can be directly used for any probabilistic decision-making task, using differential expression as our case study. scVI and scANVI are available as open source software and can be readily used to facilitate cell state annotation and help ensure consistency and reproducibility across studies.


Author(s):  
Peter Rupprecht ◽  
Stefano Carta ◽  
Adrian Hoffmann ◽  
Mayumi Echizen ◽  
Kazuo Kitamura ◽  
...  

ABSTRACT: Calcium imaging is a key method to record patterns of neuronal activity across populations of identified neurons. Inference of temporal patterns of action potentials ('spikes') from calcium signals is, however, challenging and often limited by the scarcity of ground truth data containing simultaneous measurements of action potentials and calcium signals. To overcome this problem, we compiled a large and diverse ground truth database from publicly available and newly performed recordings. This database covers various types of calcium indicators, cell types, and signal-to-noise ratios and comprises a total of >20 hours from 225 neurons. We then developed a novel algorithm for spike inference (CASCADE) that is based on supervised deep networks, takes advantage of the ground truth database, infers absolute spike rates, and outperforms existing model-based algorithms. To optimize performance for unseen imaging data, CASCADE retrains itself by resampling ground truth data to match the respective sampling rate and noise level. As a consequence, no parameters need to be adjusted by the user. To facilitate routine application of CASCADE, we developed systematic performance assessments for unseen data, openly release all resources, and provide a user-friendly cloud-based implementation.
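CASCADE's core retraining idea, adapting the ground truth to the target recording's sampling rate and noise level, can be sketched as follows. The nearest-neighbour resampler and Gaussian noise matching below are deliberate simplifications, not CASCADE's actual preprocessing.

```python
import random

def resample(trace, src_rate, dst_rate):
    # Nearest-neighbour resampling of a 1-D fluorescence trace from
    # src_rate (Hz) to dst_rate (Hz); a real pipeline would interpolate.
    n_out = int(len(trace) * dst_rate / src_rate)
    return [trace[min(int(i * src_rate / dst_rate), len(trace) - 1)]
            for i in range(n_out)]

def match_noise(trace, target_sd, current_sd):
    # Add just enough Gaussian noise so the trace reaches the target
    # noise level (variances add for independent noise sources).
    extra_sd = max(target_sd ** 2 - current_sd ** 2, 0.0) ** 0.5
    return [x + random.gauss(0.0, extra_sd) for x in trace]
```

Applying both steps to every ground-truth trace before retraining is what lets the network be matched to unseen data without user-tuned parameters.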


2005 ◽  
Vol 17 (11) ◽  
pp. 2482-2507 ◽  
Author(s):  
Qi Zhao ◽  
David J. Miller

The goal of semisupervised clustering/mixture modeling is to learn the underlying groups comprising a given data set when there is also some form of instance-level supervision available, usually in the form of labels or pairwise sample constraints. Most prior work with constraints assumes the number of classes is known, with each learned cluster assumed to be a class and, hence, subject to the given class constraints. When the number of classes is unknown or when the one-cluster-per-class assumption is not valid, the use of constraints may actually be deleterious to learning the ground-truth data groups. We address this by (1) allowing allocation of multiple mixture components to individual classes and (2) estimating both the number of components and the number of classes. We also address new class discovery, with components void of constraints treated as putative unknown classes. For both real-world and synthetic data, our method is shown to accurately estimate the number of classes and to give favorable comparison with the recent approach of Shental, Bar-Hillel, Hertz, and Weinshall (2003).
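The relaxation of the one-cluster-per-class assumption can be illustrated with a toy post-hoc assignment: each mixture component takes the majority class among its labelled members, several components may map to the same class, and components without any constraints are flagged as putative new classes. The helper name and the "new-class" marker are hypothetical, not from the paper.

```python
from collections import Counter, defaultdict

def assign_components_to_classes(component_of, label_of):
    # component_of: sample -> fitted mixture component.
    # label_of: sample -> class label, for the labelled subset only.
    labels_in = defaultdict(list)
    for sample, comp in component_of.items():
        if sample in label_of:
            labels_in[comp].append(label_of[sample])
    assignment = {}
    for comp in set(component_of.values()):
        members = labels_in.get(comp)
        # Majority label if constrained; otherwise a putative new class.
        assignment[comp] = (Counter(members).most_common(1)[0][0]
                            if members else "new-class")
    return assignment
```

Here two components can legitimately share one class, so constraints no longer force over-merged clusters when the true number of classes is unknown.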


2020 ◽  
Author(s):  
Almut Luetge ◽  
Joanna Zyprych-Walczak ◽  
Urszula Brykczynska Kunzmann ◽  
Helena L Crowell ◽  
Daniela Calini ◽  
...  

A key challenge in single-cell RNA-sequencing (scRNA-seq) data analysis is dataset- and batch-specific differences that can obscure the biological signal of interest. While there are various tools and methods to perform data integration and correct for batch effects, their performance can vary between datasets and according to the nature of the bias. Therefore, it is important to understand how batch effects manifest in order to adjust for them in a reliable way. Here, we systematically explore batch effects in a variety of scRNA-seq datasets according to magnitude, cell type specificity and complexity. We developed a cell-specific mixing score (cms) that quantifies how well cells from multiple batches are mixed. By considering distance distributions (in a lower dimensional space), the score is able to detect local batch bias and differentiate between unbalanced batches (i.e., when one cell type is more abundant in a batch) and systematic differences between cells of the same cell type. We implemented cms and related metrics to detect batch effects or measure structure preservation in the CellMixS R/Bioconductor package. We systematically compare different metrics that have been proposed to quantify batch effects or bias in scRNA-seq data using real datasets with known batch effects and synthetic data that mimic various real data scenarios. While these metrics target the same question and are used interchangeably, we find differences in inter- and intra-dataset scalability, sensitivity and in a metric's ability to handle batch effects with differentially abundant cell types. We find that cell-specific metrics outperform cell type-specific and global metrics and recommend them for both method benchmarks and batch exploration.
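A minimal sketch of a cell-specific mixing score in this spirit compares the per-batch distance distributions in one cell's neighbourhood. The published cms is based on hypothesis testing of distance distributions; this toy uses a plain two-sample Kolmogorov-Smirnov statistic instead, so the numbers are only illustrative.

```python
def ks_statistic(a, b):
    # Two-sample Kolmogorov-Smirnov statistic: the maximum gap between
    # the two empirical cumulative distribution functions.
    d = 0.0
    for v in sorted(set(a) | set(b)):
        cdf_a = sum(x <= v for x in a) / len(a)
        cdf_b = sum(x <= v for x in b) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d

def mixing_score(neighbours):
    # neighbours: (distance, batch) pairs in one cell's neighbourhood.
    # Well-mixed batches yield similar distance distributions, so the
    # KS gap is small and the score is close to 1.
    by_batch = {}
    for dist, batch in neighbours:
        by_batch.setdefault(batch, []).append(dist)
    groups = list(by_batch.values())
    if len(groups) < 2:
        return 0.0  # only one batch nearby: no mixing at all
    return 1.0 - ks_statistic(groups[0], groups[1])
```

Because the score is computed per cell, it stays local: a batch effect confined to one cell type lowers scores only in that region of the embedding, which is exactly the property the abstract highlights.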


2021 ◽  
Vol 12 (1) ◽  
Author(s):  
Sydney Otten ◽  
Sascha Caron ◽  
Wieske de Swart ◽  
Melissa van Beekveld ◽  
Luc Hendriks ◽  
...  

Abstract: Simulating nature, and in particular processes in particle physics, requires expensive computations that sometimes take much longer than scientists can afford. Here, we explore ways to a solution for this problem by investigating recent advances in generative modeling and present a study for the generation of events from a physical process with deep generative models. The simulation of physical processes requires not only the production of physical events, but also ensuring that these events occur with the correct frequencies. We investigate the feasibility of learning the event generation and the frequency of occurrence with several generative machine learning models to produce events like Monte Carlo generators. We study three processes: a simple two-body decay, the processes e+e− → Z → l+l− and pp → tt̄ including the decay of the top quarks, and a simulation of the detector response. By buffering density information of encoded Monte Carlo events, given the encoder of a variational autoencoder, we are able to construct a prior for the sampling of new events from the decoder that yields distributions in very good agreement with real Monte Carlo events and generated several orders of magnitude faster. Applications of this work include generic density estimation and sampling, targeted event generation via a principal component analysis of encoded ground truth data, anomaly detection, and more efficient importance sampling, e.g., for the phase space integration of matrix elements in quantum field theories.
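The buffering idea, turning the empirical density of encoded Monte Carlo events into a sampling prior for the decoder, can be illustrated in one latent dimension. The histogram prior below is a deliberately crude stand-in for the density information described in the abstract.

```python
import random
from collections import Counter

def build_buffered_prior(encoded_events, bin_width=0.5):
    # Histogram the 1-D latent codes of encoded Monte Carlo events; bin
    # frequencies become the sampling weights for new latent draws.
    counts = Counter(int(z // bin_width) for z in encoded_events)
    bins = list(counts)
    weights = [counts[b] for b in bins]

    def sample():
        # Pick a bin by its empirical frequency, then draw uniformly in it.
        b = random.choices(bins, weights=weights)[0]
        return (b + random.random()) * bin_width

    return sample
```

Sampling latent codes from this buffered density, rather than from a standard normal prior, is what keeps the generated events at the correct relative frequencies.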


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Manuel Schultheiss ◽  
Philipp Schmette ◽  
Jannis Bodden ◽  
Juliane Aichele ◽  
Christina Müller-Leisse ◽  
...  

Abstract: We present a method to generate synthetic thorax radiographs with realistic nodules from CT scans, together with perfect ground-truth knowledge. We evaluated the detection performance of nine radiologists and two convolutional neural networks in a reader study. Nodules were artificially inserted into the lung of a CT volume and synthetic radiographs were obtained by forward-projecting the volume. Hence, our framework allowed for a detailed evaluation of CAD systems' and radiologists' performance due to the availability of accurate ground-truth labels for nodules from synthetic data. Radiographs for network training (U-Net and RetinaNet) were generated from 855 CT scans of a public dataset. For the reader study, 201 radiographs were generated from 21 nodule-free CT scans, with varying positions, sizes and counts of inserted nodules. On average, the nine radiologists achieved 248.8 true positive, 51.7 false positive and 121.2 false negative nodule detections. The best performing CAD system achieved 268 true positives, 66 false positives and 102 false negatives. Corresponding weighted alternative free-response operating characteristic figures-of-merit (wAFROC FOM) for the radiologists range from 0.54 to 0.87, compared to a value of 0.81 (CI 0.75–0.87) for the best performing CNN. The CNN did not perform significantly better than the combined average of the nine readers (p = 0.49). Paramediastinal nodules accounted for most false positive and false negative detections by readers, which can be explained by the presence of more tissue in this area.
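Forward projection of a CT volume into a synthetic radiograph reduces, in its simplest parallel-beam form, to summing attenuation along the beam axis. The sketch below ignores cone-beam geometry and exponential (Beer-Lambert) attenuation, which a realistic simulation would model.

```python
def forward_project(volume):
    # Sum attenuation values along the z axis of a CT volume indexed as
    # volume[z][y][x], yielding a parallel-beam synthetic radiograph.
    depth, height, width = len(volume), len(volume[0]), len(volume[0][0])
    return [[sum(volume[z][y][x] for z in range(depth))
             for x in range(width)] for y in range(height)]
```

Because nodule voxels are inserted into the volume before projection, their exact 2-D footprint in the radiograph is known, which is what provides the perfect ground-truth labels.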


2021 ◽  
Author(s):  
Fangfang Yan ◽  
Zhongming Zhao ◽  
Lukas M. Simon

ABSTRACT: Droplet-based single-cell RNA sequencing (scRNA-seq) has significantly increased the number of cells profiled per experiment and revolutionized the study of individual transcriptomes. However, to maximize the biological signal, robust computational methods are needed to distinguish cell-free from cell-containing droplets. Here, we introduce a novel cell-calling algorithm called EmptyNN, which trains a neural network based on positive-unlabeled learning for improved filtering of barcodes. We leveraged cell hashing and genetic variation to provide ground truth. EmptyNN accurately removed cell-free droplets while recovering lost cell clusters, achieving an area under the receiver operating characteristic curve (AUROC) of 94.73% and 96.30%, respectively. Comparisons to current state-of-the-art cell-calling algorithms demonstrated the superior performance of EmptyNN, as measured by the number of recovered cell-containing droplets and cell types. EmptyNN was further applied to two additional datasets and showed good performance. Therefore, EmptyNN represents a powerful tool to enhance scRNA-seq quality control analyses.
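Positive-unlabeled bagging of the kind EmptyNN builds on can be sketched with a stub scorer standing in for the neural network. The function name and sampling scheme below are illustrative assumptions, not EmptyNN's implementation: in each round, a random subset of unlabeled barcodes plays the role of provisional negatives and the out-of-bag barcodes are scored.

```python
import random

def pu_bagging(positives, unlabeled, score_fn, n_rounds=20):
    # Positive-unlabeled bagging: each round, a random subset of the
    # unlabeled barcodes is treated as provisional negatives, and the
    # remaining (out-of-bag) barcodes are scored; scores are averaged
    # across rounds. score_fn stands in for a trained classifier.
    votes = {u: [] for u in unlabeled}
    for _ in range(n_rounds):
        bag = set(random.sample(unlabeled, k=len(positives)))
        for u in unlabeled:
            if u not in bag:
                votes[u].append(score_fn(u, positives, bag))
    return {u: sum(v) / len(v) for u, v in votes.items() if v}
```

Averaging over many random bags is what lets a classifier learn from certain positives (e.g. clearly cell-free barcodes) without ever seeing confirmed negatives.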


2017 ◽  
Author(s):  
Jiarui Ding ◽  
Anne Condon ◽  
Sohrab P. Shah

Single-cell RNA-sequencing has great potential to discover cell types, identify cell states, trace development lineages, and reconstruct the spatial organization of cells. However, dimension reduction to interpret structure in single-cell sequencing data remains a challenge. Existing algorithms are either not able to uncover the clustering structures in the data, or lose global information such as groups of clusters that are close to each other. We present a robust statistical model, scvis, to capture and visualize the low-dimensional structures in single-cell gene expression data. Simulation results demonstrate that low-dimensional representations learned by scvis preserve both the local and global neighbour structures in the data. In addition, scvis is robust to the number of data points and learns a probabilistic parametric mapping function to add new data points to an existing embedding. We then use scvis to analyze four single-cell RNA-sequencing datasets, exemplifying interpretable two-dimensional representations of the high-dimensional single-cell RNA-sequencing data.

