Synthetic single cell RNA sequencing data from small pilot studies using deep generative models

2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Martin Treppner ◽  
Adrián Salas-Bastos ◽  
Moritz Hess ◽  
Stefan Lenz ◽  
Tanja Vogel ◽  
...  

Abstract. Deep generative models, such as variational autoencoders (VAEs) or deep Boltzmann machines (DBMs), can generate an arbitrary number of synthetic observations after being trained on an initial set of samples. This has mainly been investigated for imaging data but could also be useful for single-cell transcriptomics (scRNA-seq). A small pilot study could be used for planning a full-scale experiment by investigating planned analysis strategies on synthetic data with different sample sizes. It is unclear whether synthetic observations generated based on a small scRNA-seq dataset reflect the properties relevant for subsequent data analysis steps. We specifically investigated two deep generative modeling approaches, VAEs and DBMs. First, we considered single-cell variational inference (scVI) in two variants, generating samples either from the posterior distribution (the standard approach) or from the prior distribution. Second, we propose single-cell deep Boltzmann machines (scDBMs). When considering the similarity of clustering results on synthetic data to ground-truth clustering, we find that the $scVI_{posterior}$ variant resulted in high variability, most likely due to amplifying artifacts of small datasets. All approaches showed mixed results for cell types with different abundance by overrepresenting highly abundant cell types and missing less abundant cell types. With increasing pilot dataset sizes, the proportions of the cells in each cluster became more similar to those of the ground-truth data. We also showed that all approaches learn the univariate distribution of most genes, but problems occurred with bimodality. Across all analyses, in comparing 10× Genomics and Smart-seq2 technologies, we could show that for 10× datasets, which have higher sparsity, it is more challenging to make inferences from small to larger datasets.
Overall, the results show that generative deep learning approaches might be valuable for supporting the design of scRNA-seq experiments.
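
The comparison of per-cluster cell proportions between synthetic and ground-truth data can be illustrated with a minimal sketch. It is not the authors' evaluation code: the helper names, the toy labels, and the choice of total variation distance as the similarity measure are all assumptions made for illustration.

```python
from collections import Counter


def cluster_proportions(labels):
    """Return the fraction of cells assigned to each cluster label."""
    counts = Counter(labels)
    total = len(labels)
    return {label: n / total for label, n in counts.items()}


def total_variation_distance(p, q):
    """Half the L1 distance between two proportion dicts (0 means identical)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Hypothetical labels: the abundant type "A" is overrepresented in the
# synthetic data and the rare type "C" is missing entirely.
ground_truth = ["A"] * 60 + ["B"] * 30 + ["C"] * 10
synthetic = ["A"] * 75 + ["B"] * 25

tvd = total_variation_distance(
    cluster_proportions(ground_truth),
    cluster_proportions(synthetic),
)
print(f"total variation distance: {tvd:.2f}")
```

A distance near 0 would indicate that the synthetic data reproduce the cluster composition; in this toy case the missing rare cluster and the inflated abundant cluster both contribute to the distance.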

2020 ◽  
Author(s):  
Martin Treppner ◽  
Adrián Salas-Bastos ◽  
Moritz Hess ◽  
Stefan Lenz ◽  
Tanja Vogel ◽  
...  

Abstract. Deep generative models, such as variational autoencoders (VAEs) or deep Boltzmann machines (DBMs), can generate an arbitrary number of synthetic observations after being trained on an initial set of samples. This has mainly been investigated for imaging data but could also be useful for single-cell transcriptomics (scRNA-seq). A small pilot study could be used for planning a full-scale study by investigating planned analysis strategies on synthetic data with different sample sizes. It is unclear whether synthetic observations generated based on a small scRNA-seq dataset reflect the properties relevant for subsequent data analysis steps. We specifically investigated two deep generative modeling approaches, VAEs and DBMs. First, we considered single-cell variational inference (scVI) in two variants, generating samples either from the posterior distribution (the standard approach) or from the prior distribution. Second, we propose single-cell deep Boltzmann machines (scDBM). When considering the similarity of clustering results on synthetic data to ground-truth clustering, we find that the $scVI_{posterior}$ variant resulted in high variability, most likely due to amplifying artifacts of small data sets. All approaches showed mixed results for cell types with different abundance by overrepresenting highly abundant cell types and missing less abundant cell types. With increasing pilot dataset sizes, the proportions of the cells in each cluster became more similar to those of the ground-truth data. We also showed that all approaches learn the univariate distribution of most genes, but problems occurred with bimodality. Overall, the results showed that generative deep learning approaches might be valuable for supporting the design of scRNA-seq experiments.


Author(s):  
Peter Rupprecht ◽  
Stefano Carta ◽  
Adrian Hoffmann ◽  
Mayumi Echizen ◽  
Kazuo Kitamura ◽  
...  

Abstract. Calcium imaging is a key method to record patterns of neuronal activity across populations of identified neurons. Inference of temporal patterns of action potentials (‘spikes’) from calcium signals is, however, challenging and often limited by the scarcity of ground truth data containing simultaneous measurements of action potentials and calcium signals. To overcome this problem, we compiled a large and diverse ground truth database from publicly available and newly performed recordings. This database covers various types of calcium indicators, cell types, and signal-to-noise ratios and comprises a total of >20 hours from 225 neurons. We then developed a novel algorithm for spike inference (CASCADE) that is based on supervised deep networks, takes advantage of the ground truth database, infers absolute spike rates, and outperforms existing model-based algorithms. To optimize performance for unseen imaging data, CASCADE retrains itself by resampling ground truth data to match the respective sampling rate and noise level. As a consequence, no parameters need to be adjusted by the user. To facilitate routine application of CASCADE, we developed systematic performance assessments for unseen data, we openly release all resources, and we provide a user-friendly cloud-based implementation.


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Ruizhu Huang ◽  
Charlotte Soneson ◽  
Pierre-Luc Germain ◽  
Thomas S.B. Schmidt ◽  
Christian Von Mering ◽  
...  

Abstract. treeclimbR is for analyzing hierarchical trees of entities, such as phylogenies or cell types, at different resolutions. It proposes multiple candidates that capture the latent signal and pinpoints branches or leaves that contain features of interest, in a data-driven way. It outperforms currently available methods on synthetic data, and we highlight the approach on various applications, including microbiome and microRNA surveys as well as single-cell cytometry and RNA-seq datasets. With the emergence of various multi-resolution genomic datasets, treeclimbR provides a thorough inspection on entities across resolutions and gives additional flexibility to uncover biological associations.


2019 ◽  
Author(s):  
Chenling Xu ◽  
Romain Lopez ◽  
Edouard Mehlman ◽  
Jeffrey Regier ◽  
Michael I. Jordan ◽  
...  

Abstract. As single-cell transcriptomics becomes a mainstream technology, the natural next step is to integrate the accumulating data in order to achieve a common ontology of cell types and states. However, owing to various nuisance factors of variation, it is not straightforward to compare gene expression levels across data sets or to automatically assign cell type labels in a new data set based on existing annotations. In this manuscript, we demonstrate that our previously developed method, scVI, provides an effective and fully probabilistic approach for joint representation and analysis of cohorts of single-cell RNA-seq data sets, while accounting for uncertainty caused by biological and measurement noise. We also introduce single-cell ANnotation using Variational Inference (scANVI), a semi-supervised variant of scVI designed to leverage any available cell state annotations, for instance when only one data set in a cohort is annotated, or when only a few cells in a single data set can be labeled using marker genes. We demonstrate that scVI and scANVI compare favorably to the existing methods for data integration and cell state annotation in terms of accuracy, scalability, and adaptability to challenging settings such as a hierarchical structure of cell state labels. We further show that, unlike existing methods, scVI and scANVI represent the integrated datasets with a single generative model that can be directly used for any probabilistic decision making task, using differential expression as our case study. scVI and scANVI are available as open source software and can be readily used to facilitate cell state annotation and help ensure consistency and reproducibility across studies.


2021 ◽  
Vol 9 ◽  
Author(s):  
Cindy X. Chen ◽  
Han Sang Park ◽  
Hillel Price ◽  
Adam Wax

Holographic cytometry is an ultra-high throughput quantitative phase imaging modality that is capable of extracting subcellular information from millions of cells flowing through parallel microfluidic channels. In this study, we present our findings on the application of holographic cytometry to distinguishing carcinogen-exposed cells from normal cells and cancer cells. This has potential application for environmental monitoring and cancer detection by analysis of cytology samples acquired via brushing or fine needle aspiration. By leveraging the vast amount of cell imaging data, we are able to build single-cell-analysis-based biophysical phenotype profiles on the examined cell lines. Multiple physical characteristics of these cells show observable distinct traits between the three cell types. Logistic regression analysis provides insight on which traits are more useful for classification. Additionally, we demonstrate that deep learning is a powerful tool that can potentially identify phenotypic differences from reconstructed single-cell images. The high classification accuracy levels show the platform’s potential in being developed into a diagnostic tool for abnormal cell screening.


Author(s):  
I. Toschi ◽  
F. Remondino ◽  
R. Rothe ◽  
K. Klimek

Abstract. Hybrid sensor solutions that feature active laser and passive image sensors on the same platform are rapidly entering the airborne market of topographic and urban mapping, offering new opportunities for an improved quality of geo-spatial products. In this perspective, a concurrent acquisition of LiDAR data and oblique imagery seems to have all the potential to lead the airborne (urban) mapping sector a step forward. This contribution focuses on the first commercial example of such an integrated, all-in-one mapping solution, namely the Leica CityMapper hybrid sensor. By analysing two CityMapper datasets acquired over the cities of Heilbronn (Germany) and Bordeaux (France), the paper investigates potential and challenges w.r.t. (i) number and distribution of tie points between nadir and oblique images, (ii) strategy for image aerial triangulation (AT) and accuracy achievable w.r.t. ground truth data, (iii) local noise level and completeness of dense image matching (DIM) point clouds w.r.t. LiDAR data. Solutions for an integrated processing of the concurrently acquired ranging and imaging data are proposed that open new opportunities for exploiting the real potential of both data sources.


2021 ◽  
Author(s):  
Tao Peng ◽  
Gregory M. Chen ◽  
Kai Tan

Abstract. Single-cell omics assays have become essential tools for identifying and characterizing cell types and states of complex tissues. While each single-modality assay reveals distinctive features about the sequenced cells, true multi-omics assays are still in an early stage of development. This signifies the importance of computationally integrating single-cell omics data that are collected from various samples across various modalities. In addition, the advent of multiplexed molecular imaging assays has given rise to a need for computational methods for integrative analysis of single-cell imaging and omics data. Here, we present GLUER (inteGrative anaLysis of mUlti-omics at single-cEll Resolution), a flexible tool for integration of single-cell multi-omics data and imaging data. Using multiple true multi-omics data sets as the ground truth, we demonstrate that GLUER achieved significant improvement over existing methods in terms of the accuracy of matching cells across different data modalities, thereby improving downstream analyses such as clustering and trajectory inference. We further demonstrate the broad utility of GLUER for integrating single-cell transcriptomics data with imaging-based spatial proteomics and transcriptomics data. Finally, we extend GLUER to leverage true cell-pair labels when available in true multi-omics data, and show that this approach improves co-embedding and clustering results. With the rapid accumulation of single-cell multi-omics and imaging data, integrated data holds the promise of furthering our understanding of the role of heterogeneity in development and disease.


2005 ◽  
Vol 17 (11) ◽  
pp. 2482-2507 ◽  
Author(s):  
Qi Zhao ◽  
David J. Miller

The goal of semisupervised clustering/mixture modeling is to learn the underlying groups comprising a given data set when there is also some form of instance-level supervision available, usually in the form of labels or pairwise sample constraints. Most prior work with constraints assumes the number of classes is known, with each learned cluster assumed to be a class and, hence, subject to the given class constraints. When the number of classes is unknown or when the one-cluster-per-class assumption is not valid, the use of constraints may actually be deleterious to learning the ground-truth data groups. We address this by (1) allowing allocation of multiple mixture components to individual classes and (2) estimating both the number of components and the number of classes. We also address new class discovery, with components void of constraints treated as putative unknown classes. For both real-world and synthetic data, our method is shown to accurately estimate the number of classes and to give favorable comparison with the recent approach of Shental, Bar-Hillel, Hertz, and Weinshall (2003).


2020 ◽  
Author(s):  
Almut Luetge ◽  
Joanna Zyprych-Walczak ◽  
Urszula Brykczynska Kunzmann ◽  
Helena L Crowell ◽  
Daniela Calini ◽  
...  

A key challenge in single cell RNA-sequencing (scRNA-seq) data analysis is the presence of dataset- and batch-specific differences that can obscure the biological signal of interest. While there are various tools and methods to perform data integration and correct for batch effects, their performance can vary between datasets and according to the nature of the bias. Therefore, it is important to understand how batch effects manifest in order to adjust for them in a reliable way. Here, we systematically explore batch effects in a variety of scRNA-seq datasets according to magnitude, cell type specificity and complexity. We developed a cell-specific mixing score (cms) that quantifies how well cells from multiple batches are mixed. By considering distance distributions (in a lower dimensional space), the score is able to detect local batch bias and differentiate between unbalanced batches (i.e., when one cell type is more abundant in a batch) and systematic differences between cells of the same cell type. We implemented cms and related metrics to detect batch effects or measure structure preservation in the CellMixS R/Bioconductor package. We systematically compare different metrics that have been proposed to quantify batch effects or bias in scRNA-seq data using real datasets with known batch effects and synthetic data that mimic various real data scenarios. While these metrics target the same question and are used interchangeably, we find differences in inter- and intra-dataset scalability, sensitivity and in a metric's ability to handle batch effects with differentially abundant cell types. We find that cell-specific metrics outperform cell type-specific and global metrics and recommend them for both method benchmarks and batch exploration.


2021 ◽  
Vol 12 ◽  
Author(s):  
John W. Hickey ◽  
Yuqi Tan ◽  
Garry P. Nolan ◽  
Yury Goltsev

Multiplexed imaging is a recently developed and powerful single-cell biology research tool. However, it presents new sources of technical noise that are distinct from other types of single-cell data, necessitating new practices for single-cell multiplexed imaging processing and analysis, particularly regarding cell-type identification. Here we created single-cell multiplexed imaging datasets by performing CODEX on four sections of the human colon (ascending, transverse, descending, and sigmoid) using a panel of 47 oligonucleotide-barcoded antibodies. After cell segmentation, we implemented five different normalization techniques crossed with four unsupervised clustering algorithms, resulting in 20 unique cell-type annotations for the same dataset. We generated two standard annotations: hand-gated cell types and cell types produced by over-clustering with spatial verification. We then compared these annotations at four levels of cell-type granularity. First, increasing cell-type granularity led to decreased labeling accuracy; therefore, subtle phenotype annotations should be avoided at the clustering step. Second, accuracy in cell-type identification varied more with normalization choice than with clustering algorithm. Third, unsupervised clustering better accounted for segmentation noise during cell-type annotation than hand-gating. Fourth, Z-score normalization was generally effective in mitigating the effects of noise from single-cell multiplexed imaging. Variation in cell-type identification will lead to significant differential spatial results such as cellular neighborhood analysis; consequently, we also make recommendations for accurately assigning cell-type labels to CODEX multiplexed imaging.
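
Z-score normalization, as named in the abstract above, can be sketched in a few lines. This is only an illustrative sketch: applying it to one marker's intensities across cells, the use of the population standard deviation, and the toy values are assumptions, not details taken from the study.

```python
from statistics import mean, pstdev


def z_score(values):
    """Center values on their mean and scale by the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant marker carries no contrast; map every cell to 0.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical intensities of a single marker measured across eight cells.
intensities = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
normalized = z_score(intensities)
print(normalized)
```

After this transformation each marker has mean 0 and unit variance across cells, so markers with very different raw intensity ranges contribute on a comparable scale to a downstream clustering step.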

