beachmat: a Bioconductor C++ API for accessing single-cell genomics data from a variety of R matrix types

Abstract: Recent advances in single-cell RNA sequencing have dramatically increased the number of cells that can be profiled in a single experiment. This provides unparalleled resolution to study cellular heterogeneity within biological processes such as differentiation. However, the explosion of data that are generated from such experiments poses a challenge to the existing computational infrastructure for statistical data analysis. In particular, large matrices holding expression values for each gene in each cell require sparse or file-backed representations for manipulation with the popular R programming language. Here, we describe a C++ interface named beachmat, which enables agnostic data access from various matrix representations. This allows package developers to write efficient C++ code that is interoperable with simple, sparse and HDF5-backed matrices, amongst others. We perform simulations to examine the performance of beachmat on each matrix representation, and we demonstrate how beachmat can be incorporated into the code of other packages to drive analyses of a very large single-cell data set.
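The idea of representation-agnostic access can be sketched in a few lines. The example below is only an illustration in Python, not beachmat's actual C++ API: the class names `ColumnReader`, `DenseMatrix` and `SparseCSCMatrix` and the function `column_sums` are hypothetical. It shows downstream code written once against a common column-access interface and run unchanged on a dense and a compressed sparse column (CSC) matrix.

```python
from abc import ABC, abstractmethod


class ColumnReader(ABC):
    """Hypothetical common interface: callers only ever ask for a column."""

    @abstractmethod
    def ncol(self):
        ...

    @abstractmethod
    def get_col(self, j):
        ...


class DenseMatrix(ColumnReader):
    """Dense storage: a list of rows, every entry stored explicitly."""

    def __init__(self, rows):
        self.rows = rows

    def ncol(self):
        return len(self.rows[0])

    def get_col(self, j):
        return [row[j] for row in self.rows]


class SparseCSCMatrix(ColumnReader):
    """Compressed sparse column storage: only non-zero entries are kept."""

    def __init__(self, nrow, values, row_idx, col_ptr):
        self.nrow = nrow
        self.values = values      # non-zero values, column by column
        self.row_idx = row_idx    # row index of each non-zero value
        self.col_ptr = col_ptr    # start offset of each column in `values`

    def ncol(self):
        return len(self.col_ptr) - 1

    def get_col(self, j):
        col = [0.0] * self.nrow
        for k in range(self.col_ptr[j], self.col_ptr[j + 1]):
            col[self.row_idx[k]] = self.values[k]
        return col


def column_sums(mat):
    """Downstream code written once against the interface."""
    return [sum(mat.get_col(j)) for j in range(mat.ncol())]


dense = DenseMatrix([[1.0, 0.0, 2.0], [0.0, 0.0, 3.0]])
# The same 2 x 3 matrix in CSC form.
sparse = SparseCSCMatrix(nrow=2, values=[1.0, 2.0, 3.0],
                         row_idx=[0, 0, 1], col_ptr=[0, 1, 1, 3])

print(column_sums(dense))
print(column_sums(sparse))
```

Both calls go through the same `column_sums` function; only the storage class differs.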

SpatialExperiment: infrastructure for spatially resolved transcriptomics data in R using Bioconductor

10.1101/2021.01.27.428431 ◽

2021 ◽

Author(s):

Dario Righelli ◽

Lukas M. Weber ◽

Helena L. Crowell ◽

Brenda Pardo ◽

Leonardo Collado-Torres ◽

...

Keyword(s):

Single Cell ◽

Spatial Information ◽

Data Infrastructure ◽

R Programming Language ◽

Spatially Resolved ◽

Transcriptomics Data ◽

Visualization Tools ◽

R Programming ◽

Cell Data ◽

Technological Platforms

Abstract

Motivation: Spatially resolved transcriptomics is a new set of technologies to measure gene expression for up to thousands of genes at near-single-cell, single-cell, or sub-cellular resolution, together with the spatial positions of the measurements. Analyzing combined molecular and spatial information has generated new insights about biological processes that manifest in a spatial manner within tissues. However, to efficiently analyze these data, specialized data infrastructure is required, which facilitates storage, retrieval, subsetting, and interfacing with downstream tools.

Results: Here, we describe SpatialExperiment, a new data infrastructure for storing and accessing spatially resolved transcriptomics data, implemented within the Bioconductor framework in the R programming language. SpatialExperiment extends the existing SingleCellExperiment for single-cell data from the Bioconductor framework, which brings with it advantages of modularity, interoperability, standardized operations, and comprehensive documentation. We demonstrate the structure and user interface with examples from the 10x Genomics Visium and seqFISH platforms. SpatialExperiment is extendable to alternative technological platforms measuring expression and to new types of data modalities, such as spatial immunofluorescence or proteomics, in the future. We also provide access to example datasets and visualization tools in the STexampleData, TENxVisiumData, and ggspavis packages.

Availability and Implementation: SpatialExperiment is freely available from Bioconductor at https://bioconductor.org/packages/SpatialExperiment. The STexampleData, TENxVisiumData, and ggspavis packages are available from GitHub and will be submitted to Bioconductor.

PARC: ultrafast and accurate clustering of phenotypic data of millions of single cells

10.1101/765628 ◽

2019 ◽

Author(s):

Shobana V. Stassen ◽

Dickson M. D. Siu ◽

Kelvin C. M. Lee ◽

Joshua W. K. Ho ◽

Hayden K. H. So ◽

...

Keyword(s):

Single Cell ◽

Large Scale ◽

Clustering Algorithm ◽

Single Cells ◽

Clustering Algorithms ◽

Cell Mass ◽

Cellular Heterogeneity ◽

Phenotypic Data ◽

Data Set ◽

Cell Data

Abstract

Motivation: New single-cell technologies continue to fuel the explosive growth in the scale of heterogeneous single-cell data. However, existing computational methods are inadequately scalable to large datasets and therefore cannot uncover the complex cellular heterogeneity.

Results: We introduce a highly scalable graph-based clustering algorithm, PARC (phenotyping by accelerated refined community-partitioning), for ultralarge-scale, high-dimensional single-cell data (> 1 million cells). Using large single-cell mass cytometry, RNA-seq and imaging-based biophysical data, we demonstrate that PARC consistently outperforms state-of-the-art clustering algorithms without sub-sampling of cells, including Phenograph, FlowSOM, and Flock, in terms of both speed and ability to robustly detect rare cell populations. For example, PARC can cluster a single-cell data set of 1.1M cells within 13 minutes, compared with >2 hours for the next fastest graph-clustering algorithm, Phenograph. Our work presents a scalable algorithm to cope with increasingly large-scale single-cell analysis.

Availability and Implementation: https://github.com/ShobiStassen/PARC

VoPo leverages cellular heterogeneity for predictive modeling of single-cell data

Nature Communications ◽

10.1038/s41467-020-17569-8 ◽

2020 ◽

Vol 11 (1) ◽

Cited By ~ 2

Author(s):

Natalie Stanley ◽

Ina A. Stelzer ◽

Amy S. Tsai ◽

Ramin Fallahzadeh ◽

Edward Ganio ◽

...

Keyword(s):

Single Cell ◽

Predictive Modeling ◽

Cellular Heterogeneity ◽

Cell Data

CIM-seq

10.21203/rs.3.pex-1365/v1 ◽

2021 ◽

Author(s):

Nathanael Andrews ◽

Martin Enge

Keyword(s):

Single Cell ◽

Single Cells ◽

Likelihood Estimation ◽

Cell Types ◽

Data Sets ◽

Target Tissue ◽

Data Set ◽

Rnaseq Data ◽

The Given ◽

Cell Data

Abstract: CIM-seq is a tool for deconvoluting RNA-seq data from cell multiplets (clusters of two or more cells) in order to identify physically interacting cells in a given tissue. The method requires two RNA-seq data sets from the same tissue: one of single cells to be used as a reference, and one of cell multiplets to be deconvoluted. CIM-seq is compatible with both droplet-based sequencing methods, such as Chromium Single Cell 3′ Kits from 10x Genomics, and plate-based methods, such as Smartseq2. The pipeline consists of three parts: 1) dissociation of the target tissue, FACS sorting of single cells and multiplets, and conventional scRNA-seq; 2) feature selection and clustering of cell types in the single-cell data set, generating a blueprint of transcriptional profiles in the given tissue; 3) computational deconvolution of multiplets through maximum likelihood estimation (MLE) to determine the most likely cell type constituents of each multiplet.
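The deconvolution step (part 3) can be illustrated with a toy maximum likelihood search. This is a simplified sketch and not the CIM-seq implementation: it assumes, purely for illustration, that each multiplet is a pair of cell types, that the expected counts are the sum of the two reference profiles, and that counts follow a Poisson distribution. The function names and the data are hypothetical.

```python
import math
from itertools import combinations


def poisson_log_likelihood(counts, expected):
    """Log-likelihood of observed counts under independent Poisson means."""
    return sum(c * math.log(mu) - mu - math.lgamma(c + 1)
               for c, mu in zip(counts, expected))


def most_likely_pair(multiplet, profiles):
    """Return the pair of cell types whose summed profile best explains the multiplet."""
    best_pair, best_ll = None, -math.inf
    for a, b in combinations(sorted(profiles), 2):
        # Assumption: a doublet's expected expression is the sum of both profiles.
        expected = [x + y for x, y in zip(profiles[a], profiles[b])]
        ll = poisson_log_likelihood(multiplet, expected)
        if ll > best_ll:
            best_pair, best_ll = (a, b), ll
    return best_pair


# Hypothetical reference profiles (mean counts for three genes per cell type),
# as would be derived from the single-cell reference data set.
profiles = {
    "A": [10.0, 1.0, 1.0],
    "B": [1.0, 10.0, 1.0],
    "C": [1.0, 1.0, 10.0],
}
multiplet_counts = [11, 2, 11]

best = most_likely_pair(multiplet_counts, profiles)
print(best)
```

In this toy data set the multiplet expresses the marker genes of types A and C, so that pair has the highest likelihood.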

scTriangulate: Decision-level integration of multimodal single-cell data

10.1101/2021.10.16.464640 ◽

2021 ◽

Author(s):

Guangyuan Li ◽

Song Baobao ◽

H. L Grimes ◽

V. B. Surya Prasath ◽

Nathan L Salomonis

Keyword(s):

Single Cell ◽

Cooperative Game Theory ◽

Cellular Heterogeneity ◽

Cell Populations ◽

Multimodal Analysis ◽

Distinct Cell ◽

Cell Data ◽

Mix And Match ◽

Computational Strategy ◽

Statistical Metrics

Hundreds of bioinformatics approaches now exist to define cellular heterogeneity from single-cell genomics data. Reconciling conflicts between diverse methods, algorithm settings, annotations or modalities has the potential to clarify which populations are real and establish reusable reference atlases. Here, we present a customizable computational strategy called scTriangulate, which leverages cooperative game theory to intelligently mix-and-match clustering solutions from different resolutions, algorithms, reference atlases, or multi-modal measurements. This algorithm relies on a series of robust statistical metrics for cluster stability that work across molecular modalities to identify high-confidence integrated annotations. When applied to annotations from diverse competing cell atlas projects, this approach is able to resolve conflicts and determine the validity of controversial cell population predictions. Tested with scRNA-Seq, CITE-Seq (RNA + surface ADT), multiome (RNA + ATAC), and TEA-Seq (RNA + surface ADT + ATAC), this approach identifies highly stable and reproducible, known and novel cell populations, while excluding clusters defined by technical artifacts (i.e., doublets). Importantly, we find that distinct cell populations are frequently attributed with features from different modalities (RNA, ATAC, ADT) in the same assay, highlighting the importance of multimodal analysis in cluster determination. Because it is flexible, this approach can be updated with new user-defined statistical metrics to alter the decision engine and customized to new measures of stability for different measures of cellular activity.

Supervised dimensionality reduction for exploration of single-cell data by Hybrid Subset Selection - Linear Discriminant Analysis

10.1101/2022.01.06.475279 ◽

2022 ◽

Author(s):

Meelad Amouzgar ◽

David R Glass ◽

Reema Baskar ◽

Inna Averbukh ◽

Samuel C Kimmey ◽

...

Keyword(s):

Discriminant Analysis ◽

Dimensionality Reduction ◽

Linear Discriminant Analysis ◽

Single Cell ◽

Cell Mass ◽

Subset Selection ◽

Cellular Heterogeneity ◽

Linear Discriminant ◽

Original Dataset ◽

Cell Data

Single-cell technologies generate large, high-dimensional datasets encompassing a diversity of omics. Dimensionality reduction enables visualization of data by representing cells in two-dimensional plots that capture the structure and heterogeneity of the original dataset. Visualizations contribute to human understanding of data and are useful for guiding both quantitative and qualitative analysis of cellular relationships. Existing algorithms are typically unsupervised, utilizing only measured features to generate manifolds and disregarding known biological labels such as cell type or experimental timepoint. Here, we repurpose the classification algorithm linear discriminant analysis (LDA) for supervised dimensionality reduction of single-cell data. LDA identifies linear combinations of predictors that optimally separate a priori classes, enabling users to tailor visualizations to separate specific aspects of cellular heterogeneity. We implement feature selection by hybrid subset selection (HSS) and demonstrate that this flexible, computationally efficient approach generates non-stochastic, interpretable axes amenable to diverse biological processes, such as differentiation over time and cell cycle. We benchmark HSS-LDA against several popular dimensionality reduction algorithms and illustrate its utility and versatility for exploration of single-cell mass cytometry, transcriptomics and chromatin accessibility data.
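As a small illustration of how LDA finds a linear combination of predictors that separates known classes, the sketch below computes the classic two-class Fisher discriminant direction for two features. It is plain LDA only, without the hybrid subset selection step, and is not the authors' implementation; the data and function names are made up for the example.

```python
def mean(points):
    n = len(points)
    return [sum(p[0] for p in points) / n, sum(p[1] for p in points) / n]


def scatter(points, m):
    """2 x 2 scatter matrix of the points around their mean m."""
    sxx = sum((p[0] - m[0]) ** 2 for p in points)
    sxy = sum((p[0] - m[0]) * (p[1] - m[1]) for p in points)
    syy = sum((p[1] - m[1]) ** 2 for p in points)
    return [[sxx, sxy], [sxy, syy]]


def lda_direction(class_a, class_b):
    """Fisher discriminant direction w = Sw^-1 (mean_b - mean_a)."""
    ma, mb = mean(class_a), mean(class_b)
    sa, sb = scatter(class_a, ma), scatter(class_b, mb)
    # Within-class scatter is the sum of the per-class scatter matrices.
    sw = [[sa[i][j] + sb[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    d = [mb[0] - ma[0], mb[1] - ma[1]]
    return [inv[0][0] * d[0] + inv[0][1] * d[1],
            inv[1][0] * d[0] + inv[1][1] * d[1]]


def project(point, w):
    """Coordinate of a point on the discriminant axis."""
    return point[0] * w[0] + point[1] * w[1]


# Two hypothetical labelled groups of cells measured on two features.
# The groups overlap on each feature alone but not along the LDA axis.
class_a = [(0.0, 0.0), (1.0, 2.0), (2.0, 1.0), (3.0, 3.0)]
class_b = [(1.0, -2.0), (2.0, 0.0), (3.0, -1.0), (4.0, 1.0)]

w = lda_direction(class_a, class_b)
proj_a = [project(p, w) for p in class_a]
proj_b = [project(p, w) for p in class_b]
print(w, proj_a, proj_b)
```

Because the axis is computed in closed form from the class labels, repeated runs give the same result, which is what makes the resulting axes non-stochastic.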

Data Analytics in Dentistry Using R Programming Software

Journal of Immunology and Allergy ◽

10.37191/mapsci-2582-4333-3(1)-056 ◽

2021 ◽

Author(s):

Sriram Thirugnanam

Keyword(s):

Full Potential ◽

Dental Practice ◽

Data Set ◽

R Programming Language ◽

Clinical Dataset ◽

Health Authorities ◽

Dental Practices ◽

Programming Software ◽

R Programming ◽

Descriptive Statistical Analysis

Dental practices collect large amounts of clinical and non-clinical data from their patients. Whether those data have been utilized to their full potential is highly questionable. This study used the R programming language on a five-year simulated dental clinical dataset to statistically analyze various possibilities to improve clinical practice and promote awareness among patients. The data set consists of all possible dental treatments that are offered in routine dental practice. The analysis is based on a single dental practice, unlike the yearly statistics published by health authorities for an entire county or country, which cannot address the unique requirements and challenges associated with each individual practice and community. Descriptive statistical analysis of the dataset was performed through histograms, scatter plots, and tests of normality, along with correlation analysis with plots (Pearson or Spearman, depending on p.1). Variables were then compared using multiple regression analysis and forecasting, and the accuracy was finally estimated using MAE, MAPE, R-squared, and k-fold cross-validation.
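For reference, the three accuracy measures named above have simple definitions. The sketch below is in Python rather than the R used in the study, and the observed and predicted values are invented for illustration; it is not the study's code.

```python
def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent (assumes no true value is zero)."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical observed and forecast values, e.g. monthly treatment counts.
observed = [10.0, 20.0, 30.0, 40.0]
forecast = [12.0, 18.0, 33.0, 36.0]

print(mae(observed, forecast))
print(mape(observed, forecast))
print(r_squared(observed, forecast))
```

In k-fold cross-validation these measures would be computed on each held-out fold and then averaged.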

Quasi-universality in single-cell sequencing data

10.1101/426239 ◽

2018 ◽

Cited By ~ 2

Author(s):

Luis Aparicio ◽

Mykola Bordyuh ◽

Andrew J. Blumberg ◽

Raul Rabadan

Keyword(s):

Single Cell ◽

Matrix Theory ◽

Biological Information ◽

Sequencing Data ◽

Data Set ◽

Single Cell Sequencing ◽

Marked Cell ◽

Eigenvector Localization ◽

Cell Data ◽

Epigenetic Processes

Abstract: The development of single-cell technologies provides the opportunity to identify new cellular states and reconstruct novel cell-to-cell relationships. Applications range from understanding the transcriptional and epigenetic processes involved in metazoan development to characterizing distinct cell types in heterogeneous populations like cancers or immune cells. However, analysis of the data is impeded by its unknown intrinsic biological and technical variability together with its sparseness; these factors complicate the identification of true biological signals amidst artifact and noise. Here we show that, across technologies, roughly 95% of the eigenvalues derived from each single-cell data set can be described by universal distributions predicted by Random Matrix Theory. Interestingly, 5% of the spectrum shows deviations from these distributions and presents a phenomenon known as eigenvector localization, where information tightly concentrates in groups of cells. Some of the localized eigenvectors reflect underlying biological signal, and some are simply a consequence of the sparsity of single-cell data; roughly 3% is artifactual. Based on the universal distributions and a technique for detecting sparsity-induced localization, we present a strategy to identify the residual 2% of directions that encode biological information and thereby denoise single-cell data. We demonstrate the effectiveness of this approach by comparing with standard single-cell data analysis techniques in a variety of examples with marked cell populations.
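The abstract does not say which universal distributions are used. As an assumption for illustration only, the sketch below uses the Marchenko-Pastur law, a standard Random Matrix Theory result for the eigenvalues of a sample covariance matrix built from pure noise, to separate eigenvalues that fall inside the predicted noise bulk from those above it. The function names and eigenvalues are hypothetical, and the sketch does not cover eigenvector localization or the sparsity correction described above.

```python
import math


def marchenko_pastur_bounds(n_samples, n_features, variance=1.0):
    """Edges of the Marchenko-Pastur bulk for a noise-only sample covariance matrix.

    With gamma = n_features / n_samples, the bulk spans
    variance * (1 - sqrt(gamma))^2 to variance * (1 + sqrt(gamma))^2.
    """
    gamma = n_features / n_samples
    lower = variance * (1.0 - math.sqrt(gamma)) ** 2
    upper = variance * (1.0 + math.sqrt(gamma)) ** 2
    return lower, upper


def split_eigenvalues(eigenvalues, n_samples, n_features, variance=1.0):
    """Split eigenvalues into those inside the noise bulk and those above it."""
    lower, upper = marchenko_pastur_bounds(n_samples, n_features, variance)
    bulk = [e for e in eigenvalues if lower <= e <= upper]
    above = [e for e in eigenvalues if e > upper]
    return bulk, above


# Hypothetical eigenvalues of a covariance matrix with 100 features and 400 samples.
eigenvalues = [0.3, 1.0, 2.0, 5.7]
bounds = marchenko_pastur_bounds(n_samples=400, n_features=100)
bulk, above = split_eigenvalues(eigenvalues, n_samples=400, n_features=100)
print(bounds, bulk, above)
```

Eigenvalues above the upper edge are candidates for carrying signal rather than noise, which is the intuition behind treating the bulk as the universal part of the spectrum.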

Meeting the challenges of high-dimensional data analysis in immunology

10.1101/473215 ◽

2018 ◽

Cited By ~ 1

Author(s):

Subarna Palit ◽

Fabian J. Theis ◽

Christina E. Zielinski

Keyword(s):

Data Analysis ◽

Single Cell ◽

Immune Regulation ◽

Cellular Heterogeneity ◽

High Dimensionality ◽

High Dimensional ◽

Mass Cytometry ◽

Analysis Techniques ◽

The Impact ◽

Cell Data

Abstract: Recent advances in cytometry have radically altered the fate of single-cell proteomics by allowing a more accurate understanding of complex biological systems. Mass cytometry (CyTOF) provides simultaneous single-cell measurements that are crucial for understanding cellular heterogeneity and identifying novel cellular subsets. High-dimensional CyTOF data were traditionally analyzed by gating on bivariate dot plots, which is not only laborious, given the quadratic increase of complexity with dimension, but also biased through manual gating. This review aims to discuss the impact of new analysis techniques for in-depth insights into the dynamics of immune regulation obtained from static snapshot data and to provide tools to immunologists to address the high dimensionality of their single-cell data.

Clustering Deviation Index (CDI): A robust and accurate unsupervised measure for evaluating scRNA-seq data clustering

10.1101/2022.01.03.474840 ◽

2022 ◽

Author(s):

Jiyuan Fang ◽

Cliburn Chan ◽

Kouros Owzar ◽

Liuyang Wang ◽

Diyuan Qin ◽

...

Keyword(s):

Single Cell ◽

Data Clustering ◽

Goodness Of Fit ◽

Cellular Heterogeneity ◽

Clustering Methods ◽

Tuning Parameters ◽

Deviation Index ◽

Cell Clustering ◽

Single Cell Rna Sequencing ◽

Cell Data

Single-cell RNA-sequencing (scRNA-seq) technology allows us to explore cellular heterogeneity in the transcriptome. Because most scRNA-seq data analyses begin with cell clustering, its accuracy considerably impacts the validity of downstream analyses. Although many clustering methods have been developed, few tools are available to evaluate the clustering "goodness-of-fit" to the scRNA-seq data. In this paper, we propose a new Clustering Deviation Index (CDI) that measures the deviation of any clustering label set from the observed single-cell data. We conduct in silico and experimental scRNA-seq studies to show that CDI can select the optimal clustering label set. In particular, CDI also informs the optimal tuning parameters for any given clustering method and the correct number of cluster components.
