CYBERTRACK2.0: zero-inflated model-based cell clustering and population tracking method for longitudinal mass cytometry data

Abstract Summary Recent advancements in high-dimensional single-cell technologies, such as mass cytometry, enable longitudinal experiments to track dynamics of cell populations and identify change points where the proportions vary significantly. However, current research is limited by the lack of tools specialized for analyzing longitudinal mass cytometry data. In order to infer cell population dynamics from such data, we developed a statistical framework named CYBERTRACK2.0. The framework’s analytic performance was validated against synthetic and real data, showing that its results are consistent with previous research. Availability and implementation CYBERTRACK2.0 is available at https://github.com/kodaim1115/CYBERTRACK2. Supplementary information Supplementary data are available at Bioinformatics online.


Detecting common breaks in the means of high dimensional cross-dependent panels

Econometrics Journal ◽

10.1093/ectj/utab028 ◽

2021 ◽

Author(s):

Lajos Horváth ◽

Zhenya Liu ◽

Gregory Rice ◽

Yuqian Zhao

Keyword(s):

Panel Data ◽

Common Factors ◽

Real Data ◽

Change Points ◽

High Dimensional ◽

Asymptotic Results ◽

Cross Sectional ◽

Data Set ◽

Monte Carlo Simulation Study ◽

Cross Sectional Dependence

Abstract The problem of detecting change points in the mean of high dimensional panel data with potentially strong cross-sectional dependence is considered. Under the assumption that the cross-sectional dependence is captured by an unknown number of common factors, a new CUSUM type statistic is proposed. We derive its asymptotic properties under three scenarios depending on the extent to which the common factors are asymptotically dominant. With panel data consisting of N cross sectional time series of length T, the asymptotic results hold under the mild assumption that min {N, T} → ∞, with an otherwise arbitrary relationship between N and T, allowing the results to apply to most panel data examples. Bootstrap procedures are proposed to approximate the sampling distribution of the test statistics. A Monte Carlo simulation study showed that our test outperforms several other existing tests in finite samples in a number of cases, particularly when N is much larger than T. The practical application of the proposed results is demonstrated with real data applications to detecting and estimating change points in the high dimensional FRED-MD macroeconomic data set.
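
To make the CUSUM idea in the abstract above concrete, the following is a minimal sketch of a textbook CUSUM process for a change in the mean, aggregated over the N series of a panel. It is not the authors' statistic: it ignores common factors and cross-sectional dependence, and the function names `cusum_process` and `panel_change_point` as well as the toy panel are assumptions made for illustration.

```python
from itertools import accumulate


def cusum_process(series):
    """Return |S_k - (k/T) * S_T| for k = 1..T-1, where S_k is the k-th partial sum."""
    T = len(series)
    partial = list(accumulate(series))
    total = partial[-1]
    return [abs(partial[k - 1] - k / T * total) for k in range(1, T)]


def panel_change_point(panel):
    """Estimate one common break date for a panel (list of N series of equal length T).

    Hypothetical aggregation: squared CUSUM values are summed over the
    cross-sections and scaled by N * T. Returns (k_hat, statistic), where
    k_hat is the number of observations before the estimated break.
    """
    n_series = len(panel)
    T = len(panel[0])
    processes = [cusum_process(series) for series in panel]
    aggregated = [
        sum(proc[k] ** 2 for proc in processes) / (n_series * T)
        for k in range(T - 1)
    ]
    best = max(range(T - 1), key=aggregated.__getitem__)
    return best + 1, aggregated[best]


# Toy panel: both series change their mean after observation 50.
panel = [
    [0.0] * 50 + [1.0] * 50,
    [2.0] * 50 + [1.0] * 50,
]
k_hat, statistic = panel_change_point(panel)
print(k_hat, statistic)
```

In practice the maximum of the aggregated process would be compared with a critical value, for example one obtained by a bootstrap as the abstract describes; that step is omitted here.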


CyTOFmerge: integrating mass cytometry data across multiple panels

Bioinformatics ◽

10.1093/bioinformatics/btz180 ◽

2019 ◽

Vol 35 (20) ◽

pp. 4063-4071 ◽

Cited By ~ 3

Author(s):

Tamim Abdelaal ◽

Thomas Höllt ◽

Vincent van Unen ◽

Boudewijn P F Lelieveldt ◽

Frits Koning ◽

...

Keyword(s):

Single Cell ◽

Biological Sample ◽

Supplementary Information ◽

High Dimensional ◽

Single Cell Level ◽

Supplementary Data ◽

Mass Cytometry ◽

Cell Level ◽

Cellular Markers

Abstract Motivation High-dimensional mass cytometry (CyTOF) allows the simultaneous measurement of multiple cellular markers at single-cell level, providing a comprehensive view of cell compositions. However, the power of CyTOF to explore the full heterogeneity of a biological sample at the single-cell level is currently limited by the number of markers measured simultaneously on a single panel. Results To extend the number of markers per cell, we propose an in silico method to integrate CyTOF datasets measured using multiple panels that share a set of markers. Additionally, we present an approach to select the most informative markers from an existing CyTOF dataset to be used as a shared marker set between panels. We demonstrate the feasibility of our methods by evaluating the quality of clustering and neighborhood preservation of the integrated dataset, on two public CyTOF datasets. We illustrate that by computationally extending the number of markers we can further untangle the heterogeneity of mass cytometry data, including rare cell-population detection. Availability and implementation Implementation is available on GitHub (https://github.com/tabdelaal/CyTOFmerge). Supplementary information Supplementary data are available at Bioinformatics online.


A sequential algorithm to detect diffusion switching along intracellular particle trajectories

Bioinformatics ◽

10.1093/bioinformatics/btz489 ◽

2019 ◽

Vol 36 (1) ◽

pp. 317-329 ◽

Cited By ~ 1

Author(s):

Vincent Briane ◽

Myriam Vimond ◽

Cesar Augusto Valades-Cruz ◽

Antoine Salomon ◽

Christian Wunder ◽

...

Keyword(s):

Window Size ◽

Monte Carlo Study ◽

Real Data ◽

Change Points ◽

Supplementary Information ◽

Sequential Algorithm ◽

Microscopy Imaging ◽

Neuronal Dendrites ◽

Matlab Package ◽

The Times

Abstract Motivation Recent advances in molecular biology and fluorescence microscopy imaging have made possible the inference of the dynamics of single molecules in living cells. Changes of dynamics can occur along a trajectory. The issue is then to estimate the temporal change-points, that is, the times at which a change of dynamics occurs. The number of points in the trajectory required to detect such changes will depend on both the magnitude and type of the motion changes. Here, the number of points per trajectory is of the order of 10^2, even if in practice dramatic motion changes can be detected with fewer points. Results We propose a non-parametric procedure based on test statistics computed on local windows along the trajectory to detect the change-points. This algorithm controls the number of false change-point detections in the case where the trajectory is fully Brownian. We also develop a strategy for aggregating the detections obtained with different window sizes so that the window size is no longer a parameter to optimize. A Monte Carlo study is proposed to demonstrate the performances of the method and also to compare the procedure to two competitive algorithms. Finally, we illustrate the efficacy of the method on real data in 2D and 3D, depicting the motion of mRNA complexes (called mRNA-binding proteins) in neuronal dendrites, Galectin-3 endocytosis and trafficking within the cell. Availability and implementation A user-friendly Matlab package containing examples and the code of the simulations used in the paper is available at http://serpico.rennes.inria.fr/doku.php?id=software:cpanalysis:index. Supplementary information Supplementary data are available at Bioinformatics online.
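
The local-window idea in the abstract above can be illustrated with a short sketch. The score used here, the absolute log-ratio of mean squared step lengths in the windows before and after each time index, is a stand-in chosen for simplicity and is not the test statistic of the paper; `squared_steps`, `local_window_scores`, the window size and the toy trajectory are all hypothetical.

```python
import math


def squared_steps(trajectory):
    """Squared displacement between consecutive positions (tuples of coordinates)."""
    return [
        sum((b - a) ** 2 for a, b in zip(p, q))
        for p, q in zip(trajectory, trajectory[1:])
    ]


def local_window_scores(trajectory, window, eps=1e-12):
    """Score each step index i by comparing the windows [i - window, i) and [i, i + window).

    A large score suggests that the typical step size, and hence the
    diffusion behaviour, differs before and after index i.
    """
    steps = squared_steps(trajectory)
    scores = {}
    for i in range(window, len(steps) - window + 1):
        before = sum(steps[i - window:i]) / window
        after = sum(steps[i:i + window]) / window
        # eps guards against division by zero for motionless windows.
        scores[i] = abs(math.log((after + eps) / (before + eps)))
    return scores


# Toy 1D trajectory: 40 steps of length 1, then 40 steps of length 3.
x, trajectory = 0.0, [(0.0,)]
for i in range(80):
    size = 1.0 if i < 40 else 3.0
    x += size if i % 2 == 0 else -size
    trajectory.append((x,))

scores = local_window_scores(trajectory, window=10)
change_index = max(scores, key=scores.get)
print(change_index)
```

A real procedure would threshold such scores so that false detections are controlled under pure Brownian motion and would combine several window sizes, as the abstract describes; neither step is attempted in this sketch.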


Compositional Data Analysis using Kernels in Mass Cytometry Data

10.1101/2021.05.08.443265 ◽

2021 ◽

Author(s):

Pratyaydipta Rudra ◽

Ryan Baxter ◽

Elena WY Hsieh ◽

Debashis Ghosh

Keyword(s):

Data Analysis ◽

Lupus Erythematosus ◽

Compositional Data ◽

Small Sample ◽

Supplementary Information ◽

High Dimensional ◽

Compositional Data Analysis ◽

Cell Type ◽

Mass Cytometry ◽

Abundance Data

Motivation: Cell type abundance data arising from mass cytometry experiments are compositional in nature. Classical association tests do not apply to the compositional data due to their non-Euclidean nature. Existing methods for analysis of cell type abundance data suffer from several limitations for high-dimensional mass cytometry data, especially when the sample size is small. Results: We proposed a new multivariate statistical learning methodology, Compositional Data Analysis using Kernels (CODAK), based on the kernel distance covariance (KDC) framework to test the association of the cell type compositions with important predictors (categorical or continuous) such as disease status. CODAK scales well for high-dimensional data and provides satisfactory performance for small sample sizes (n<25). We conducted simulation studies to compare the performance of the method with existing methods of analyzing cell type abundance data from mass cytometry studies. The method is also applied to a high-dimensional dataset containing different subgroups of populations including Systemic Lupus Erythematosus (SLE) patients and healthy control subjects. Availability and Implementation: CODAK is implemented using R. The codes and the data used in this manuscript are available on the web at http://github.com/GhoshLab/CODAK/. Supplementary information: Supplementary Materials.pdf.


Analysis of variance when both input and output sets are high-dimensional

10.1101/2020.02.15.950949 ◽

2020 ◽

Author(s):

Gustavo de los Campos ◽

Torsten Pook ◽

Agustin Gonzalez-Raymundez ◽

Henner Simianer ◽

George Mias ◽

...

Keyword(s):

Gene Expression ◽

Linear Span ◽

Copy Number Variants ◽

Real Data ◽

Supplementary Information ◽

High Dimensional ◽

Supplementary Data ◽

Random Effects Models ◽

Input And Output ◽

Data Layers

Abstract Motivation Modern genomic data sets often involve multiple data-layers (e.g., DNA-sequence, gene expression), each of which itself can be high-dimensional. The biological processes underlying these data-layers can lead to intricate multivariate association patterns. Results We propose and evaluate two methods for analysis of variance when both input and output sets are high-dimensional. Our approach uses random effects models to estimate the proportion of variance of vectors in the linear span of the output set that can be explained by regression on the input set. We consider a method based on orthogonal basis (Eigen-ANOVA) and one that uses random vectors (Monte Carlo ANOVA, MC-ANOVA) in the linear span of the output set. We used simulations to assess the bias and variance of each of the methods, and to compare it with that of Partial Least Squares (PLS), an approach commonly used in multivariate high-dimensional regressions. The MC-ANOVA method gave nearly unbiased estimates in all the simulation scenarios considered. Estimates produced by Eigen-ANOVA and PLS had noticeable biases. Finally, we demonstrate insight that can be obtained with the use of MC-ANOVA and Eigen-ANOVA by applying these two methods to the study of multi-locus linkage disequilibrium in chicken genomes and to the assessment of inter-dependencies between gene expression, methylation and copy-number-variants in data from breast cancer tumors. Availability The Supplementary data includes an R-implementation of each of the proposed methods as well as the scripts used in simulations and in the real-data [email protected] Supplementary information Supplementary data are available at Bioinformatics online.


MetaTX: deciphering the distribution of mRNA-related features in the presence of isoform ambiguity, with applications in epitranscriptome analysis

Bioinformatics ◽

10.1093/bioinformatics/btaa938 ◽

2020 ◽

Author(s):

Yue Wang ◽

Kunqi Chen ◽

Zhen Wei ◽

Frans Coenen ◽

Jionglong Su ◽

...

Keyword(s):

Distribution Pattern ◽

Direct Methods ◽

Real Data ◽

R Package ◽

Supplementary Information ◽

Biological Features ◽

Statistical Framework ◽

Mrna Transcripts ◽

Functional Relevance ◽

Compositional Diversity

Abstract Motivation The distribution of biological features strongly indicates their functional relevance. Compared to DNA-related features, deciphering the distribution of mRNA-related features is non-trivial due to the existence of isoform ambiguity and compositional diversity of mRNAs. Results We propose here a rigorous statistical framework, MetaTX, for deciphering the distribution of mRNA-related features. Through a standardized mRNA model, MetaTX firstly unifies various mRNA transcripts of diverse compositions, and then corrects the isoform ambiguity by incorporating the overall distribution pattern of the features through an EM algorithm. MetaTX was tested on both simulated and real data. Results suggested that MetaTX substantially outperformed existing direct methods on simulated datasets, and that a more informative distribution pattern was produced for all the three datasets tested, which contain N6-Methyladenosine sites generated by different technologies. MetaTX should make a useful tool for studying the distribution and functions of mRNA-related biological features, especially for mRNA modifications such as N6-Methyladenosine. Availability and implementation The MetaTX R package is freely available at GitHub: https://github.com/yue-wang-biomath/MetaTX.1.0. Supplementary information Supplementary data are available at Bioinformatics online.
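
As a rough illustration of how an EM algorithm can resolve isoform ambiguity, the sketch below estimates isoform weights when each feature site is compatible with one or more isoforms. This is a generic mixture-proportion EM under simplifying assumptions, not the MetaTX model, which additionally uses a standardized mRNA model and the positional distribution of the features; `em_isoform_weights` and the toy input are hypothetical.

```python
def em_isoform_weights(compatible, n_isoforms, n_iter=50):
    """Estimate isoform weights from ambiguous site-to-isoform assignments.

    compatible[i] lists the isoform indices that site i could belong to.
    """
    theta = [1.0 / n_isoforms] * n_isoforms
    for _ in range(n_iter):
        expected_counts = [0.0] * n_isoforms
        # E-step: split each site across its compatible isoforms
        # in proportion to the current weights.
        for isoforms in compatible:
            norm = sum(theta[j] for j in isoforms)
            for j in isoforms:
                expected_counts[j] += theta[j] / norm
        # M-step: new weights are the normalized expected counts.
        theta = [count / len(compatible) for count in expected_counts]
    return theta


# Three sites map uniquely to isoform 0; one site is ambiguous between 0 and 1.
sites = [[0], [0], [0], [0, 1]]
weights = em_isoform_weights(sites, n_isoforms=2)
print(weights)
```

In this toy input the unambiguous sites pull the ambiguous one towards isoform 0, which is the basic mechanism by which overall evidence is used to correct ambiguous assignments.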


Statistical test of structured continuous trees based on discordance matrix

Bioinformatics ◽

10.1093/bioinformatics/btz425 ◽

2019 ◽

Vol 35 (23) ◽

pp. 4962-4970

Author(s):

Xiangqi Bai ◽

Liang Ma ◽

Lin Wan

Keyword(s):

Single Cell ◽

Cell Fate ◽

Matrix Theory ◽

Continuous Process ◽

Cell Types ◽

Supplementary Information ◽

High Dimensional ◽

Intrinsic Structure ◽

Statistical Framework ◽

Cell Data

Abstract Motivation Cell fate determination is a continuous process in which one cell type diversifies to other cell types following a hierarchical path. Advancements in single-cell technologies provide the opportunity to reveal the continuum of cell progression which forms a structured continuous tree (SCTree). Computational algorithms, which are usually based on a priori assumptions on the hidden structures, have previously been proposed as a means of recovering pseudo-trajectories along the cell differentiation process. However, there is still a lack of a statistical framework for assessing the intrinsic structure embedded in high-dimensional gene expression profiles. The inherent noise and cell-to-cell variation underlying single-cell data, however, pose grand challenges to testing even basic structures, such as linear versus bifurcation. Results In this study, we propose an adaptive statistical framework, termed SCTree, to test the intrinsic structure of a high-dimensional single-cell dataset. The SCTree test is conducted based on tools derived from metric geometry and random matrix theory. In brief, by extending the Gromov–Farris transform and utilizing the semicircular law, we formulate the continuous tree structure testing problem as a signal matrix detection problem. We show that the SCTree test is most powerful when the signal-to-noise ratio exceeds a moderate value. We also demonstrate that SCTree is able to robustly detect linear, single and multiple branching events with simulated datasets and real scRNA-seq datasets. Overall, the SCTree test provides a unified statistical assessment of the significance of the hidden structure of single-cell data. Availability and implementation SCTree software is available at https://github.com/XQBai/SCTree-test. Supplementary information Supplementary data are available at Bioinformatics online.


ShinySOM: graphical SOM-based analysis of single-cell cytometry data

Bioinformatics ◽

10.1093/bioinformatics/btaa091 ◽

2020 ◽

Vol 36 (10) ◽

pp. 3288-3289

Author(s):

Miroslav Kratochvíl ◽

David Bednárek ◽

Tomáš Sieger ◽

Karel Fišer ◽

Jiří Vondrášek

Keyword(s):

Single Cell ◽

High Throughput ◽

Statistical Information ◽

Supplementary Information ◽

High Dimensional ◽

Supplementary Data ◽

Mass Cytometry ◽

High Throughput Analysis ◽

Self Organizing Maps ◽

User Friendly

Abstract Summary ShinySOM offers a user-friendly interface for reproducible, high-throughput analysis of high-dimensional flow and mass cytometry data guided by self-organizing maps. The software implements a FlowSOM-style workflow, with improvements in performance, visualizations and data dissection possibilities. The outputs of the analysis include precise statistical information about the dissected samples, and R-compatible metadata useful for the batch processing of large sample volumes. Availability and implementation ShinySOM is free and open-source, available online at gitlab.com/exaexa/ShinySOM. Supplementary information Supplementary data are available at Bioinformatics online.


sepal: identifying transcript profiles with spatial patterns by diffusion-based modeling

Bioinformatics ◽

10.1093/bioinformatics/btab164 ◽

2021 ◽

Author(s):

Alma Andersson ◽

Joakim Lundeberg

Keyword(s):

Spatial Patterns ◽

Expression Profiles ◽

Synthetic Data ◽

Real Data ◽

Cell Types ◽

Statistical Hypothesis ◽

Supplementary Information ◽

Statistical Hypothesis Testing ◽

Transcriptomics Data ◽

Transcript Profiles

Abstract Motivation Collection of spatial signals in large numbers has become a routine task in multiple omics-fields, but parsing of these rich datasets still poses certain challenges. In whole or near-full transcriptome spatial techniques, spurious expression profiles are intermixed with those exhibiting an organized structure. To distinguish profiles with spatial patterns from the background noise, a metric that enables quantification of spatial structure is desirable. Current methods designed for similar purposes tend to be built around a framework of statistical hypothesis testing, hence we were compelled to explore a fundamentally different strategy. Results We propose an unexplored approach to analyze spatial transcriptomics data, simulating diffusion of individual transcripts to extract genes with spatial patterns. The method performed as expected when presented with synthetic data. When applied to real data, it identified genes with distinct spatial profiles, involved in key biological processes or characteristic for certain cell types. Compared to existing methods, ours seemed to be less informed by the genes’ expression levels and showed better time performance when run with multiple cores. Availability and implementation Open-source Python package with a command line interface (CLI), freely available at https://github.com/almaan/sepal under an MIT licence. A mirror of the GitHub repository can be found at Zenodo, doi: 10.5281/zenodo.4573237. Supplementary information Supplementary data are available at Bioinformatics online.
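
The intuition behind ranking profiles by simulated diffusion can be sketched on a small grid: a spatially organized pattern takes longer to diffuse to a uniform state than a finely scattered one. The code below is a plain explicit finite-difference simulation with no-flux boundaries written for illustration only; it does not use the sepal package or reproduce its convergence criterion, and `diffusion_time`, the step size, the tolerance and both toy grids are assumptions.

```python
def diffusion_time(grid, dt=0.1, tol=1e-3, max_steps=10000):
    """Count explicit diffusion steps until the grid is uniform to within tol.

    Neighbours outside the grid are skipped, which gives no-flux boundaries
    and keeps the grid mean constant.
    """
    rows, cols = len(grid), len(grid[0])
    u = [[float(v) for v in row] for row in grid]
    mean = sum(sum(row) for row in u) / (rows * cols)
    for step in range(max_steps + 1):
        if max(abs(v - mean) for row in u for v in row) < tol:
            return step
        nxt = [row[:] for row in u]
        for r in range(rows):
            for c in range(cols):
                laplacian = 0.0
                for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < rows and 0 <= cc < cols:
                        laplacian += u[rr][cc] - u[r][c]
                nxt[r][c] = u[r][c] + dt * laplacian
        u = nxt
    return max_steps


size = 8
# Organized pattern: expression confined to the left half of the grid.
structured = [[1 if c < size // 2 else 0 for c in range(size)] for r in range(size)]
# Scattered pattern: same total expression arranged as a checkerboard.
scattered = [[(r + c) % 2 for c in range(size)] for r in range(size)]

t_structured = diffusion_time(structured)
t_scattered = diffusion_time(scattered)
print(t_structured, t_scattered)
```

Genes could then be ranked by such a diffusion time, with longer times indicating more pronounced spatial structure; how the published method defines and normalizes its score is not covered by this sketch.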
