A sequential algorithm to detect diffusion switching along intracellular particle trajectories

Abstract Motivation Recent advances in molecular biology and fluorescence microscopy imaging have made possible the inference of the dynamics of single molecules in living cells. Changes of dynamics can occur along a trajectory. The issue is then to estimate the temporal change-points, that is, the times at which a change of dynamics occurs. The number of points in the trajectory required to detect such changes depends on both the magnitude and the type of the motion changes. Here, the number of points per trajectory is of the order of 10^2, even if in practice dramatic motion changes can be detected with fewer points. Results We propose a non-parametric procedure based on test statistics computed on local windows along the trajectory to detect the change-points. This algorithm controls the number of false change-point detections in the case where the trajectory is fully Brownian. We also develop a strategy for aggregating the detections obtained with different window sizes so that the window size is no longer a parameter to optimize. A Monte Carlo study is proposed to demonstrate the performance of the method and to compare the procedure with two competing algorithms. Finally, we illustrate the efficacy of the method on real data in 2D and 3D, depicting the motion of mRNA complexes (called mRNA-binding proteins) in neuronal dendrites, Galectin-3 endocytosis and trafficking within the cell. Availability and implementation A user-friendly Matlab package containing examples and the code of the simulations used in the paper is available at http://serpico.rennes.inria.fr/doku.php?id=software:cpanalysis:index. Supplementary information Supplementary data are available at Bioinformatics online.
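
The idea of computing a statistic on local windows along a trajectory can be illustrated with a minimal sketch. It is not the test statistic or the false-detection control of the paper: it simply compares the mean squared step length in a backward and a forward window of `k` steps and reports the index where the two differ most. The function names, the ratio statistic and the toy trajectory are all assumptions made for illustration.

```python
import math


def mean_sq_step(traj, start, stop):
    """Mean squared length of the steps traj[i] -> traj[i + 1], start <= i < stop."""
    sq = [
        sum((b - a) ** 2 for a, b in zip(traj[i], traj[i + 1]))
        for i in range(start, stop)
    ]
    return sum(sq) / len(sq)


def window_contrast(traj, k):
    """Return (index, forward/backward ratio) for every index with k steps on both sides.

    Assumes no window consists only of zero-length steps (otherwise the ratio
    is undefined).
    """
    return [
        (i, mean_sq_step(traj, i, i + k) / mean_sq_step(traj, i - k, i))
        for i in range(k, len(traj) - k)
    ]


def most_contrasted_index(traj, k):
    """Index whose ratio is farthest from 1 on a log scale (a candidate change-point)."""
    return max(window_contrast(traj, k), key=lambda item: abs(math.log(item[1])))[0]


def toy_trajectory(n_slow=50, n_fast=50, slow=1.0, fast=4.0):
    """Deterministic 2D toy path: n_slow short steps followed by n_fast long steps."""
    x, traj = 0.0, [(0.0, 0.0)]
    for j in range(n_slow + n_fast):
        size = slow if j < n_slow else fast
        x += size if j % 2 == 0 else -size  # alternate direction along x
        traj.append((x, 0.0))
    return traj


if __name__ == "__main__":
    print(most_contrasted_index(toy_trajectory(), k=20))
```

On real, noisy trajectories a single window size and a plain argmax are not enough, which is the gap the thresholding and window-aggregation steps described in the abstract are meant to close.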

CYBERTRACK2.0: zero-inflated model-based cell clustering and population tracking method for longitudinal mass cytometry data

Bioinformatics ◽

10.1093/bioinformatics/btaa873 ◽

2020 ◽

Author(s):

Kodai Minoura ◽

Ko Abe ◽

Yuka Maeda ◽

Hiroyoshi Nishikawa ◽

Teppei Shimamura

Keyword(s):

Real Data ◽

Change Points ◽

Supplementary Information ◽

High Dimensional ◽

Mass Cytometry ◽

Cell Population Dynamics ◽

Tracking Method ◽

Statistical Framework ◽

Track Dynamics ◽

Cell Clustering

Abstract Summary Recent advancements in high-dimensional single-cell technologies, such as mass cytometry, enable longitudinal experiments to track dynamics of cell populations and identify change points where the proportions vary significantly. However, current research is limited by the lack of tools specialized for analyzing longitudinal mass cytometry data. In order to infer cell population dynamics from such data, we developed a statistical framework named CYBERTRACK2.0. The framework’s analytic performance was validated against synthetic and real data, showing that its results are consistent with previous research. Availability and implementation CYBERTRACK2.0 is available at https://github.com/kodaim1115/CYBERTRACK2. Supplementary information Supplementary data are available at Bioinformatics online.

sepal: identifying transcript profiles with spatial patterns by diffusion-based modeling

Bioinformatics ◽

10.1093/bioinformatics/btab164 ◽

2021 ◽

Author(s):

Alma Andersson ◽

Joakim Lundeberg

Keyword(s):

Spatial Patterns ◽

Expression Profiles ◽

Synthetic Data ◽

Real Data ◽

Cell Types ◽

Statistical Hypothesis ◽

Supplementary Information ◽

Statistical Hypothesis Testing ◽

Transcriptomics Data ◽

Transcript Profiles

Abstract Motivation Collection of spatial signals in large numbers has become a routine task in multiple omics-fields, but parsing of these rich datasets still poses certain challenges. In whole or near-full transcriptome spatial techniques, spurious expression profiles are intermixed with those exhibiting an organized structure. To distinguish profiles with spatial patterns from the background noise, a metric that enables quantification of spatial structure is desirable. Current methods designed for similar purposes tend to be built around a framework of statistical hypothesis testing, hence we were compelled to explore a fundamentally different strategy. Results We propose an unexplored approach to analyze spatial transcriptomics data, simulating diffusion of individual transcripts to extract genes with spatial patterns. The method performed as expected when presented with synthetic data. When applied to real data, it identified genes with distinct spatial profiles, involved in key biological processes or characteristic of certain cell types. Compared to existing methods, ours seemed to be less informed by the genes’ expression levels and showed better time performance when run with multiple cores. Availability and implementation Open-source Python package with a command line interface (CLI), freely available at https://github.com/almaan/sepal under an MIT licence. A mirror of the GitHub repository can be found at Zenodo, doi: 10.5281/zenodo.4573237. Supplementary information Supplementary data are available at Bioinformatics online.
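
A rough intuition for diffusion-based scoring can be sketched as follows. This is not the sepal implementation or its API: it assumes a regular rectangular grid, no-flux boundaries, an explicit finite-difference update, and that the score is the number of steps needed for the profile to become nearly uniform. The function name `diffusion_time` and its parameters are hypothetical.

```python
def diffusion_time(grid, dt=0.2, tol=1e-3, max_steps=100_000):
    """Number of explicit diffusion steps until every cell is within tol of the mean.

    grid is a list of equal-length rows of floats. Each step moves a fraction dt
    of the difference between a cell and each of its existing 4-neighbours, so
    the total mass (and hence the mean) is conserved. dt <= 0.25 keeps the
    update stable. Returns max_steps if the tolerance is never reached.
    """
    n_rows, n_cols = len(grid), len(grid[0])
    u = [row[:] for row in grid]
    mean = sum(map(sum, u)) / (n_rows * n_cols)
    for step in range(max_steps + 1):
        if max(abs(v - mean) for row in u for v in row) < tol:
            return step
        new = [row[:] for row in u]
        for i in range(n_rows):
            for j in range(n_cols):
                for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    a, b = i + di, j + dj
                    if 0 <= a < n_rows and 0 <= b < n_cols:
                        new[i][j] += dt * (u[a][b] - u[i][j])
        u = new
    return max_steps


if __name__ == "__main__":
    # A large-scale pattern (left half on, right half off) versus a
    # fine-grained checkerboard on the same 10 x 10 grid.
    half = [[1.0 if j < 5 else 0.0 for j in range(10)] for _ in range(10)]
    checker = [[float((i + j) % 2) for j in range(10)] for i in range(10)]
    print(diffusion_time(half), diffusion_time(checker))
```

Under these assumptions a profile organised at large spatial scales takes longer to homogenise than one that varies from cell to cell, which is the kind of contrast a diffusion-based ranking can exploit.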

Detecting multiple generalized change-points by isolating single ones

Metrika ◽

10.1007/s00184-021-00821-6 ◽

2021 ◽

Author(s):

Andreas Anastasiou ◽

Piotr Fryzlewicz

Keyword(s):

Model Selection ◽

Linear Trend ◽

Information Criterion ◽

Change Points ◽

Isolation Technique ◽

New Approach ◽

Piecewise Constant ◽

Practical Performance ◽

R Packages ◽

The Times

Abstract We introduce a new approach, called Isolate-Detect (ID), for the consistent estimation of the number and location of multiple generalized change-points in noisy data sequences. Examples of signal changes that ID can deal with are changes in the mean of a piecewise-constant signal and changes, continuous or not, in the linear trend. The number of change-points can increase with the sample size. Our method is based on an isolation technique, which prevents the consideration of intervals that contain more than one change-point. This isolation enhances ID’s accuracy as it allows for detection in the presence of frequent changes of possibly small magnitudes. In ID, model selection is carried out via thresholding, or an information criterion, or SDLL, or a hybrid involving the former two. The hybrid model selection leads to a general method with very good practical performance and minimal parameter choice. In the scenarios tested, ID is at least as accurate as the state-of-the-art methods; most of the time it outperforms them. ID is implemented in the R packages IDetect and breakfast, available from CRAN.

CNV-BAC: Copy number Variation Detection in Bacterial Circular Genome

Bioinformatics ◽

10.1093/bioinformatics/btaa208 ◽

2020 ◽

Vol 36 (12) ◽

pp. 3890-3891

Author(s):

Linjie Wu ◽

Han Wang ◽

Yuchao Xia ◽

Ruibin Xi

Keyword(s):

Copy Number Variation ◽

Copy Number ◽

Genome Structure ◽

Real Data ◽

Read Depth ◽

Supplementary Information ◽

Circular Genome ◽

Number Variation ◽

Copy Number Variation Detection ◽

CNV Detection

Abstract Motivation Whole-genome sequencing (WGS) is widely used for copy number variation (CNV) detection. However, for most bacteria, their circular genome structure and high replication rate make reads more enriched near the replication origin. CNV detection based on read depth could be seriously influenced by such replication bias. Results We show that the replication bias is widespread using ∼200 bacterial WGS data. We develop CNV-BAC (CNV-Bacteria) that can properly normalize the replication bias and other known biases in bacterial WGS data and can accurately detect CNVs. Simulation and real data analysis show that CNV-BAC achieves the best performance in CNV detection compared with available algorithms. Availability and implementation CNV-BAC is available at https://github.com/XiDsLab/CNV-BAC. Supplementary information Supplementary data are available at Bioinformatics online.

Detection of differentially methylated CpG sites between tumor samples with uneven tumor purities

Bioinformatics ◽

10.1093/bioinformatics/btz885 ◽

2019 ◽

Vol 36 (7) ◽

pp. 2017-2024

Author(s):

Weiwei Zhang ◽

Ziyi Li ◽

Nana Wei ◽

Hua-Jun Wu ◽

Xiaoqi Zheng

Keyword(s):

Real Data ◽

R Package ◽

Differential Methylation ◽

Least Square ◽

Epigenetic Mechanism ◽

Supplementary Information ◽

CpG Sites ◽

Tumor Purity ◽

Different Sources ◽

Normal Controls

Abstract Motivation Inference of differentially methylated (DM) CpG sites between two groups of tumor samples with different geno- or pheno-types is a critical step to uncover the epigenetic mechanism of tumorigenesis and to identify biomarkers for cancer subtyping. However, as a major source of confounding, uneven distributions of tumor purity between two groups of tumor samples will lead to biased discovery of DM sites if not properly accounted for. Results We here propose InfiniumDM, a generalized least square model to adjust for the tumor purity effect in differential methylation analysis. Our method is applicable to a variety of experimental designs, including designs with or without normal controls and with different sources of normal tissue contamination. We compared our method with conventional methods including minfi, limma and limma corrected by tumor purity using simulated datasets. Our method shows significantly better performance across differential methylation thresholds, sample sizes and mean purity deviations. We also applied the proposed method to breast cancer samples from the TCGA database to further evaluate its performance. Overall, both simulation and real data analyses demonstrate favorable performance over existing methods serving a similar purpose. Availability and implementation InfiniumDM is a part of R package InfiniumPurify, which is freely available from GitHub (https://github.com/Xiaoqizheng/InfiniumPurify). Supplementary information Supplementary data are available at Bioinformatics online.

Co-localization analysis in fluorescence microscopy via maximum entropy copula

The International Journal of Biostatistics ◽

10.1515/ijb-2019-0019 ◽

2020 ◽

Vol 0 (0) ◽

Author(s):

Zahra Amini Farsani ◽

Volker J. Schmid

Keyword(s):

Fluorescence Microscopy ◽

Maximum Entropy ◽

Probability Distributions ◽

Real Data ◽

Bivariate Distribution ◽

Entropy Method ◽

Gaussian Copula ◽

Microscopy Imaging ◽

High Background ◽

Localization Analysis

Abstract Co-localization analysis is a popular method for quantitative analysis in fluorescence microscopy imaging. The localization of marked proteins in the cell nucleus allows a deep insight into biological processes in the nucleus. Several metrics have been developed for measuring the co-localization of two markers; however, they depend on subjective thresholding of background and the assumption of linearity. We propose a robust method to estimate the bivariate distribution function of two color channels. From this, we can quantify their co- or anti-colocalization. The proposed method is a combination of the Maximum Entropy Method (MEM) and a Gaussian Copula, which we call the Maximum Entropy Copula (MEC). This new method can measure the spatial and nonlinear correlation of signals to determine the marker colocalization in fluorescence microscopy images. The proposed method is compared with MEM for bivariate probability distributions. The new colocalization metric is validated on simulated and real data. The results show that MEC can determine co- and anti-colocalization even in high background settings. MEC can, therefore, be used as a robust tool for colocalization analysis.

Detecting common breaks in the means of high dimensional cross-dependent panels

Econometrics Journal ◽

10.1093/ectj/utab028 ◽

2021 ◽

Author(s):

Lajos Horváth ◽

Zhenya Liu ◽

Gregory Rice ◽

Yuqian Zhao

Keyword(s):

Panel Data ◽

Common Factors ◽

Real Data ◽

Change Points ◽

High Dimensional ◽

Asymptotic Results ◽

Cross Sectional ◽

Data Set ◽

Monte Carlo Simulation Study ◽

Cross Sectional Dependence

Abstract The problem of detecting change points in the mean of high dimensional panel data with potentially strong cross–sectional dependence is considered. Under the assumption that the cross–sectional dependence is captured by an unknown number of common factors, a new CUSUM type statistic is proposed. We derive its asymptotic properties under three scenarios depending on the extent to which the common factors are asymptotically dominant. With panel data consisting of N cross sectional time series of length T, the asymptotic results hold under the mild assumption that min {N, T} → ∞, with an otherwise arbitrary relationship between N and T, allowing the results to apply to most panel data examples. Bootstrap procedures are proposed to approximate the sampling distribution of the test statistics. A Monte Carlo simulation study showed that our test outperforms several other existing tests in finite samples in a number of cases, particularly when N is much larger than T. The practical application of the proposed results is demonstrated with real data applications to detecting and estimating change points in the high dimensional FRED-MD macroeconomic data set.
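
For readers unfamiliar with CUSUM-type statistics, the classical single-series version can be sketched in a few lines. This is the textbook statistic for one change in the mean of a single series, not the panel statistic with common factors proposed in the paper; the function names and the toy series are assumptions.

```python
def cusum_path(x):
    """Classical CUSUM path for one change in the mean of a series x_1..x_T.

    Entry k - 1 holds |S_k - (k / T) * S_T| / sqrt(T) for k = 1..T-1,
    where S_k is the sum of the first k observations.
    """
    T = len(x)
    total = sum(x)
    path, running = [], 0.0
    for k in range(1, T):
        running += x[k - 1]
        path.append(abs(running - k / T * total) / T ** 0.5)
    return path


def estimate_break(x):
    """Return (k_hat, statistic): the first k_hat observations are estimated to
    precede the change, and statistic is the maximum of the CUSUM path."""
    path = cusum_path(x)
    k_hat = max(range(len(path)), key=path.__getitem__) + 1
    return k_hat, path[k_hat - 1]


if __name__ == "__main__":
    series = [0.0] * 30 + [1.0] * 20  # noiseless toy series, mean shifts after t = 30
    print(estimate_break(series))
```

In practice the maximum of the path is compared with a critical value (asymptotic or bootstrap) before a change is declared; the sketch leaves that step out.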

Robust partial reference-free cell composition estimation from tissue expression

Bioinformatics ◽

10.1093/bioinformatics/btaa184 ◽

2020 ◽

Vol 36 (11) ◽

pp. 3431-3438

Author(s):

Ziyi Li ◽

Zhenxing Guo ◽

Ying Cheng ◽

Peng Jin ◽

Hao Wu

Keyword(s):

Expression Profiles ◽

Gene Expression Profiles ◽

Real Data ◽

Estimation Procedure ◽

Free Cell ◽

Biological Information ◽

Supplementary Information ◽

Tissue Samples ◽

Cell Composition ◽

Heterogeneous Tissues

Abstract Motivation In the analysis of high-throughput omics data from tissue samples, estimating and accounting for cell composition have been recognized as important steps. High cost, intensive labor requirements and technical limitations hinder the cell composition quantification using cell-sorting or single-cell technologies. Computational methods for cell composition estimation are available, but they are either limited by the availability of a reference panel or suffer from low accuracy. Results We introduce TOols for the Analysis of heterogeneouS Tissues TOAST/-P and TOAST/+P, two partial reference-free algorithms for estimating cell composition of heterogeneous tissues based on their gene expression profiles. TOAST/-P and TOAST/+P incorporate additional biological information, including cell-type-specific markers and prior knowledge of compositions, in the estimation procedure. Extensive simulation studies and real data analyses demonstrate that the proposed methods provide more accurate and robust cell composition estimation than existing methods. Availability and implementation The proposed methods TOAST/-P and TOAST/+P are implemented as part of the R/Bioconductor package TOAST at https://bioconductor.org/packages/TOAST. Supplementary information Supplementary data are available at Bioinformatics online.

Efficient Change-Points Detection For Genomic Sequences Via Cumulative Segmented Regression

Bioinformatics ◽

10.1093/bioinformatics/btab685 ◽

2021 ◽

Author(s):

Shengji Jia ◽

Lei Shi

Keyword(s):

Change Point ◽

Serial Correlation ◽

Copy Number Variations ◽

Change Points ◽

Supplementary Information ◽

Genomic Sequences ◽

Segmented Regression ◽

Computationally Efficient ◽

R Program ◽

Point Estimator

Abstract Motivation Knowing the number and the exact locations of multiple change points in genomic sequences serves several biological needs. The cumulative segmented algorithm (cumSeg) has recently been proposed as a computationally efficient approach for multiple change-point detection, which is based on a simple transformation of the data and provides results quite robust to model mis-specifications. However, the errors are also accumulated in the transformed model so that heteroscedasticity and serial correlation will show up, and thus the variations of the estimated change points will be quite different, while the locations of the change points should be of the same importance in the original genomic sequences. Results In this study, we develop two new change-point detection procedures in the framework of cumulative segmented regression. Simulations reveal that the proposed methods not only improve the efficiency of each change point estimator substantially but also provide the estimators with similar variations for all the change points. By applying these proposed algorithms to Coriell and SNP genotyping data, we illustrate their performance on detecting copy number variations. Supplementary information The proposed algorithms are implemented in an R program and are available at Bioinformatics online.
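
The accumulation of errors mentioned above can be made explicit with a short derivation. It assumes the usual piecewise-constant mean model with K change points at integer locations and independent errors of common variance; the symbols below are introduced here for illustration and are not taken from the abstract.

```latex
% Piecewise-constant mean with K change points \tau_1 < \dots < \tau_K (assumed model)
y_i = \mu + \sum_{k=1}^{K} \delta_k \,\mathbf{1}(i > \tau_k) + \varepsilon_i,
\qquad i = 1, \dots, n, \quad \varepsilon_i \ \text{i.i.d.},\ \operatorname{Var}(\varepsilon_i) = \sigma^2 .

% Cumulative sums turn jumps in the mean into changes of slope (segmented regression)
z_i = \sum_{j=1}^{i} y_j
    = \mu\, i + \sum_{k=1}^{K} \delta_k \,(i - \tau_k)_+ + e_i,
\qquad e_i = \sum_{j=1}^{i} \varepsilon_j .

% The accumulated errors are heteroscedastic and serially correlated
\operatorname{Var}(e_i) = i\,\sigma^2,
\qquad
\operatorname{Cov}(e_i, e_l) = \min(i, l)\,\sigma^2 .
```

Because the error variance grows with i, change points located later in the sequence are estimated from noisier transformed data, which is consistent with the unequal variability of the estimators described in the abstract.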

Block HSIC Lasso: model-free biomarker detection for ultra-high dimensional data

Bioinformatics ◽

10.1093/bioinformatics/btz333 ◽

2019 ◽

Vol 35 (14) ◽

pp. i427-i435 ◽

Cited By ~ 3

Author(s):

Héctor Climente-González ◽

Chloé-Agathe Azencott ◽

Samuel Kaski ◽

Makoto Yamada

Keyword(s):

Single Cell ◽

RNA Sequencing ◽

Association Studies ◽

Real Data ◽

Supplementary Information ◽

Genome Wide Association Studies ◽

Model Free ◽

Computational Overhead ◽

Single Cell RNA Sequencing ◽

Non Linear

Abstract Motivation Finding non-linear relationships between biomolecules and a biological outcome is computationally expensive and statistically challenging. Existing methods have important drawbacks, including among others lack of parsimony, non-convexity and computational overhead. Here we propose block HSIC Lasso, a non-linear feature selector that does not present the previous drawbacks. Results We compare block HSIC Lasso to other state-of-the-art feature selection techniques in both synthetic and real data, including experiments over three common types of genomic data: gene-expression microarrays, single-cell RNA sequencing and genome-wide association studies. In all cases, we observe that features selected by block HSIC Lasso retain more information about the underlying biology than those selected by other techniques. As a proof of concept, we applied block HSIC Lasso to a single-cell RNA sequencing experiment on mouse hippocampus. We discovered that many genes linked in the past to brain development and function are involved in the biological differences between the types of neurons. Availability and implementation Block HSIC Lasso is implemented in the Python 2/3 package pyHSICLasso, available on PyPI. Source code is available on GitHub (https://github.com/riken-aip/pyHSICLasso). Supplementary information Supplementary data are available at Bioinformatics online.
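
The dependence measure behind the method's name, the Hilbert-Schmidt Independence Criterion (HSIC), can be computed empirically from two centred Gram matrices. The sketch below shows only that measure, with a Gaussian kernel and a 1/n^2 normalisation chosen as assumptions; it is neither the block HSIC Lasso solver nor the pyHSICLasso API.

```python
import math


def gaussian_gram(values, sigma=1.0):
    """Gram matrix K[i][j] = exp(-(v_i - v_j)^2 / (2 sigma^2)) for scalar samples."""
    return [
        [math.exp(-((a - b) ** 2) / (2 * sigma ** 2)) for b in values]
        for a in values
    ]


def center(K):
    """Double-centre a Gram matrix, i.e. compute H K H with H = I - (1/n) 11^T."""
    n = len(K)
    row = [sum(r) / n for r in K]
    col = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    tot = sum(row) / n
    return [[K[i][j] - row[i] - col[j] + tot for j in range(n)] for i in range(n)]


def hsic(x, y, sigma=1.0):
    """Empirical HSIC, trace(K H L H) / n^2, written as an entrywise sum of the
    two centred (symmetric) Gram matrices."""
    n = len(x)
    Kc = center(gaussian_gram(x, sigma))
    Lc = center(gaussian_gram(y, sigma))
    return sum(Kc[i][j] * Lc[i][j] for i in range(n) for j in range(n)) / n ** 2


if __name__ == "__main__":
    x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
    print(hsic(x, x))          # strongly dependent pair
    print(hsic(x, [2.0] * 6))  # constant outcome carries no information
```

A feature selector built on this measure scores each feature by how much of the outcome's centred Gram matrix it helps explain, so non-linear associations contribute as well as linear ones.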
