A distance based multisample test for high-dimensional compositional data with applications to the human microbiome

2020 ◽  
Vol 21 (S9) ◽  
Author(s):  
Qingyang Zhang ◽  
Thy Dao

Abstract Background: Compositional data refer to data that lie on a simplex; such data are common in many scientific domains, including genomics, geology, and economics. Because the components of a composition must sum to one, traditional tests designed for unconstrained data are inappropriate, and new statistical methods are needed for this special type of data. Results: In this paper, we consider the general problem of testing for compositional differences between K populations. Motivated by microbiome and metagenomics studies, where the data are often over-dispersed and high-dimensional, we formulate a well-posed hypothesis from a Bayesian point of view and propose a nonparametric test based on inter-point distances to evaluate statistical significance. Unlike most existing tests for compositional data, our method does not rely on any data transformation, sparsity assumption, or regularity conditions on the covariance matrix, but analyzes the compositions directly. Simulated data and two real data sets on the human microbiome are used to illustrate the promise of our method. Conclusions: Our simulation studies and real data applications demonstrate that the proposed test is more sensitive to compositional differences than the mean-based method, especially when the data are over-dispersed or zero-inflated. The proposed test is easy to implement and computationally efficient, facilitating its application to large-scale datasets.
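As a concrete illustration of the inter-point distance idea, here is a minimal sketch of a distance-based two-sample permutation test applied directly to compositions. It uses an energy-distance statistic and toy Dirichlet data; this is an illustration of the general approach, not the authors' exact K-sample procedure.

```python
import numpy as np

def energy_stat(X, Y):
    """Energy-distance statistic between two samples of compositions.

    Uses Euclidean inter-point distances computed directly on the
    simplex, with no transformation of the compositions."""
    dxy = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2).mean()
    dxx = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2).mean()
    dyy = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=2).mean()
    return 2 * dxy - dxx - dyy

def distance_test(X, Y, n_perm=199, seed=0):
    """Permutation p-value for the compositional difference between X and Y."""
    rng = np.random.default_rng(seed)
    obs = energy_stat(X, Y)
    pooled = np.vstack([X, Y])
    n = len(X)
    count = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))
        if energy_stat(pooled[idx[:n]], pooled[idx[n:]]) >= obs:
            count += 1
    return (count + 1) / (n_perm + 1)

# Toy example: two Dirichlet samples with different concentration parameters.
rng = np.random.default_rng(1)
X = rng.dirichlet([2, 2, 2], size=20)
Y = rng.dirichlet([8, 1, 1], size=20)
p = distance_test(X, Y)
```

Because the statistic works on raw inter-point distances, zero components need no imputation, which mirrors the abstract's point about avoiding data transformations.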

Genetics ◽  
2003 ◽  
Vol 165 (4) ◽  
pp. 2269-2282
Author(s):  
D Mester ◽  
Y Ronin ◽  
D Minkov ◽  
E Nevo ◽  
A Korol

Abstract This article is devoted to the problem of ordering in linkage groups with many dozens or even hundreds of markers. The ordering problem belongs to the field of discrete optimization on a set of all possible orders, amounting to n!/2 for n loci; hence it is considered an NP-hard problem. Several authors attempted to employ the methods developed in the well-known traveling salesman problem (TSP) for multilocus ordering, using the assumption that for a set of linked loci the true order will be the one that minimizes the total length of the linkage group. A novel, fast, and reliable algorithm developed for the TSP and based on evolution-strategy discrete optimization was applied in this study for multilocus ordering on the basis of pairwise recombination frequencies. The quality of derived maps under various complications (dominant vs. codominant markers, marker misclassification, negative and positive interference, and missing data) was analyzed using simulated data with ∼50-400 markers. High performance of the employed algorithm allows systematic treatment of the problem of verification of the obtained multilocus orders on the basis of computing-intensive bootstrap and/or jackknife approaches for detecting and removing questionable marker scores, thereby stabilizing the resulting maps. Parallel calculation technology can easily be adopted for further acceleration of the proposed algorithm. Real data analysis (on maize chromosome 1 with 230 markers) is provided to illustrate the proposed methodology.
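The TSP framing above can be illustrated with a small sketch: markers are cities, pairwise recombination frequencies are distances, and the order minimizing total map length is sought. The code below uses nearest-neighbour seeding plus 2-opt refinement as a simple stand-in for the evolution-strategy optimizer described in the abstract; the toy data are illustrative.

```python
import numpy as np

def map_length(order, rf):
    """Total map length of a linkage group under a given marker order."""
    return sum(rf[order[i], order[i + 1]] for i in range(len(order) - 1))

def order_markers(rf):
    """Greedy nearest-neighbour construction followed by 2-opt refinement."""
    n = rf.shape[0]
    order = [0]
    unvisited = set(range(1, n))
    while unvisited:
        # Extend the path with the closest unplaced marker.
        nxt = min(unvisited, key=lambda m: rf[order[-1], m])
        order.append(nxt)
        unvisited.remove(nxt)
    improved = True
    while improved:
        improved = False
        for i in range(n - 1):
            for j in range(i + 2, n + 1):
                # Reverse a segment if that shortens the map.
                cand = order[:i] + order[i:j][::-1] + order[j:]
                if map_length(cand, rf) < map_length(order, rf) - 1e-12:
                    order, improved = cand, True
    return order

# Toy linkage group: true order is 0..7 and recombination frequency
# grows linearly with map distance.
pos = np.arange(8)
rf = 0.05 * np.abs(pos[:, None] - pos[None, :])
order = order_markers(rf)
```

On real data the distance matrix would come from estimated pairwise recombination frequencies, and the bootstrap/jackknife verification the abstract mentions would rerun this search on resampled data.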


Biometrika ◽  
2021 ◽  
Author(s):  
Pixu Shi ◽  
Yuchen Zhou ◽  
Anru R Zhang

Abstract In microbiome and genomic studies, regression of compositional data has been a crucial tool for identifying microbial taxa or genes associated with clinical phenotypes. To account for variation in sequencing depth, the classic log-contrast model is often used, with read counts normalized into compositions. However, zero read counts and randomness in the covariates remain critical issues. In this article, we introduce a surprisingly simple, interpretable, and efficient method for estimating compositional data regression through the lens of a novel high-dimensional log-error-in-variable regression model. The proposed method corrects for possible overdispersion in the sequencing data while avoiding any subjective imputation of zero read counts. We provide theoretical justification with matching upper and lower bounds on the estimation error. The merit of the procedure is illustrated through real data analysis and simulation studies.
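For context, here is a minimal sketch of the classic preprocessing that the abstract contrasts against: read counts get a pseudocount (exactly the kind of subjective zero imputation the proposed model avoids), are normalized into compositions, and are mapped to log-ratios before a least-squares fit. All names and data here are illustrative.

```python
import numpy as np

def alr_design(counts, pseudo=0.5):
    """Additive log-ratio transform of read counts, using a pseudocount
    to handle zeros (the naive fix the paper's model is designed to avoid)."""
    shifted = counts + pseudo
    comp = shifted / shifted.sum(axis=1, keepdims=True)  # normalize to compositions
    return np.log(comp[:, :-1] / comp[:, -1:])           # log-ratios vs last taxon

# Naive baseline: ordinary least squares on the log-ratio design.
rng = np.random.default_rng(0)
counts = rng.poisson(20, size=(50, 4))
Z = alr_design(counts)
y = 2 * Z[:, 0] - Z[:, 1] + rng.normal(0, 0.01, 50)
X = np.column_stack([np.ones(50), Z])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
```

The pseudocount and the treatment of the noisy design Z as fixed are precisely the two issues (zero imputation and covariate randomness) that the log-error-in-variable model addresses.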


2019 ◽  
Vol 09 (04) ◽  
pp. 2050017
Author(s):  
Zhiqiang Jiang ◽  
Zhensheng Huang ◽  
Guoliang Fan

This paper considers empirical likelihood inference for a high-dimensional partially functional linear model. An empirical log-likelihood ratio statistic is constructed for the regression coefficients of non-functional predictors and proved to be asymptotically normally distributed under some regularity conditions. Moreover, maximum empirical likelihood estimators of the regression coefficients of non-functional predictors are proposed and their asymptotic properties are obtained. Simulation studies are conducted to demonstrate the performance of the proposed procedure and a real data set is analyzed for illustration.
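Empirical likelihood itself can be sketched in its simplest setting, a scalar mean, rather than the partially functional linear model of the paper. The function below solves the Lagrange multiplier equation by bisection; under standard regularity conditions a Wilks-type limit applies to the resulting statistic.

```python
import numpy as np

def el_log_ratio(x, mu):
    """-2 log empirical likelihood ratio for a scalar mean mu."""
    z = x - mu
    if z.min() >= 0 or z.max() <= 0:
        return np.inf  # mu lies outside the convex hull of the data
    # The multiplier lambda must keep every weight 1 + lambda*z_i positive.
    lo = (-1 + 1e-10) / z.max()
    hi = (-1 + 1e-10) / z.min()

    def score(lam):
        # Derivative condition: sum of z_i / (1 + lam*z_i) = 0 at the optimum.
        return np.sum(z / (1 + lam * z))

    for _ in range(200):  # bisection; score is strictly decreasing in lam
        mid = (lo + hi) / 2
        if score(mid) > 0:
            lo = mid
        else:
            hi = mid
    lam = (lo + hi) / 2
    return 2 * np.sum(np.log(1 + lam * z))

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 200)
stat0 = el_log_ratio(x, x.mean())        # ratio at the sample mean: ~0
stat1 = el_log_ratio(x, x.mean() + 0.5)  # shifted mean: large statistic
```

The paper's statistic plays the same role for the non-functional regression coefficients, with the profile taken over the functional component as well.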


2015 ◽  
Vol 2015 ◽  
pp. 1-12 ◽  
Author(s):  
Yu-Chen Zhang ◽  
Shao-Wu Zhang ◽  
Lian Liu ◽  
Hui Liu ◽  
Lin Zhang ◽  
...  

With the development of new sequencing technology, the entire N6-methyl-adenosine (m6A) RNA methylome can now be profiled without bias using the methylated RNA immunoprecipitation sequencing technique (MeRIP-Seq), making it possible to detect differential methylation states of RNA between two conditions, for example, between normal and cancerous tissue. However, as an affinity-based method, MeRIP-Seq has yet to provide base-pair resolution; that is, a single methylation site determined from MeRIP-Seq data can in practice contain multiple RNA methylation residues, some of which may be regulated by different enzymes and thus differentially methylated between the two conditions. Since existing peak-based methods cannot effectively differentiate multiple methylation residues located within a single methylation site, we propose a hidden Markov model (HMM) based approach to address this issue. Specifically, each detected RNA methylation site is divided into multiple adjacent small bins, which are then scanned at higher resolution with a hidden Markov model that captures the dependency between spatially adjacent bins for improved accuracy. We tested the proposed algorithm on both simulated and real data. The results suggest that the proposed algorithm clearly outperforms existing peak-based approaches on simulated systems and detects differential methylation regions with higher statistical significance on the real dataset.
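The bin-and-scan idea can be sketched with a two-state Viterbi decoder: a detected site is split into bins, each bin carries a log evidence score for being differentially methylated, and sticky transition probabilities model the dependency between adjacent bins. This is a generic HMM sketch with toy scores, not the authors' implementation.

```python
import numpy as np

def viterbi(log_emis, log_trans, log_init):
    """Most likely state path given per-bin log emission scores."""
    n, k = log_emis.shape
    dp = log_init + log_emis[0]
    back = np.zeros((n, k), dtype=int)
    for t in range(1, n):
        scores = dp[:, None] + log_trans   # scores[i, j]: end in i, move to j
        back[t] = scores.argmax(axis=0)
        dp = scores.max(axis=0) + log_emis[t]
    path = [int(dp.argmax())]
    for t in range(n - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# Per-bin log evidence for "differential" (state 1) vs "background" (state 0).
e = np.array([-2.0, -2.0, 3.0, -0.5, 3.0, 3.0, -2.0, -2.0])
log_emis = np.column_stack([np.zeros(8), e])
log_trans = np.log(np.array([[0.9, 0.1], [0.1, 0.9]]))  # sticky transitions
log_init = np.log(np.array([0.5, 0.5]))
path = viterbi(log_emis, log_trans, log_init)
```

The sticky transitions keep the weakly scored fourth bin inside the differential segment, which a bin-by-bin threshold would split, illustrating why modelling adjacent-bin dependency improves accuracy.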


2003 ◽  
Vol 01 (01) ◽  
pp. 41-69 ◽  
Author(s):  
JING LI ◽  
TAO JIANG

We study haplotype reconstruction under the Mendelian law of inheritance and the minimum recombination principle on pedigree data. We prove that the problem of finding a minimum-recombinant haplotype configuration (MRHC) is in general NP-hard; to our knowledge, this is the first complexity result concerning the problem. An iterative algorithm based on blocks of consecutive resolved marker loci (called block-extension) is proposed. It is very efficient and can be used for large pedigrees with many markers, especially for data sets requiring few recombinants (or recombination events). A polynomial-time exact algorithm for haplotype reconstruction without recombinants is also presented. This algorithm first identifies all the necessary constraints based on the Mendelian law and the zero-recombinant assumption, and represents them using a system of linear equations over the cyclic group Z2. Using a simple method based on Gaussian elimination, we can obtain all feasible haplotype configurations. A C++ implementation of the block-extension algorithm, called PedPhase, has been tested on both simulated and real data. The results show that the program performs very well on both types of data and will be useful for large-scale haplotype inference projects.
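The zero-recombinant step reduces to solving a linear system over Z2, which a generic Gaussian-elimination sketch illustrates. This is not the PedPhase code; the constraints below are toy data, and free variables enumerate the full solution set in the same way the abstract describes.

```python
import numpy as np

def solve_gf2(A, b):
    """Gaussian elimination over Z2: returns one solution of Ax = b (mod 2),
    or None if the constraint system is inconsistent."""
    A = A.copy() % 2
    b = b.copy() % 2
    m, n = A.shape
    pivots = []
    row = 0
    for col in range(n):
        sel = next((r for r in range(row, m) if A[r, col]), None)
        if sel is None:
            continue  # no pivot in this column: the variable is free
        A[[row, sel]] = A[[sel, row]]
        b[[row, sel]] = b[[sel, row]]
        for r in range(m):
            if r != row and A[r, col]:
                A[r] ^= A[row]  # XOR row reduction = addition mod 2
                b[r] ^= b[row]
        pivots.append(col)
        row += 1
    if any(b[row:]):
        return None  # 0 = 1 row: the zero-recombinant constraints conflict
    x = np.zeros(n, dtype=int)
    for r, col in enumerate(pivots):
        x[col] = b[r]  # free variables stay 0; each choice gives a configuration
    return x

# Toy constraint system over Z2.
A = np.array([[1, 1, 0], [0, 1, 1]])
b = np.array([1, 0])
x = solve_gf2(A, b)
```

Each assignment of the free (non-pivot) variables yields one feasible haplotype configuration, so the elimination enumerates all of them at polynomial cost.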


2018 ◽  
Author(s):  
Emily L. Mackevicius ◽  
Andrew H. Bahle ◽  
Alex H. Williams ◽  
Shijie Gu ◽  
Natalia I. Denissenko ◽  
...  

Abstract Identifying low-dimensional features that describe large-scale neural recordings is a major challenge in neuroscience. Repeated temporal patterns (sequences) are thought to be a salient feature of neural dynamics, but are not succinctly captured by traditional dimensionality reduction techniques. Here we describe a software toolbox, called seqNMF, with new methods for extracting informative, non-redundant sequences from high-dimensional neural data, testing the significance of these extracted patterns, and assessing the prevalence of sequential structure in data. We test these methods on simulated data under multiple noise conditions, and on several real neural and behavioral data sets. In hippocampal data, seqNMF identifies neural sequences that match those calculated manually by reference to behavioral events. In songbird data, seqNMF discovers neural sequences in untutored birds that lack stereotyped songs. Thus, by identifying temporal structure directly from neural data, seqNMF enables dissection of complex neural circuits without relying on temporal references from stimuli or behavioral outputs.


2019 ◽  
Vol 29 (3) ◽  
pp. 765-777 ◽  
Author(s):  
Giovanna Cilluffo ◽  
Gianluca Sottile ◽  
Stefania La Grutta ◽  
Vito MR Muggeo

This paper focuses on hypothesis testing in lasso regression, when one is interested in judging the statistical significance of the regression coefficients in a regression equation involving many covariates. To obtain reliable p-values, we propose a new lasso-type estimator relying on the idea of induced smoothing, which makes it relatively easy to obtain an appropriate covariance matrix and Wald statistic. Simulation experiments reveal that our approach performs well when contrasted with recent inferential tools in the lasso framework. Two real data analyses are presented to illustrate the proposed framework in practice.
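The induced-smoothing idea can be sketched by replacing the absolute value in the lasso penalty with a smooth surrogate, fitting by iterative reweighting, and reading Wald statistics off a sandwich covariance. This is a minimal illustration under those assumptions, not the authors' exact estimator.

```python
import numpy as np

def smoothed_lasso(X, y, lam=1.0, h=0.01, n_iter=200):
    """Lasso-type fit with |b| replaced by the smooth surrogate
    sqrt(b^2 + h^2), solved by iterative reweighting, plus a sandwich
    covariance that yields Wald statistics."""
    XtX, Xty = X.T @ X, X.T @ y
    beta = np.linalg.solve(XtX + lam * np.eye(X.shape[1]), Xty)  # ridge start
    for _ in range(n_iter):
        # Weight from the gradient of the smoothed penalty.
        D = np.diag(lam / np.sqrt(beta ** 2 + h ** 2))
        beta = np.linalg.solve(XtX + D, Xty)
    M_inv = np.linalg.inv(XtX + np.diag(lam / np.sqrt(beta ** 2 + h ** 2)))
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    cov = sigma2 * M_inv @ XtX @ M_inv        # sandwich covariance
    wald = beta / np.sqrt(np.diag(cov))
    return beta, wald

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
beta_true = np.array([2.0, 0.0, 0.0, 1.0, 0.0])
y = X @ beta_true + rng.normal(0, 0.1, 200)
beta, wald = smoothed_lasso(X, y)
```

Because the surrogate penalty is differentiable everywhere, the covariance matrix is well defined even at coefficients near zero, which is the practical payoff of smoothing for inference.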


eLife ◽  
2019 ◽  
Vol 8 ◽  
Author(s):  
Emily L Mackevicius ◽  
Andrew H Bahle ◽  
Alex H Williams ◽  
Shijie Gu ◽  
Natalia I Denisenko ◽  
...  

Identifying low-dimensional features that describe large-scale neural recordings is a major challenge in neuroscience. Repeated temporal patterns (sequences) are thought to be a salient feature of neural dynamics, but are not succinctly captured by traditional dimensionality reduction techniques. Here, we describe a software toolbox, called seqNMF, with new methods for extracting informative, non-redundant sequences from high-dimensional neural data, testing the significance of these extracted patterns, and assessing the prevalence of sequential structure in data. We test these methods on simulated data under multiple noise conditions, and on several real neural and behavioral data sets. In hippocampal data, seqNMF identifies neural sequences that match those calculated manually by reference to behavioral events. In songbird data, seqNMF discovers neural sequences in untutored birds that lack stereotyped songs. Thus, by identifying temporal structure directly from neural data, seqNMF enables dissection of complex neural circuits without relying on temporal references from stimuli or behavioral outputs.
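seqNMF builds on convolutional NMF, which in turn generalizes ordinary non-negative matrix factorization by giving each factor a temporal extent. For orientation, here is the underlying multiplicative-update NMF in a few lines; this is a simplified non-convolutive baseline on toy data, not the seqNMF toolbox itself.

```python
import numpy as np

def nmf(X, k, n_iter=500, seed=0):
    """Multiplicative-update NMF: X (neurons x time) is approximated by
    W (neurons x k) times H (k x time), with all factors non-negative."""
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], k))
    H = rng.random((k, X.shape[1]))
    eps = 1e-9  # guards against division by zero
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy data with exact non-negative rank 2.
rng = np.random.default_rng(1)
W_true = rng.random((10, 2))
H_true = rng.random((2, 40))
X = W_true @ H_true
W, H = nmf(X, k=2)
rel_err = np.linalg.norm(X - W @ H) / np.linalg.norm(X)
```

In the convolutional extension each column of W becomes a short spatiotemporal template, which is how repeated neural sequences, rather than instantaneous patterns, are captured.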


2021 ◽  
Vol 12 ◽  
Author(s):  
Jin Zhang ◽  
Min Chen ◽  
Yangjun Wen ◽  
Yin Zhang ◽  
Yunan Lu ◽  
...  

The mixed linear model (MLM) has been widely used in genome-wide association studies (GWAS) to dissect quantitative traits in human, animal, and plant genetics. Most methodologies treat all single nucleotide polymorphism (SNP) effects as random effects under the MLM framework, which fails to detect the joint minor effects of multiple genetic markers on a trait. Therefore, polygenes with minor effects remain largely unexplored in today's big data era. In this study, we developed a new algorithm under the MLM framework, called the fast multi-locus ridge regression (FastRR) algorithm. The FastRR algorithm first whitens the covariance matrix of the polygenic matrix K and the environmental noise, then selects potentially related SNPs among large-scale markers that have a high correlation with the target trait, and finally analyzes the subset of variables using multi-locus deshrinking ridge regression for true quantitative trait nucleotide (QTN) detection. Results from analyses of both simulated and real data show that the FastRR algorithm is more powerful for detecting both large and small QTNs, more accurate in QTN effect estimation, and more stable under various polygenic backgrounds. Moreover, compared with existing methods, the FastRR algorithm has the advantage of high computing speed. In conclusion, the FastRR algorithm provides an alternative for multi-locus GWAS in high-dimensional genomic datasets.
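The screen-then-shrink structure described above can be sketched as marginal-correlation screening followed by ridge regression on the retained SNPs. The whitening and deshrinking steps of FastRR are omitted here; names, thresholds, and toy genotype data are illustrative.

```python
import numpy as np

def screen_then_ridge(X, y, top=20, lam=1.0):
    """Keep the `top` SNPs most correlated with the trait, then fit
    ridge regression on that subset."""
    Xc = (X - X.mean(0)) / X.std(0)          # standardize genotypes
    yc = y - y.mean()
    corr = np.abs(Xc.T @ yc) / (len(y) * yc.std())  # marginal correlations
    keep = np.argsort(corr)[::-1][:top]
    Xs = Xc[:, keep]
    beta = np.linalg.solve(Xs.T @ Xs + lam * np.eye(top), Xs.T @ yc)
    return keep, beta

# Toy GWAS: 200 individuals, 500 SNPs coded 0/1/2, two causal loci.
rng = np.random.default_rng(2)
X = rng.binomial(2, 0.3, size=(200, 500)).astype(float)
y = X[:, 3] - X[:, 7] + rng.normal(0, 0.5, 200)
keep, beta = screen_then_ridge(X, y)
```

Screening cuts the problem from all markers to a small candidate set before the multi-locus fit, which is the main source of the speed advantage the abstract emphasizes.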


PLoS ONE ◽  
2021 ◽  
Vol 16 (2) ◽  
pp. e0246159
Author(s):  
Rahi Jain ◽  
Wei Xu

Feature selection on high-dimensional data, along with interaction effects, is a critical challenge for classical statistical learning techniques. Existing feature selection algorithms such as random LASSO leverage the ability of LASSO to handle high-dimensional data. However, the technique has two main limitations: the inability to consider interaction terms and the lack of a statistical test for determining the significance of selected features. This study proposes the High Dimensional Selection with Interactions (HDSI) algorithm, a new feature selection method that can handle high-dimensional data, incorporate interaction terms, provide statistical inference for the selected features, and leverage the capability of existing classical statistical techniques. The method allows the application of any statistical technique, such as LASSO or subset selection, to multiple bootstrapped samples, each containing a randomly selected subset of features. Each bootstrap sample incorporates interaction terms for the randomly sampled features. The features selected from each model are pooled and their statistical significance is determined. The statistically significant features are used as the final output of the approach, and their final coefficients are estimated using appropriate statistical techniques. The performance of HDSI is evaluated on both simulated data and real studies. In general, HDSI outperforms commonly used algorithms such as LASSO, subset selection, adaptive LASSO, random LASSO, and group LASSO.
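The pipeline described above can be sketched as bootstrap resampling with random feature subsets, pairwise interaction terms, a per-model fit, and pooled significance votes. Plain least squares stands in for LASSO here, and the voting threshold and toy data are illustrative, not the HDSI defaults.

```python
import numpy as np
from itertools import combinations

def bootstrap_interaction_select(X, y, n_boot=50, n_feat=5, seed=0):
    """Each bootstrap sample draws a random feature subset, adds pairwise
    interactions, fits least squares, and votes for main-effect features
    whose |t|-like statistic exceeds 2. Returns per-feature vote rates."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    votes = np.zeros(p)
    draws = np.zeros(p)
    for _ in range(n_boot):
        rows = rng.integers(0, n, n)                       # bootstrap rows
        feats = rng.choice(p, size=n_feat, replace=False)  # random subset
        cols = [X[rows][:, f] for f in feats]
        cols += [X[rows][:, a] * X[rows][:, b] for a, b in combinations(feats, 2)]
        D = np.column_stack([np.ones(n)] + cols)
        beta, *_ = np.linalg.lstsq(D, y[rows], rcond=None)
        resid = y[rows] - D @ beta
        sigma2 = resid @ resid / max(n - D.shape[1], 1)
        cov = sigma2 * np.linalg.pinv(D.T @ D)
        t = beta / np.sqrt(np.diag(cov).clip(1e-12))
        for k, f in enumerate(feats, start=1):  # tally main-effect terms only
            draws[f] += 1
            if abs(t[k]) > 2:
                votes[f] += 1
    return votes / np.maximum(draws, 1)

rng = np.random.default_rng(3)
X = rng.normal(size=(150, 10))
y = 2 * X[:, 0] + rng.normal(size=150)
rate = bootstrap_interaction_select(X, y)
```

Features whose vote rate clears a chosen cutoff (say 0.5) would form the final selection, whose coefficients are then re-estimated on the full data, mirroring the pooling step in the abstract.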

