Robust logistic zero-sum regression for microbiome compositional data

Author(s):  
G. S. Monti ◽  
P. Filzmoser

Abstract: We introduce the Robust Logistic Zero-Sum Regression (RobLZS) estimator, which can be used for a two-class problem with high-dimensional compositional covariates. Because it employs the log-contrast model, the estimator can perform feature selection among the compositional parts. The proposed method attains robustness by minimizing a trimmed sum of deviances. A comparison of the performance of the RobLZS estimator with a non-robust counterpart and with other sparse logistic regression estimators is conducted via Monte Carlo simulation studies. Two microbiome data applications are considered to investigate the stability of the estimators in the presence of outliers. Robust Logistic Zero-Sum Regression is available as an R package that can be downloaded at https://github.com/giannamonti/RobZS.
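As an illustration of the trimming idea only, the following minimal Python sketch sums the h smallest binomial deviances, so the worst-fitting observations do not drive the objective. The function names, the toy labels, and the fitted probabilities are hypothetical; the sparsity penalty and the zero-sum constraint of the estimator are omitted, and this is not the RobZS package's code.

```python
import math


def deviance(y, p):
    """Binomial deviance contribution of one observation with label y in {0, 1}."""
    return -2.0 * math.log(p if y == 1 else 1.0 - p)


def trimmed_deviance(labels, probs, h):
    """Sum of the h smallest deviances; the n - h worst-fitting points are trimmed."""
    devs = sorted(deviance(y, p) for y, p in zip(labels, probs))
    return sum(devs[:h])


# Hypothetical toy data: the last observation (label 0, fitted P(y=1) = 0.99)
# is fitted very badly and plays the role of an outlier.
labels = [1, 0, 1, 0]
probs = [0.9, 0.1, 0.8, 0.99]

full = trimmed_deviance(labels, probs, h=4)     # untrimmed sum of deviances
trimmed = trimmed_deviance(labels, probs, h=3)  # outlier's deviance is dropped
print(round(full, 3), round(trimmed, 3))
```

With h equal to the sample size the function reduces to the ordinary deviance, which is why the untrimmed value is dominated by the single badly fitted point.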

Biometrika ◽  
2021 ◽  
Author(s):  
Pixu Shi ◽  
Yuchen Zhou ◽  
Anru R Zhang

Abstract: In microbiome and genomic studies, the regression of compositional data has been a crucial tool for identifying microbial taxa or genes that are associated with clinical phenotypes. To account for the variation in sequencing depth, the classic log-contrast model is often used, in which read counts are normalized into compositions. However, zero read counts and the randomness in covariates remain critical issues. In this article, we introduce a surprisingly simple, interpretable, and efficient method for the estimation of compositional data regression through the lens of a novel high-dimensional log-error-in-variable regression model. The proposed method corrects for possible overdispersion in sequencing data and, at the same time, avoids any subjective imputation of zero read counts. We provide theoretical justifications with matching upper and lower bounds for the estimation error. The merit of the procedure is illustrated through real data analysis and simulation studies.
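To illustrate why the classic log-contrast model accommodates varying sequencing depth, here is a minimal Python sketch. It assumes strictly positive counts and coefficients that sum to zero; the counts and coefficients are hypothetical, and the sketch shows only the classic model, not the log-error-in-variable method proposed in the article.

```python
import math


def log_contrast_predictor(x, beta):
    """Linear predictor sum_j beta_j * log(x_j) under the zero-sum constraint."""
    if abs(sum(beta)) > 1e-12:
        raise ValueError("log-contrast coefficients must sum to zero")
    return sum(b * math.log(v) for b, v in zip(beta, x))


counts = [120.0, 30.0, 50.0]   # hypothetical positive read counts for three taxa
beta = [0.5, -1.5, 1.0]        # hypothetical coefficients, summing to zero

total = sum(counts)
proportions = [c / total for c in counts]  # counts normalized into a composition

eta_counts = log_contrast_predictor(counts, beta)
eta_props = log_contrast_predictor(proportions, beta)
print(eta_counts, eta_props)  # equal up to floating-point rounding
```

Rescaling every part by a common factor c adds log(c) times the sum of the coefficients to the predictor, and that sum is zero. The sketch also makes the zero-count problem visible: math.log fails on a zero entry.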


2017 ◽  
Author(s):  
Michael B. Sohn ◽  
Hongzhe Li

Abstract: Motivated by recent advances in causal mediation analysis and problems in the analysis of microbiome data, we consider the setting where the effect of a treatment on an outcome is transmitted through perturbing the microbial communities or compositional mediators. The compositional and high-dimensional nature of such mediators means that standard mediation analysis is not directly applicable to our setting. We propose a sparse compositional mediation model that can be used to estimate the causal direct and indirect (or mediation) effects utilizing the algebra for compositional data in the simplex space. We also propose bootstrap tests of total and component-wise mediation effects. We conduct extensive simulation studies to assess the performance of the proposed method and apply the method to a real metagenomic dataset to investigate the effect of fat intake on body mass index mediated through the gut microbiome composition.


Author(s):  
Michael B Sohn ◽  
Jiarui Lu ◽  
Hongzhe Li

Abstract: Motivation: The delicate balance of the microbiome is implicated in our health and is shaped by external factors, such as diet and xenobiotics. Therefore, understanding the role of the microbiome in linking external factors and our health conditions is crucial to translate microbiome research into therapeutic and preventative applications. Results: We introduced a sparse compositional mediation model for binary outcomes to estimate and test the mediation effects of the microbiome utilizing the compositional algebra defined in the simplex space and a linear zero-sum constraint on probit regression coefficients. For this model with the standard causal assumptions, we showed that both the causal direct and indirect effects are identifiable. We further developed a method for sensitivity analysis for the assumption of no unmeasured confounding between the mediator and the outcome. We conducted extensive simulation studies to assess the performance of the proposed method and applied it to real microbiome data to study mediation effects of the microbiome on linking fat intake to overweight/obesity. Availability and implementation: An R package can be downloaded from https://github.com/mbsohn/cmmb. Supplementary information: Supplementary files are available at Bioinformatics online.


2017 ◽  
Vol 2017 ◽  
pp. 1-14 ◽  
Author(s):  
Anne-Laure Boulesteix ◽  
Riccardo De Bin ◽  
Xiaoyu Jiang ◽  
Mathias Fuchs

As modern biotechnologies advance, it has become increasingly frequent that different modalities of high-dimensional molecular data (termed “omics” data in this paper), such as gene expression, methylation, and copy number, are collected from the same patient cohort to predict the clinical outcome. While prediction based on omics data has been widely studied in the last fifteen years, little has been done in the statistical literature on the integration of multiple omics modalities to select a subset of variables for prediction, which is a critical task in personalized medicine. In this paper, we propose a simple penalized regression method to address this problem by assigning different penalty factors to different data modalities for feature selection and prediction. The penalty factors can be chosen in a fully data-driven fashion by cross-validation or by taking practical considerations into account. In simulation studies, we compare the prediction performance of our approach, called IPF-LASSO (Integrative LASSO with Penalty Factors) and implemented in the R package ipflasso, with the standard LASSO and sparse group LASSO. The use of IPF-LASSO is also illustrated through applications to two real-life cancer datasets. All data and codes are available on the companion website to ensure reproducibility.
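The idea of assigning a separate penalty factor to each data modality can be sketched as a weighted L1 penalty. The Python snippet below is a minimal illustration with hypothetical coefficient blocks and penalty values; it computes only the penalty term and is not the ipflasso package's implementation.

```python
def modality_weighted_l1(beta_blocks, lambdas):
    """Penalty sum_m lambda_m * ||beta^(m)||_1, one penalty factor per modality."""
    if len(beta_blocks) != len(lambdas):
        raise ValueError("need exactly one penalty factor per modality")
    return sum(lam * sum(abs(b) for b in block)
               for block, lam in zip(beta_blocks, lambdas))


# Hypothetical coefficients for two modalities, e.g. expression and methylation.
beta_blocks = [[1.0, -2.0], [0.5, 0.0, 0.0]]
lambdas = [1.0, 4.0]  # the second modality is penalized four times as strongly

penalty = modality_weighted_l1(beta_blocks, lambdas)
print(penalty)
```

Setting all penalty factors equal recovers the standard LASSO penalty, so the per-modality factors are the only extra tuning parameters in this sketch.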


2019 ◽  
Author(s):  
Arun Srinivasan ◽  
Lingzhou Xue ◽  
Xiang Zhan

Summary: A critical task in microbiome data analysis is to explore the association between a scalar response of interest and a large number of microbial taxa that are summarized as compositional data at different taxonomic levels. Motivated by fine-mapping of the microbiome, we propose a two-step compositional knockoff filter (CKF) to provide effective finite-sample false discovery rate (FDR) control in high-dimensional linear log-contrast regression analysis of microbiome compositional data. In the first step, we employ the compositional screening procedure to remove insignificant microbial taxa while retaining the essential sum-to-zero constraint. In the second step, we extend the knockoff filter to identify the significant microbial taxa in the sparse regression model for compositional data. Thereby, a subset of the microbes is selected from the high-dimensional microbial taxa as related to the response using a pre-specified FDR threshold. We study the asymptotic properties of the proposed two-step procedure, including both sure screening and effective false discovery control. We demonstrate the finite-sample properties in simulation studies, which show the gain in empirical power while controlling the nominal FDR. The potential usefulness of the proposed method is also illustrated with an application to an inflammatory bowel disease dataset to identify microbial taxa that influence host gene expression.
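The final selection step of a knockoff filter can be sketched generically. The Python snippet below implements the standard knockoff+ threshold rule for a pre-specified FDR level q, given feature statistics W in which large positive values favor the original feature over its knockoff. The statistics are hypothetical, and the compositional screening and knockoff construction of CKF are not shown.

```python
def knockoff_plus_select(w, q):
    """Return (threshold, selected indices) under the knockoff+ rule.

    The threshold is the smallest t among the nonzero |W_j| such that
    (1 + #{j: W_j <= -t}) / max(1, #{j: W_j >= t}) <= q.
    """
    candidates = sorted({abs(v) for v in w if v != 0})
    for t in candidates:
        negatives = sum(1 for v in w if v <= -t)
        positives = sum(1 for v in w if v >= t)
        if (1 + negatives) / max(positives, 1) <= q:
            return t, [j for j, v in enumerate(w) if v >= t]
    return float("inf"), []  # no threshold meets the target level


# Hypothetical feature statistics for eight taxa.
w_stats = [5.0, 4.0, 3.5, -1.0, 2.0, 0.5, -0.3, 3.0]
threshold, selected = knockoff_plus_select(w_stats, q=0.25)
print(threshold, selected)
```

Negative statistics act as an estimate of the number of false positives among the positive ones, which is why the ratio in the rule serves as an estimated false discovery proportion.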


2014 ◽  
Vol 17 (4) ◽  
Author(s):  
Raymond K. Walters ◽  
Charles Laurin ◽  
Gitta H. Lubke

Epistasis is a growing area of research in genome-wide studies, but the differences between alternative definitions of epistasis remain a source of confusion for many researchers. One problem is that models for epistasis are presented in a number of formats, some of which have difficult-to-interpret parameters. In addition, the relation between the different models is rarely explained. Existing software for testing epistatic interactions between single-nucleotide polymorphisms (SNPs) does not provide the flexibility to compare the available model parameterizations. For that reason we have developed an R package for investigating epistatic and penetrance models, EpiPen, to aid users who wish to easily compare, interpret, and utilize models for two-locus epistatic interactions. EpiPen facilitates research on SNP-SNP interactions by allowing the R user to easily convert between common parametric forms for two-locus interactions, generate data for simulation studies, and perform power analyses for the selected model with a continuous or dichotomous phenotype. The usefulness of the package for model interpretation and power analysis is illustrated using data on rheumatoid arthritis.


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Lisa Amrhein ◽  
Christiane Fuchs

Abstract: Background: Tissues are often heterogeneous in their single-cell molecular expression, and this can govern the regulation of cell fate. For the understanding of development and disease, it is important to quantify heterogeneity in a given tissue. Results: We present the R package stochprofML which uses the maximum likelihood principle to parameterize heterogeneity from the cumulative expression of small random pools of cells. We evaluate the algorithm’s performance in simulation studies and present further application opportunities. Conclusion: Stochastic profiling outweighs the necessary demixing of mixed samples with a saving in experimental cost and effort and less measurement error. It offers possibilities for parameterizing heterogeneity, estimating underlying pool compositions and detecting differences between cell populations between samples.


1992 ◽  
Vol 6 ◽  
pp. 16-16 ◽  
Author(s):  
Richard K. Bambach ◽  
J. John Sepkoski

The first two ranks above the species level in the traditional Linnean hierarchy, the genus and family, are species based: genera have been erected to unify groups of morphologically similar, closely related species and families have been erected to group genera recognized as closely related because of the shared morphologic characteristics of their species. Diversity patterns of traditional genera and families thus appear congruent with those of species in (a) the Recent (e.g., latitudinal gradients in many groups), (b) compilations of all marine taxa for the entire Phanerozoic (including the stage level), (c) comparisons through time within individual taxa (e.g., Foraminifera, Rugosa, Conodonta), and (d) simulation studies. Genera and families often have a more robust fossil record of diversity than species, especially for poorly sampled groups (e.g., echinoids), because of the range-through record of these polytypic taxa. Simulation studies indicate that paraphyly among traditionally defined taxa is not a fatal problem for diversity studies; in fact, when degradation of the quality of the fossil record is modelled, both diversity and rates of origination and extinction are better represented by including paraphyletic taxa than by restricting data to monophyletic clades. This result underscores the utility of traditional rank-based analyses of the history of diversity.
In contrast, the three higher ranks of the Linnean hierarchy (orders, classes and phyla) are defined and recognized by key character complexes assumed to be rooted deep in the developmental program and, therefore, considered to be of special significance. These taxa are unified on the basis of body plan and function, not species morphology. Even if paraphyletic, recognition of such taxa is useful because they represent different functional complexes that reflect biological organization and major evolutionary innovations, often with different ecological capacities.
Phanerozoic diversity patterns of orders, classes and phyla are not congruent with those of lower taxa; the higher groups each increased rapidly in the early Paleozoic, during the explosive diversification of body plans in the Cambrian, and then remained stable or declined slightly after the Ordovician. The diversity history of orders superficially resembles that of lower taxa, but this is a result only of ordinal turnover among the Echinodermata coupled with ordinal radiation in the Chordata; it is not a highly damped signal derived from the diversity of species, genera, or families. Despite the stability of numbers among post-Ordovician Linnean higher taxa, the diversity of lower taxa within many of these Bauplan groups fluctuated widely, and these diversity patterns signal embedded ecologic information, such as differences in flexibility in filling or utilizing ecospace.
Phylogenetic analysis is vital for understanding the origins and genealogical structure of higher taxa. Only in such fashion can convergence and its implications for ecological constraints and/or opportunities be understood. But blind insistence on the use of monophyletic classifications in all studies would obscure some of the important information contained in traditional taxonomic groupings. The developmental modifications that characterize Linnean higher taxa (and traditionally separate them from their paraphyletic ancestral taxa) provide keys to understanding the role of shifting ecology in macroevolutionary success.


2021 ◽  
Vol 4 (1) ◽  
pp. 251524592097262
Author(s):  
Don van Ravenzwaaij ◽  
Alexander Etz

When social scientists wish to learn about an empirical phenomenon, they perform an experiment. When they wish to learn about a complex numerical phenomenon, they can perform a simulation study. The goal of this Tutorial is twofold. First, it introduces how to set up a simulation study using the relatively simple example of simulating from the prior. Second, it demonstrates how simulation can be used to learn about the Jeffreys-Zellner-Siow (JZS) Bayes factor, a currently popular implementation of the Bayes factor employed in the BayesFactor R package and freeware program JASP. Many technical expositions on Bayes factors exist, but these may be somewhat inaccessible to researchers who are not specialized in statistics. In a step-by-step approach, this Tutorial shows how a simple simulation script can be used to approximate the calculation of the Bayes factor. We explain how a researcher can write such a sampler to approximate Bayes factors in a few lines of code, what the logic is behind the Savage-Dickey method used to visualize Bayes factors, and what the practical differences are for different choices of the prior distribution used to calculate Bayes factors.
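The idea of approximating a Bayes factor by simulating from the prior can be sketched in a few lines of Python. This sketch deliberately swaps the JZS prior for a simpler setting, assumed here so that the answer can be checked analytically: normal data with known unit variance, H0: mu = 0, and a normal prior on mu under H1. The function names and the data summary (sample mean 0.5, n = 20) are hypothetical, and this is not the BayesFactor package's algorithm.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def bf10_monte_carlo(ybar, n, prior_sd, draws, seed=1):
    """Approximate BF10 by averaging the likelihood of ybar over prior draws of mu."""
    rng = random.Random(seed)
    se2 = 1.0 / n  # variance of the sample mean when the data variance is 1
    total = 0.0
    for _ in range(draws):
        mu = rng.gauss(0.0, prior_sd)        # simulate from the prior under H1
        total += normal_pdf(ybar, mu, se2)   # likelihood of the data at this mu
    marginal_h1 = total / draws              # Monte Carlo marginal likelihood
    return marginal_h1 / normal_pdf(ybar, 0.0, se2)


def bf10_exact(ybar, n, prior_sd):
    """Closed-form BF10 for the same normal-normal model, used as a check."""
    se2 = 1.0 / n
    return normal_pdf(ybar, 0.0, prior_sd ** 2 + se2) / normal_pdf(ybar, 0.0, se2)


approx = bf10_monte_carlo(ybar=0.5, n=20, prior_sd=1.0, draws=100_000)
exact = bf10_exact(ybar=0.5, n=20, prior_sd=1.0)
print(approx, exact)
```

Working with the sample mean is enough here because it is a sufficient statistic for mu, so the remaining likelihood factors cancel in the ratio. Changing prior_sd shows how the choice of prior scale moves the Bayes factor.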

