A Rarefaction-Without-Resampling Extension of PERMANOVA for Testing Presence-Absence Associations in the Microbiome
Abstract
Background: PERMANOVA [1] is currently the most commonly used method for testing community-level hypotheses about microbiome associations with covariates of interest. PERMANOVA can test for associations that result from changes in which taxa are present or absent by using the Jaccard or unweighted UniFrac distance. However, such presence-absence analyses face a unique challenge: confounding by library size (total sample read count), which occurs when library size is associated with covariates in the analysis. It is known that rarefaction (subsampling to a common library size) controls this bias, but at the potential cost of losing information and introducing a stochastic component into the analysis.
Methods: Here we develop a non-stochastic approach to PERMANOVA presence-absence analyses that aggregates information over all potential rarefaction replicates without actual resampling, when the Jaccard or unweighted UniFrac distance is used. We compare this new approach to three possible ways of aggregating PERMANOVA over multiple rarefactions obtained from resampling: averaging the distance matrix, averaging the (element-wise) squared distance matrix, and averaging the F-statistic.
Results: Our simulations indicate that our non-stochastic approach is robust to confounding by library size and outperforms each of the stochastic resampling approaches. We also show that, when overdispersion is low, averaging the (element-wise) squared distance outperforms averaging the unsquared distance, which is the approach currently implemented in the R package vegan. We illustrate our methods using an analysis of data on inflammatory bowel disease (IBD) in which samples from case participants have systematically smaller library sizes than samples from control participants.
Conclusions: Our extension of PERMANOVA for presence-absence analyses using a non-stochastic approach that aggregates information over all potential rarefaction replicates without actual resampling is robust to confounding by library size and outperforms stochastic resampling approaches.
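
To give a sense of how information can be aggregated over all rarefaction replicates without drawing any of them, the following minimal Python sketch computes the probability that a taxon is still observed after rarefying a sample without replacement. This is a standard hypergeometric result and is not the estimator developed in this paper; the function name `presence_probability` and the example read counts are hypothetical and chosen only for illustration.

```python
import math


def presence_probability(count, library_size, depth):
    """Probability that a taxon with `count` reads is observed at least once
    when `depth` reads are drawn without replacement from `library_size` reads.

    Illustrative helper (hypothetical name), based on the hypergeometric
    identity P(absent) = C(N - x, n) / C(N, n).
    """
    if not 0 <= count <= library_size:
        raise ValueError("count must lie between 0 and library_size")
    if not 0 <= depth <= library_size:
        raise ValueError("depth must lie between 0 and library_size")
    if count == 0 or depth == 0:
        return 0.0
    absent_reads = library_size - count
    if depth > absent_reads:
        # Too few reads from other taxa to fill the subsample: always present.
        return 1.0
    # log of C(N - x, n) / C(N, n), computed with lgamma for numerical stability
    log_p_absent = (
        math.lgamma(absent_reads + 1)
        + math.lgamma(library_size - depth + 1)
        - math.lgamma(absent_reads - depth + 1)
        - math.lgamma(library_size + 1)
    )
    # 1 - exp(log_p_absent), clamped to guard against rounding below zero
    return max(0.0, -math.expm1(log_p_absent))


# Hypothetical sample: read counts for four taxa, rarefied to 10 reads.
counts = [1, 5, 0, 94]
library_size = sum(counts)
depth = 10

probs = [presence_probability(c, library_size, depth) for c in counts]
# Expected number of taxa observed, averaged over every possible replicate.
expected_richness = sum(probs)
```

In this example the taxon with a single read out of 100 is retained with probability 10/100 = 0.1, a value obtained exactly rather than estimated from repeated subsampling. The sketch covers only per-taxon presence in one sample; turning such probabilities into an expected Jaccard or unweighted UniFrac distance between two samples requires additional steps that are not shown here.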