Compositional Data Analysis of Microbiome and Any-Omics Datasets: A Validation of the Additive Logratio Transformation

Microbiome and omics datasets are, by their intrinsic biological nature, of high dimensionality, characterized by counts of large numbers of components (microbial genes, operational taxonomic units, RNA transcripts, etc.). These data are generally regarded as compositional since the total number of counts identified within a sample is irrelevant. The central concept in compositional data analysis is the logratio transformation, the simplest being the additive logratios with respect to a fixed reference component. A full set of additive logratios is not isometric, that is, they do not reproduce the geometry of all pairwise logratios exactly, but their lack of isometry can be measured by the Procrustes correlation. The reference component can be chosen to maximize the Procrustes correlation between the additive logratio geometry and the exact logratio geometry, and for high-dimensional data there are many potential references. As a secondary criterion, minimizing the variance of the reference component's log-transformed relative abundance values makes the subsequent interpretation of the logratios even easier. On each of three high-dimensional omics datasets the additive logratio transformation was performed, using references that were identified according to the abovementioned criteria. For each dataset the compositional data structure was successfully reproduced, that is, the additive logratios were very close to being isometric. The Procrustes correlations achieved for these datasets were 0.9991, 0.9974, and 0.9902, respectively. We thus demonstrate, for high-dimensional compositional data, that additive logratios can provide a valid choice as transformed variables, which (a) are subcompositionally coherent, (b) explain 100% of the total logratio variance and (c) come measurably very close to being isometric. The interpretation of additive logratios is much simpler than the complex isometric alternatives and, when the variance of the log-transformed reference is very low, it is even simpler since each additive logratio can be identified with a corresponding compositional component.
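
The reference-selection procedure described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' code: it assumes strictly positive data (no zeros), uses unweighted logratios, and takes the centered logratios as a stand-in for the exact logratio geometry, since their Euclidean distances match those of all pairwise logratios up to a constant scale factor. The function names (`close`, `clr`, `alr`, `procrustes_correlation`, `rank_alr_references`) are hypothetical and chosen only for this illustration.

```python
import numpy as np


def close(counts):
    """Rescale each row (sample) of a positive matrix to sum to 1."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum(axis=1, keepdims=True)


def clr(comp):
    """Centered logratios (assumed stand-in for the exact logratio geometry)."""
    logs = np.log(comp)
    return logs - logs.mean(axis=1, keepdims=True)


def alr(comp, ref):
    """Additive logratios log(x_j / x_ref) for every component j != ref."""
    logs = np.log(comp)
    return np.delete(logs - logs[:, [ref]], ref, axis=1)


def procrustes_correlation(a, b):
    """Procrustes correlation between two configurations of the same samples.

    Both matrices are column-centered and scaled to unit Frobenius norm;
    the correlation is then the sum of singular values of a.T @ b.
    """
    a = a - a.mean(axis=0)
    b = b - b.mean(axis=0)
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return float(np.linalg.svd(a.T @ b, compute_uv=False).sum())


def rank_alr_references(counts):
    """Rank candidate references: highest Procrustes correlation first,
    ties broken by lowest variance of the log relative abundance."""
    comp = close(counts)
    target = clr(comp)
    n_parts = comp.shape[1]
    corr = np.array(
        [procrustes_correlation(target, alr(comp, j)) for j in range(n_parts)]
    )
    log_var = np.log(comp).var(axis=0, ddof=1)
    # np.lexsort treats the LAST key as the primary sort key.
    order = np.lexsort((log_var, -corr))
    return order, corr, log_var


# Simulated positive data, purely for illustration (30 samples, 8 components).
rng = np.random.default_rng(0)
demo_counts = rng.lognormal(mean=3.0, sigma=1.0, size=(30, 8))
demo_order, demo_corr, demo_log_var = rank_alr_references(demo_counts)
print("candidate references, best first:", demo_order)
```

In practice one would also check, as the abstract of the preprint version notes, that the chosen reference is well populated rather than rare; that check is left out of this sketch.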

Compositional data analysis of microbiome and any-omics datasets: a revalidation of the additive logratio transformation

10.1101/2021.05.15.444300 ◽

2021 ◽

Author(s):

Michael Greenacre ◽

Marina Martinez-Alvaro ◽

Agustin Blasco

Keyword(s):

Data Analysis ◽

Compositional Data ◽

High Dimensional ◽

Compositional Data Analysis ◽

Rna Transcripts ◽

Operational Taxonomic Units ◽

Logratio Transformation ◽

Large Numbers ◽

Fixed Reference ◽

Reference Component

Background: Microbiome and omics datasets are, by their intrinsic biological nature, of high dimensionality, characterized by counts of large numbers of components (microbial genes, operational taxonomic units, RNA transcripts, etc.). These data are generally regarded as compositional since the total number of counts identified within a sample is irrelevant. The central concept in compositional data analysis is the logratio transformation, the simplest being the additive logratios with respect to a fixed reference component. A full set of additive logratios is not isometric in the sense of reproducing the geometry of all pairwise logratios exactly, but their lack of isometry can be measured by the Procrustes correlation. The reference component can be chosen to maximize the Procrustes correlation between the additive logratio geometry and the exact logratio geometry, and for high-dimensional data there are many potential references. As a secondary criterion, minimizing the variance of the reference component's log-transformed relative abundance values makes the subsequent interpretation of the logratios even easier. Finally, it is preferable that the reference component not be a rare component but well populated, and substantive biological reasons might also guide the choice if several reference candidates are identified. Results: On each of three high-dimensional datasets the additive logratio transformation was performed, using references that were identified according to the abovementioned criteria. For each dataset the compositional data structure was successfully reproduced, that is, the additive logratios were very close to being isometric. The Procrustes correlations achieved for these datasets were 0.9991, 0.9977 and 0.9997, respectively. In the third case, where the objective was to distinguish between three groups of samples, the approximation was made to the restricted logratio space of the between-group variance.
Conclusions: We show that, for high-dimensional compositional data, additive logratios can provide a valid choice as transformed variables that (1) are subcompositionally coherent, (2) explain 100% of the total logratio variance and (3) come measurably very close to being isometric, that is, they approximate the exact logratio geometry almost perfectly. The interpretation of additive logratios is simple and, when the variance of the log-transformed reference is very low, it is made even simpler since each additive logratio can be identified with a corresponding compositional component.

High-dimensional Log-Error-in-Variable Regression with Applications to Microbial Compositional Data Analysis

Biometrika ◽

10.1093/biomet/asab020 ◽

2021 ◽

Author(s):

Pixu Shi ◽

Yuchen Zhou ◽

Anru R Zhang

Keyword(s):

Data Analysis ◽

Compositional Data ◽

Estimation Error ◽

Real Data ◽

Upper And Lower Bounds ◽

High Dimensional ◽

Compositional Data Analysis ◽

Sequencing Data ◽

Contrast Model ◽

Critical Issues

Abstract: In microbiome and genomic studies, the regression of compositional data has been a crucial tool for identifying microbial taxa or genes that are associated with clinical phenotypes. To account for the variation in sequencing depth, the classic log-contrast model is often used, in which read counts are normalized into compositions. However, zero read counts and the randomness in covariates remain critical issues. In this article, we introduce a surprisingly simple, interpretable, and efficient method for the estimation of compositional data regression through the lens of a novel high-dimensional log-error-in-variable regression model. The proposed method both corrects sequencing data for possible overdispersion and avoids any subjective imputation of zero read counts. We provide theoretical justifications with matching upper and lower bounds for the estimation error. The merit of the procedure is illustrated through real data analysis and simulation studies.
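
For orientation, the classic log-contrast model mentioned in this abstract regresses a response on the log-transformed components under the constraint that the coefficients sum to zero. The sketch below fits that classical model by ordinary least squares on additive logratios, which enforces the constraint automatically. It is not the log-error-in-variable method proposed in the paper; it assumes strictly positive compositions, and the function name `fit_log_contrast` and its `ref` parameter are hypothetical.

```python
import numpy as np


def fit_log_contrast(comp, y, ref=-1):
    """Least-squares fit of y = b0 + sum_j beta_j * log(x_j), sum_j beta_j = 0.

    The zero-sum constraint is imposed by regressing on log(x_j / x_ref)
    for j != ref and setting beta_ref = -(sum of the other coefficients).
    """
    comp = np.asarray(comp, dtype=float)
    y = np.asarray(y, dtype=float)
    n_parts = comp.shape[1]
    ref = ref % n_parts  # allow negative indexing for the reference
    logs = np.log(comp)
    others = [j for j in range(n_parts) if j != ref]
    design = np.column_stack(
        [np.ones(len(y)), logs[:, others] - logs[:, [ref]]]
    )
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    beta = np.empty(n_parts)
    beta[others] = coef[1:]
    beta[ref] = -coef[1:].sum()
    return coef[0], beta
```

Because of the zero-sum constraint, the fitted coefficients do not depend on which component is used as the reference, nor on whether raw counts or closed compositions are supplied.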

Microbiome, Metagenomics, and High-Dimensional Compositional Data Analysis

Annual Review of Statistics and Its Application ◽

10.1146/annurev-statistics-010814-020351 ◽

2015 ◽

Vol 2 (1) ◽

pp. 73-94 ◽

Cited By ~ 117

Author(s):

Hongzhe Li

Keyword(s):

Data Analysis ◽

Compositional Data ◽

High Dimensional ◽

Compositional Data Analysis

Compositional Data Analysis using Kernels in Mass Cytometry Data

10.1101/2021.05.08.443265 ◽

2021 ◽

Author(s):

Pratyaydipta Rudra ◽

Ryan Baxter ◽

Elena WY Hsieh ◽

Debashis Ghosh

Keyword(s):

Data Analysis ◽

Lupus Erythematosus ◽

Compositional Data ◽

Small Sample ◽

Supplementary Information ◽

High Dimensional ◽

Compositional Data Analysis ◽

Cell Type ◽

Mass Cytometry ◽

Abundance Data

Motivation: Cell type abundance data arising from mass cytometry experiments are compositional in nature. Classical association tests do not apply to the compositional data due to their non-Euclidean nature. Existing methods for analysis of cell type abundance data suffer from several limitations for high-dimensional mass cytometry data, especially when the sample size is small. Results: We proposed a new multivariate statistical learning methodology, Compositional Data Analysis using Kernels (CODAK), based on the kernel distance covariance (KDC) framework to test the association of the cell type compositions with important predictors (categorical or continuous) such as disease status. CODAK scales well for high-dimensional data and provides satisfactory performance for small sample sizes (n<25). We conducted simulation studies to compare the performance of the method with existing methods of analyzing cell type abundance data from mass cytometry studies. The method is also applied to a high-dimensional dataset containing different subgroups of populations including Systemic Lupus Erythematosus (SLE) patients and healthy control subjects. Availability and Implementation: CODAK is implemented using R. The codes and the data used in this manuscript are available on the web at http://github.com/GhoshLab/CODAK/. Supplementary information: Supplementary Materials.pdf.

Visualizing balances of compositional data: A new alternative to balance dendrograms

F1000Research ◽

10.12688/f1000research.15858.1 ◽

2018 ◽

Vol 7 ◽

pp. 1278 ◽

Cited By ~ 3

Author(s):

Thomas P. Quinn

Keyword(s):

Data Analysis ◽

Compositional Data ◽

High Dimensional Data ◽

Large Data ◽

High Dimensional ◽

Compositional Data Analysis ◽

Data Set ◽

R Programming Language ◽

Common Scale ◽

R Programming

Balances have become a cornerstone of compositional data analysis. However, conceptualizing balances is difficult, especially for high-dimensional data. Most often, investigators visualize balances with the balance dendrogram, but this technique is not necessarily intuitive and does not scale well for large data. This manuscript introduces the 'balance' package for the R programming language. This package visualizes balances of compositional data using an alternative to the balance dendrogram. This alternative contains the same information coded by the balance dendrogram, but projects data on a common scale that facilitates direct comparisons and accommodates high-dimensional data. By stripping the branches from the tree, 'balance' can cleanly visualize any subset of balances without disrupting the interpretation of the remaining balances. As an example, this package is applied to a publicly available meta-genomics data set measuring the relative abundance of 500 microbe taxa.
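
As background to this abstract, a balance is a normalized logratio of the geometric means of two disjoint groups of components. The sketch below computes a single balance from its standard definition. It is written in Python for illustration only and does not reproduce the API of the R 'balance' package; the function name `balance` and its arguments `num` and `den` (column indices of the two groups) are hypothetical, and strictly positive data are assumed.

```python
import numpy as np


def balance(comp, num, den):
    """Balance between component groups `num` and `den` for each sample.

    With r = len(num) and s = len(den), the balance is
    sqrt(r * s / (r + s)) * log(gmean(x[num]) / gmean(x[den])).
    """
    comp = np.asarray(comp, dtype=float)
    r, s = len(num), len(den)
    # Mean of logs equals the log of the geometric mean.
    log_gm_num = np.log(comp[:, num]).mean(axis=1)
    log_gm_den = np.log(comp[:, den]).mean(axis=1)
    return np.sqrt(r * s / (r + s)) * (log_gm_num - log_gm_den)
```

Since only ratios enter the formula, the result is unchanged if every sample is rescaled, for example from counts to relative abundances.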

High-dimensional count and compositional data analysis in microbiome studies

Scientia Sinica Mathematica ◽

10.1360/n012017-00147 ◽

2017 ◽

Vol 47 (12) ◽

pp. 1735-1760

Author(s):

DENG MingHua ◽

HE Shun ◽

WU ChangJing

Keyword(s):

Data Analysis ◽

Compositional Data ◽

High Dimensional ◽

Compositional Data Analysis

Author response for "The “Goldilocks Day” for children's skeletal health: compositional data analysis of 24‐hour activity behaviors"

10.1002/jbmr.4143/v2/response1 ◽

2020 ◽

Author(s):

Dorothea Dumuid ◽

Peter Simm ◽

Melissa Wake ◽

David Burgner ◽

Markus Juonala ◽

...

Keyword(s):

Data Analysis ◽

Compositional Data ◽

Author Response ◽

Compositional Data Analysis ◽

Skeletal Health

Review for "The “Goldilocks Day” for children's skeletal health: compositional data analysis of 24‐hour activity behaviors"

10.1002/jbmr.4143/v1/review1 ◽

2020 ◽

Keyword(s):

Data Analysis ◽

Compositional Data ◽

Compositional Data Analysis ◽

Skeletal Health

Insights on the characteristics and sources of gas from an underground coal mine using compositional data analysis

International Journal of Coal Geology ◽

10.1016/j.coal.2021.103767 ◽

2021 ◽

pp. 103767

Author(s):

C. Özgen Karacan ◽

Josep Antoni Martín-Fernández ◽

Leslie F. Ruppert ◽

Ricardo A. Olea

Keyword(s):

Data Analysis ◽

Coal Mine ◽

Compositional Data ◽

Compositional Data Analysis ◽

Underground Coal Mine

Effects of Two Randomized and Controlled Multi-Component Interventions Focusing On 24-Hour Movement Behavior among Office Workers: A Compositional Data Analysis

International Journal of Environmental Research and Public Health ◽

10.3390/ijerph18084191 ◽

2021 ◽

Vol 18 (8) ◽

pp. 4191

Author(s):

Lisa-Marie Larisch ◽

Emil Bojsen-Møller ◽

Carla F. J. Nooijen ◽

Victoria Blom ◽

Maria Ekblom ◽

...

Keyword(s):

Physical Activity ◽

Data Analysis ◽

Cardiorespiratory Fitness ◽

Leisure Time ◽

Compositional Data ◽

Office Workers ◽

Movement Behavior ◽

Compositional Data Analysis ◽

Intervention Effects ◽

Time In Bed

Intervention studies aiming at changing movement behavior have usually not accounted for the compositional nature of time-use data. Compositional data analysis (CoDA) has been suggested as a useful strategy for analyzing such data. The aim of this study was to examine the effects of two multi-component interventions on 24-h movement behavior (using CoDA) and on cardiorespiratory fitness among office workers; one focusing on reducing sedentariness and the other on increasing physical activity. Office workers (n = 263) were cluster randomized into one of two 6-month intervention groups, or a control group. Time spent in sedentary behavior, light-intensity, moderate and vigorous physical activity, and time in bed were assessed using accelerometers and diaries, both for 24 h in total, and for work and leisure time separately. Cardiorespiratory fitness was estimated using a sub-maximal cycle ergometer test. Intervention effects were analyzed using linear mixed models. Despite a thorough analysis of 24-h behaviors using CoDA, no intervention effects were found, either for 24-h behaviors in total or for work and leisure time behaviors separately. Cardiorespiratory fitness did not change significantly. Although the design of the multi-component interventions was based on theoretical frameworks and included cognitive behavioral therapy counselling, which has been proven effective in other populations, issues related to implementation of and compliance with some intervention components may have led to the observed lack of intervention effect.
