propr: An R-package for Identifying Proportionally Abundant Features Using Compositional Data Analysis

Abstract In the life sciences, many assays measure only the relative abundances of components for each sample. These data, called compositional data, require special handling in order to avoid misleading conclusions. For example, in the case of correlation, treating relative data like absolute data can lead to the discovery of falsely positive associations. Recently, researchers have proposed proportionality as a valid alternative to correlation for calculating pairwise association in relative data. Although the question of how to best measure proportionality remains open, we present here a computationally efficient R package that implements two proposed measures of proportionality. In an effort to advance the understanding and application of proportionality analysis, we review the mathematics behind proportionality, demonstrate its application to genomic data, and discuss some ongoing challenges in the analysis of relative abundance data.
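
As a rough illustration of what a proportionality measure computes, the sketch below uses Python rather than R and does not call the propr package. It assumes the two measures are the φ and ρ statistics as they are commonly defined on centered log-ratio (clr) transformed data; the abstract itself does not name them, and the function names `clr`, `phi`, and `rho` are illustrative only.

```python
import numpy as np

def clr(x):
    """Centered log-ratio transform of each row (sample); values must be positive."""
    logs = np.log(np.asarray(x, dtype=float))
    return logs - logs.mean(axis=1, keepdims=True)

def phi(a_i, a_j):
    """Assumed definition: var(a_i - a_j) / var(a_i); 0 means perfectly proportional."""
    return np.var(a_i - a_j, ddof=1) / np.var(a_i, ddof=1)

def rho(a_i, a_j):
    """Assumed definition: 1 - var(a_i - a_j) / (var(a_i) + var(a_j)); 1 means proportional."""
    return 1.0 - np.var(a_i - a_j, ddof=1) / (np.var(a_i, ddof=1) + np.var(a_j, ddof=1))

# Toy absolute abundances: feature 1 is always twice feature 0, feature 2 is unrelated.
rng = np.random.default_rng(0)
x0 = rng.uniform(1.0, 10.0, size=20)
x2 = rng.uniform(1.0, 10.0, size=20)
absolute = np.column_stack([x0, 2.0 * x0, x2])

# Keep only relative abundances, as a sequencing-style assay would.
relative = absolute / absolute.sum(axis=1, keepdims=True)

a = clr(relative)
phi_01 = phi(a[:, 0], a[:, 1])
rho_01 = rho(a[:, 0], a[:, 1])
print(phi_01, rho_01)  # phi is approximately 0 and rho approximately 1
```

Because the log-ratio of two proportional features is constant across samples, its variance vanishes even after the absolute totals are discarded, which is why both statistics flag the pair.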

Variable selection in microbiome compositional data analysis

NAR Genomics and Bioinformatics ◽

10.1093/nargab/lqaa029 ◽

2020 ◽

Vol 2 (2) ◽

Cited By ~ 2

Author(s):

Antoni Susin ◽

Yiwen Wang ◽

Kim-Anh Lê Cao ◽

M Luz Calle

Keyword(s):

Data Analysis ◽

Variable Selection ◽

Compositional Data ◽

Penalized Regression ◽

Compositional Data Analysis ◽

Forward Selection ◽

Computationally Efficient ◽

Parsimonious Model ◽

Microbiome Data ◽

Log Ratio

Abstract Though variable selection is one of the most relevant tasks in microbiome analysis, e.g. for the identification of microbial signatures, many studies still rely on methods that ignore the compositional nature of microbiome data. The applicability of compositional data analysis methods has been hampered by the availability of software and the difficulty in interpreting their results. This work is focused on three methods for variable selection that acknowledge the compositional structure of microbiome data: selbal, a forward selection approach for the identification of compositional balances, and clr-lasso and coda-lasso, two penalized regression models for compositional data analysis. This study highlights the link between these methods and brings out some limitations of the centered log-ratio transformation for variable selection. In particular, the fact that it is not subcompositionally consistent makes the microbial signatures obtained from clr-lasso not readily transferable. Coda-lasso is computationally efficient and suitable when the focus is the identification of the most associated microbial taxa. Selbal stands out when the goal is to obtain a parsimonious model with optimal prediction performance, but it is computationally greedy. We provide a reproducible vignette for the application of these methods that will enable researchers to fully leverage their potential in microbiome studies.
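
The remark about subcompositional consistency can be made concrete with a small numeric sketch. The data and the `clr` helper below are hypothetical and are not taken from the paper or its vignette; the point is only that clr values of the same taxa change when a taxon is dropped, while pairwise log-ratios do not.

```python
import numpy as np

def clr(x):
    """Centered log-ratio transform of each row (sample); values must be positive."""
    logs = np.log(np.asarray(x, dtype=float))
    return logs - logs.mean(axis=1, keepdims=True)

# Two hypothetical samples over four taxa; they differ only in the fourth taxon.
full = np.array([[10.0, 20.0, 30.0, 40.0],
                 [10.0, 20.0, 30.0, 400.0]])
sub = full[:, :3]  # subcomposition: the fourth taxon is not observed

clr_full = clr(full)[:, :3]  # clr of the first three taxa within the full composition
clr_sub = clr(sub)           # clr of the same taxa within the subcomposition

# In the subcomposition the two samples look identical; in the full one they do not.
same_in_sub = np.allclose(clr_sub[0], clr_sub[1])
same_in_full = np.allclose(clr_full[0], clr_full[1])

# Pairwise log-ratios are unaffected by which other taxa are present.
ratio_full = clr_full[:, 0] - clr_full[:, 1]
ratio_sub = clr_sub[:, 0] - clr_sub[:, 1]
print(same_in_sub, same_in_full, np.allclose(ratio_full, ratio_sub))
```

A model fitted on clr covariates therefore depends on the full set of taxa used in the transformation, which is one way to read the statement that clr-lasso signatures are not readily transferable.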

Analysis of relative abundances with zeros on environmental gradients: a multinomial regression model

PeerJ ◽

10.7717/peerj.5643 ◽

2018 ◽

Vol 6 ◽

pp. e5643 ◽

Cited By ~ 2

Author(s):

Fiona Chong ◽

Matthew Spencer

Keyword(s):

Data Analysis ◽

Regression Model ◽

Compositional Data ◽

Environmental Gradients ◽

Compositional Data Analysis ◽

Multinomial Regression ◽

Point Count ◽

Compositional Approach ◽

Multinomial Regression Model ◽

Relative Abundances

Ecologists often analyze relative abundances, which are an example of compositional data. However, they have made surprisingly little use of recent advances in the field of compositional data analysis. Compositions form a vector space in which addition and scalar multiplication are replaced by operations known as perturbation and powering. This algebraic structure makes it easy to understand how relative abundances change along environmental gradients. We illustrate this with an analysis of changes in hard-substrate marine communities along a depth gradient. We fit a quadratic multivariate regression model with multinomial observations to point count data obtained from video transects. As well as being an appropriate observation model in this case, the multinomial deals with the problem of zeros, which often makes compositional data analysis difficult. We show how the algebra of compositions can be used to understand patterns in dissimilarity. We use the calculus of simplex-valued functions to estimate rates of change, and to summarize the structure of the community over a vertical slice. We discuss the benefits of the compositional approach in the interpretation and visualization of relative abundance data.
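
The two operations named in the abstract can be sketched in a few lines. This assumes the standard definitions from compositional data analysis, in which perturbation is a component-wise product followed by closure and powering is a component-wise power followed by closure; the function names and example values are illustrative, not the authors' code.

```python
import numpy as np

def closure(x):
    """Rescale a positive vector so that its parts sum to 1."""
    x = np.asarray(x, dtype=float)
    return x / x.sum()

def perturb(x, y):
    """Perturbation: plays the role of vector addition on the simplex."""
    return closure(np.asarray(x, dtype=float) * np.asarray(y, dtype=float))

def power(a, x):
    """Powering: plays the role of scalar multiplication on the simplex."""
    return closure(np.asarray(x, dtype=float) ** a)

x = closure([1.0, 2.0, 7.0])
y = closure([3.0, 3.0, 4.0])
neutral = closure([1.0, 1.0, 1.0])  # the "zero vector": the uniform composition

# Vector-space properties, checked numerically:
has_identity = np.allclose(perturb(x, neutral), x)
has_inverse = np.allclose(perturb(x, power(-1.0, x)), neutral)
distributes = np.allclose(power(2.0, perturb(x, y)),
                          perturb(power(2.0, x), power(2.0, y)))
print(has_identity, has_inverse, distributes)
```

Under these definitions a change in relative abundances along a gradient can be written as a perturbation applied to the starting composition, which is what makes the algebra convenient for describing such changes.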

Compositional Data Analysis using Kernels in Mass Cytometry Data

10.1101/2021.05.08.443265 ◽

2021 ◽

Author(s):

Pratyaydipta Rudra ◽

Ryan Baxter ◽

Elena WY Hsieh ◽

Debashis Ghosh

Keyword(s):

Data Analysis ◽

Lupus Erythematosus ◽

Compositional Data ◽

Small Sample ◽

Supplementary Information ◽

High Dimensional ◽

Compositional Data Analysis ◽

Cell Type ◽

Mass Cytometry ◽

Abundance Data

Motivation: Cell type abundance data arising from mass cytometry experiments are compositional in nature. Classical association tests do not apply to the compositional data due to their non-Euclidean nature. Existing methods for analysis of cell type abundance data suffer from several limitations for high-dimensional mass cytometry data, especially when the sample size is small. Results: We proposed a new multivariate statistical learning methodology, Compositional Data Analysis using Kernels (CODAK), based on the kernel distance covariance (KDC) framework to test the association of the cell type compositions with important predictors (categorical or continuous) such as disease status. CODAK scales well for high-dimensional data and provides satisfactory performance for small sample sizes (n<25). We conducted simulation studies to compare the performance of the method with existing methods of analyzing cell type abundance data from mass cytometry studies. The method is also applied to a high-dimensional dataset containing different subgroups of populations including Systemic Lupus Erythematosus (SLE) patients and healthy control subjects. Availability and Implementation: CODAK is implemented using R. The codes and the data used in this manuscript are available on the web at http://github.com/GhoshLab/CODAK/. Supplementary information: Supplementary Materials.pdf.

propr: An R-package for Identifying Proportionally Abundant Features Using Compositional Data Analysis

Scientific Reports ◽

10.1038/s41598-017-16520-0 ◽

2017 ◽

Vol 7 (1) ◽

Cited By ~ 48

Author(s):

Thomas P. Quinn ◽

Mark F. Richardson ◽

David Lovell ◽

Tamsyn M. Crowley

Keyword(s):

Data Analysis ◽

Compositional Data ◽

R Package ◽

Compositional Data Analysis

Author response for "The “Goldilocks Day” for children's skeletal health: compositional data analysis of 24‐hour activity behaviors"

10.1002/jbmr.4143/v2/response1 ◽

2020 ◽

Author(s):

Dorothea Dumuid ◽

Peter Simm ◽

Melissa Wake ◽

David Burgner ◽

Markus Juonala ◽

...

Keyword(s):

Data Analysis ◽

Compositional Data ◽

Author Response ◽

Compositional Data Analysis ◽

Skeletal Health

Review for "The “Goldilocks Day” for children's skeletal health: compositional data analysis of 24‐hour activity behaviors"

10.1002/jbmr.4143/v1/review1 ◽

2020 ◽

Keyword(s):

Data Analysis ◽

Compositional Data ◽

Compositional Data Analysis ◽

Skeletal Health

Insights on the characteristics and sources of gas from an underground coal mine using compositional data analysis

International Journal of Coal Geology ◽

10.1016/j.coal.2021.103767 ◽

2021 ◽

pp. 103767

Author(s):

C. Özgen Karacan ◽

Josep Antoni Martín-Fernández ◽

Leslie F. Ruppert ◽

Ricardo A. Olea

Keyword(s):

Data Analysis ◽

Coal Mine ◽

Compositional Data ◽

Compositional Data Analysis ◽

Underground Coal Mine

Effects of Two Randomized and Controlled Multi-Component Interventions Focusing On 24-Hour Movement Behavior among Office Workers: A Compositional Data Analysis

International Journal of Environmental Research and Public Health ◽

10.3390/ijerph18084191 ◽

2021 ◽

Vol 18 (8) ◽

pp. 4191

Author(s):

Lisa-Marie Larisch ◽

Emil Bojsen-Møller ◽

Carla F. J. Nooijen ◽

Victoria Blom ◽

Maria Ekblom ◽

...

Keyword(s):

Physical Activity ◽

Data Analysis ◽

Cardiorespiratory Fitness ◽

Leisure Time ◽

Compositional Data ◽

Office Workers ◽

Movement Behavior ◽

Compositional Data Analysis ◽

Intervention Effects ◽

Time In Bed

Intervention studies aiming at changing movement behavior have usually not accounted for the compositional nature of time-use data. Compositional data analysis (CoDA) has been suggested as a useful strategy for analyzing such data. The aim of this study was to examine the effects of two multi-component interventions on 24-h movement behavior (using CoDA) and on cardiorespiratory fitness among office workers; one focusing on reducing sedentariness and the other on increasing physical activity. Office workers (n = 263) were cluster randomized into one of two 6-month intervention groups, or a control group. Time spent in sedentary behavior, light-intensity, moderate and vigorous physical activity, and time in bed were assessed using accelerometers and diaries, both for 24 h in total, and for work and leisure time separately. Cardiorespiratory fitness was estimated using a sub-maximal cycle ergometer test. Intervention effects were analyzed using linear mixed models. Despite a thorough analysis of 24-h behaviors using CoDA, no intervention effects were found, either for 24-h behaviors in total or for work and leisure time behaviors separately, and cardiorespiratory fitness did not change significantly. Although the design of the multi-component interventions was based on theoretical frameworks, and included cognitive behavioral therapy counselling, which has been proven effective in other populations, issues related to implementation of and compliance with some intervention components may have led to the observed lack of intervention effect.

The Gut Microbiota of Healthy Aged Chinese Is Similar to That of the Healthy Young

mSphere ◽

10.1128/msphere.00327-17 ◽

2017 ◽

Vol 2 (5) ◽

Cited By ~ 65

Author(s):

Gaorui Bian ◽

Gregory B. Gloor ◽

Aihua Gong ◽

Changsheng Jia ◽

Wei Zhang ◽

...

Keyword(s):

Data Analysis ◽

Gut Microbiota ◽

Large Scale ◽

Compositional Data ◽

Healthy Lifestyle ◽

Compositional Data Analysis ◽

Surprising Result ◽

Microbiota Composition ◽

Cross Sectional ◽

Age Cohorts

ABSTRACT The microbiota of the aged is variously described as being more or less diverse than that of younger cohorts, but the comparison groups used and the definitions of the aged population differ between experiments. The differences are often described by null hypothesis statistical tests, which are notoriously irreproducible when dealing with large multivariate samples. We collected and examined the gut microbiota of a cross-sectional cohort of more than 1,000 very healthy Chinese individuals who spanned ages from 3 to over 100 years. The analysis of 16S rRNA gene sequencing results used a compositional data analysis paradigm coupled with measures of effect size, where ordination, differential abundance, and correlation can be explored and analyzed in a unified and reproducible framework. Our analysis showed several surprising results compared to other cohorts. First, the overall microbiota composition of the healthy aged group was similar to that of people decades younger. Second, the major differences between groups in the gut microbiota profiles were found before age 20. Third, the gut microbiota differed little between individuals from the ages of 30 to >100. Fourth, the gut microbiota of males appeared to be more variable than that of females. Taken together, the present findings suggest that the microbiota of the healthy aged in this cross-sectional study differ little from that of the healthy young in the same population, although the minor variations that do exist depend upon the comparison cohort. IMPORTANCE We report the large-scale use of compositional data analysis to establish a baseline microbiota composition in an extremely healthy cohort of the Chinese population. This baseline will serve for comparison for future cohorts with chronic or acute disease. In addition to the expected difference in the microbiota of children and adults, we found that the microbiota of the elderly in this population was similar in almost all respects to that of healthy people in the same population who are scores of years younger. We speculate that this similarity is a consequence of an active healthy lifestyle and diet, although cause and effect cannot be ascribed in this (or any other) cross-sectional design. One surprising result was that the gut microbiota of persons in their 20s was distinct from those of other age cohorts, and this result was replicated, suggesting that it is a reproducible finding and distinct from those of other populations.
