A field guide for the compositional analysis of any-omics data

Abstract Background Next-generation sequencing (NGS) has made it possible to determine the sequence and relative abundance of all nucleotides in a biological or environmental sample. A cornerstone of NGS is the quantification of RNA or DNA presence as counts. However, these counts are not counts per se: their magnitude is determined arbitrarily by the sequencing depth, not by the input material. Consequently, counts must undergo normalization prior to use. Conventional normalization methods require a set of assumptions: they assume that the majority of features are unchanged and that all environments under study have the same carrying capacity for nucleotide synthesis. These assumptions are often untestable and may not hold when heterogeneous samples are compared. Results Methods developed within the field of compositional data analysis offer a general solution that is assumption-free and valid for all data. Herein, we synthesize the extant literature to provide a concise guide on how to apply compositional data analysis to NGS count data. Conclusions In highlighting the limitations of total library size, effective library size, and spike-in normalizations, we propose the log-ratio transformation as a general solution to answer the question, “Relative to some important activity of the cell, what is changing?”

A field guide for the compositional analysis of any-omics data

10.1101/484766 ◽

2018 ◽

Cited By ~ 3

Author(s):

Thomas P. Quinn ◽

Ionas Erb ◽

Greg Gloor ◽

Cedric Notredame ◽

Mark F. Richardson ◽

...

Keyword(s):

Data Analysis ◽

General Solution ◽

Compositional Data ◽

Compositional Analysis ◽

Compositional Data Analysis ◽

Sequencing Data ◽

Nucleotide Synthesis ◽

Library Size ◽

Concise Guide ◽

Log Ratio

Abstract Next-generation sequencing (NGS) has made it possible to determine the sequence and relative abundance of all nucleotides in a biological or environmental sample. Today, NGS is routinely used to understand many important topics in biology from human disease to microorganism diversity. A cornerstone of NGS is the quantification of RNA or DNA presence as counts. However, these counts are not counts per se: the magnitude of the counts is determined arbitrarily by the sequencing depth, not by the input material. Consequently, counts must undergo normalization prior to use. Conventional normalization methods require a set of assumptions: they assume that the majority of features are unchanged, and that all environments under study have the same carrying capacity for nucleotide synthesis. These assumptions are often untestable and may not hold when comparing heterogeneous samples (e.g., samples collected across distinct cancers or tissues). Instead, methods developed within the field of compositional data analysis offer a general solution that is assumption-free and valid for all data. In this manuscript, we synthesize the extant literature to provide a concise guide on how to apply compositional data analysis to NGS count data. In doing so, we review zero replacement, differential abundance analysis, and within-group and between-group coordination analysis. We then discuss how this pipeline can accommodate complex study design, facilitate the analysis of vertically and horizontally integrated data, including multiomics data, and further extend to single-cell sequencing data. In highlighting the limitations of total library size, effective library size, and spike-in normalizations, we propose the log-ratio transformation as a general solution to answer the question, “Relative to some important activity of the cell, what is changing?” Taken together, this manuscript establishes the first fully comprehensive analysis protocol that is suitable for any and all -omics data.
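
The point that count magnitude reflects sequencing depth rather than input material can be illustrated with a small sketch of the centered log-ratio (clr) transformation, one common log-ratio transformation. The counts below are hypothetical, and the helper names `closure` and `clr` are illustrative rather than taken from the manuscript.

```python
import math

def closure(counts):
    """Rescale positive parts so that they sum to 1 (proportions)."""
    total = sum(counts)
    return [c / total for c in counts]

def clr(parts):
    """Centered log-ratio: log of each part relative to the geometric mean."""
    logs = [math.log(p) for p in parts]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

# Hypothetical counts for one sample sequenced at two depths.
shallow = [10, 40, 50]
deep = [100, 400, 500]  # ten times the depth, same relative abundances

clr_shallow = clr(shallow)
clr_deep = clr(deep)

# The clr values agree (up to rounding) even though the raw counts differ
# tenfold, and closing the counts first makes no difference either.
clr_closed = clr(closure(shallow))
```

Because every part is expressed relative to the geometric mean of the same sample, the arbitrary total cancels out. Note that this sketch assumes strictly positive counts; zeros need to be replaced before taking logarithms.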

An application of compositional data analysis to multiomic time-series data

NAR Genomics and Bioinformatics ◽

10.1093/nargab/lqaa079 ◽

2020 ◽

Vol 2 (4) ◽

Cited By ~ 1

Author(s):

Laura Sisk-Hackworth ◽

Scott T Kelley

Keyword(s):

Data Analysis ◽

Time Series Data ◽

Compositional Data ◽

Series Data ◽

Compositional Data Analysis ◽

Metabolomics Data ◽

Normalization Methods ◽

Next Generation Sequencing Ngs ◽

Ngs Data ◽

Log Ratio

Abstract Compositional data analysis (CoDA) methods have increased in popularity as a new framework for analyzing next-generation sequencing (NGS) data. CoDA methods, such as the centered log-ratio (clr) transformation, adjust for the compositional nature of NGS counts, which is not addressed by traditional normalization methods. CoDA has only been sparsely applied to NGS data generated from microbial communities or to multiple ‘omics’ datasets. In this study, we applied CoDA methods to analyze NGS and untargeted metabolomic datasets obtained from bacterial and fungal communities. Specifically, we used clr transformation to reanalyze NGS amplicon and metabolomics data from a study investigating the effects of building material type, moisture and time on microbial and metabolomic diversity. Compared to analysis of untransformed data, analysis of clr-transformed data revealed novel relationships and stronger associations between sample conditions and microbial and metabolic community profiles.

Variable selection in microbiome compositional data analysis

NAR Genomics and Bioinformatics ◽

10.1093/nargab/lqaa029 ◽

2020 ◽

Vol 2 (2) ◽

Cited By ~ 2

Author(s):

Antoni Susin ◽

Yiwen Wang ◽

Kim-Anh Lê Cao ◽

M Luz Calle

Keyword(s):

Data Analysis ◽

Variable Selection ◽

Compositional Data ◽

Penalized Regression ◽

Compositional Data Analysis ◽

Forward Selection ◽

Computationally Efficient ◽

Parsimonious Model ◽

Microbiome Data ◽

Log Ratio

Abstract Though variable selection is one of the most relevant tasks in microbiome analysis, e.g. for the identification of microbial signatures, many studies still rely on methods that ignore the compositional nature of microbiome data. The applicability of compositional data analysis methods has been hampered by the availability of software and the difficulty in interpreting their results. This work is focused on three methods for variable selection that acknowledge the compositional structure of microbiome data: selbal, a forward selection approach for the identification of compositional balances, and clr-lasso and coda-lasso, two penalized regression models for compositional data analysis. This study highlights the link between these methods and brings out some limitations of the centered log-ratio transformation for variable selection. In particular, the fact that it is not subcompositionally consistent makes the microbial signatures obtained from clr-lasso not readily transferable. Coda-lasso is computationally efficient and suitable when the focus is the identification of the most associated microbial taxa. Selbal stands out when the goal is to obtain a parsimonious model with optimal prediction performance, but it is computationally greedy. We provide a reproducible vignette for the application of these methods that will enable researchers to fully leverage their potential in microbiome studies.
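
The remark that the centered log-ratio transformation is not subcompositionally consistent can be made concrete with a toy calculation. This is only a sketch with hypothetical abundances; it does not reproduce selbal, clr-lasso or coda-lasso.

```python
import math

def clr(parts):
    """Centered log-ratio: log of each part relative to the geometric mean."""
    logs = [math.log(p) for p in parts]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

# Hypothetical abundances of four taxa in one sample.
full = [20.0, 30.0, 50.0, 900.0]
# The same sample when the last taxon is not measured (a subcomposition).
sub = full[:3]

clr_full = clr(full)
clr_sub = clr(sub)

# The clr value of taxon 0 shifts when another taxon is dropped...
shift = clr_sub[0] - clr_full[0]

# ...whereas the log-ratio between taxa 0 and 1 is unchanged.
pair_full = math.log(full[0] / full[1])
pair_sub = math.log(sub[0] / sub[1])
```

The shift arises because the geometric mean in the clr denominator depends on which taxa are included, which is one way to see why a clr-based signature may not transfer to a study that measured a different set of taxa.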

Assessing Global Covid-19 Cases Data through Compositional Data Analysis (CoDa)

10.1101/2020.12.17.20248424 ◽

2020 ◽

Author(s):

Luis P.V. Braga ◽

Dina Feigenbaum

Keyword(s):

Data Analysis ◽

Compositional Data ◽

Compositional Data Analysis ◽

Discrete Groups ◽

Data Sets ◽

Cumulative Number ◽

Governmental Agencies ◽

Global Pandemic ◽

Number Of Patients ◽

Log Ratio

Abstract Background Covid-19 cases data pose an enormous challenge to any analysis. The evaluation of such a global pandemic requires matching reports that follow different procedures and even overcoming some countries’ censorship that restricts publications. Methods This work proposes a methodology that could assist future studies. Compositional Data Analysis (CoDa) is proposed as the proper approach because Covid-19 cases data are compositional in nature. Under this methodology, three attributes were selected for each country: cumulative number of deaths (D); cumulative number of recovered patients (R); present number of patients (A). Results After the operation called closure, with c=1, a ternary diagram and log-ratio plots, as well as compositional statistics, are presented. Cluster analysis is then applied, splitting the countries into discrete groups. Conclusions This methodology can also be applied to other data sets such as countries, cities, provinces or districts in order to help authorities and governmental agencies to improve their actions to fight against a pandemic.
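
The closure operation with c=1 mentioned in the abstract can be sketched as follows. The figures for the example country are invented for illustration and are not real data; the function name `closure` is likewise an assumption.

```python
import math

def closure(parts, c=1.0):
    """Rescale positive parts so that they sum to the constant c."""
    total = sum(parts)
    return [c * p / total for p in parts]

# Hypothetical country: cumulative deaths (D), cumulative recovered (R),
# present number of patients (A). These numbers are not real data.
d, r, a = closure([2_000, 60_000, 38_000], c=1.0)

# After closure the three parts sum to 1 and can be placed on a ternary
# diagram. A log-ratio such as log(D/R) is unaffected by the closure.
log_ratio_d_r = math.log(d / r)
```

Closure removes the overall number of cases, so countries of very different size become comparable through the relative shares of D, R and A.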

Compositional analysis of dietary patterns

Statistical Methods in Medical Research ◽

10.1177/0962280218790110 ◽

2018 ◽

Vol 28 (9) ◽

pp. 2834-2847 ◽

Cited By ~ 2

Author(s):

M Solans ◽

G Coenders ◽

R Marcos-Gragera ◽

A Castelló ◽

E Gràcia-Lavedan ◽

...

Keyword(s):

Data Analysis ◽

Dietary Patterns ◽

Pattern Analysis ◽

Compositional Data ◽

Compositional Analysis ◽

Principal Component ◽

Compositional Data Analysis ◽

Trade Offs ◽

Dietary Pattern Analysis ◽

The Relationship

Instead of looking at individual nutrients or foods, dietary pattern analysis has emerged as a promising approach to examine the relationship between diet and health outcomes. Despite dietary patterns being compositional (i.e. usually a higher intake of some foods implies that less of other foods are being consumed), compositional data analysis has not yet been applied in this setting. We describe three compositional data analysis approaches (compositional principal component analysis, balances and principal balances) that enable the extraction of dietary patterns by using control subjects from the Spanish multicase-control (MCC-Spain) study. In particular, principal balances overcome the limitations of purely data-driven or investigator-driven methods and present dietary patterns as trade-offs between eating more of some foods and less of others.
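
A balance expresses a trade-off between two groups of parts as a single log-ratio coordinate. The sketch below uses the standard definition of a balance (a scaled log-ratio of geometric means); the food groups and intakes are hypothetical and are not taken from the MCC-Spain study.

```python
import math

def geometric_mean(values):
    """Geometric mean of strictly positive values."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

def balance(numerator, denominator):
    """Balance between two groups of parts:
    sqrt(r*s / (r+s)) * ln(gmean(numerator) / gmean(denominator)),
    where r and s are the group sizes."""
    r, s = len(numerator), len(denominator)
    scale = math.sqrt(r * s / (r + s))
    return scale * math.log(geometric_mean(numerator) / geometric_mean(denominator))

# Hypothetical daily intakes in grams (illustrative only).
plant_foods = [300.0, 250.0]   # e.g. vegetables, fruit
meat_foods = [120.0, 80.0]     # e.g. red meat, processed meat

b = balance(plant_foods, meat_foods)

# Doubling every intake leaves the balance unchanged: only the trade-off
# between the two groups matters, not the total amount eaten.
b_doubled = balance([2 * x for x in plant_foods], [2 * x for x in meat_foods])
```

A positive value indicates relatively more of the numerator group, and swapping the two groups flips the sign.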

Zero problems with compositional data of physical behaviors: a comparison of three zero replacement methods

International Journal of Behavioral Nutrition and Physical Activity ◽

10.1186/s12966-020-01029-z ◽

2020 ◽

Vol 17 (1) ◽

Author(s):

Charlotte Lund Rasmussen ◽

Javier Palarea-Albaladejo ◽

Melker Staffan Johansson ◽

Patrick Crowley ◽

Matthew Leigh Stevens ◽

...

Keyword(s):

Data Analysis ◽

Time Use ◽

Compositional Data ◽

Real Data ◽

Compositional Data Analysis ◽

Accelerometer Data ◽

Relative Variation ◽

Physical Behavior ◽

Complete Dataset ◽

Log Ratio

Abstract Background Researchers applying compositional data analysis to time-use data (e.g., time spent in physical behaviors) often face the problem of zeros, that is, recordings of zero time spent in any of the studied behaviors. Zeros hinder the application of compositional data analysis because the analysis is based on log-ratios. One way to overcome this challenge is to replace the zeros with sensible small values. The aim of this study was to compare the performance of three existing replacement methods used within physical behavior time-use epidemiology: simple replacement, multiplicative replacement, and log-ratio expectation-maximization (lrEM) algorithm. Moreover, we assessed the consequence of choosing replacement values higher than the lowest observed value for a given behavior. Method Using a complete dataset based on accelerometer data from 1310 Danish adults as reference, multiple datasets were simulated across six scenarios of zeros (5–30% zeros in 5% increments). Moreover, four examples were produced based on real data, in which 10 and 20% zeros were imposed and replaced using a replacement value of 0.5 min, 65% of the observation threshold, or an estimated value below the observation threshold. For the simulation study and the examples, the zeros were replaced using the three replacement methods and the degree of distortion introduced was assessed by comparison with the complete dataset. Results The lrEM method outperformed the other replacement methods as it had the smallest influence on the structure of relative variation of the datasets. Both the simple and multiplicative replacements introduced higher distortion, particularly in scenarios with more than 10% zeros; although the latter, like the lrEM, does preserve the ratios between behaviors with no zeros. The examples revealed that replacing zeros with a value higher than the observation threshold severely affected the structure of relative variation.
Conclusions Given our findings, we encourage the use of replacement methods that preserve the relative structure of physical behavior data, as achieved by the multiplicative and lrEM replacements, and to avoid simple replacement. Moreover, we do not recommend replacing zeros with values higher than the lowest observed value for a behavior.
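
Of the three methods compared, multiplicative replacement is simple enough to sketch in a few lines. The version below follows the usual formulation (zeros become a small value delta and the non-zero parts are shrunk so that the total is kept); the example day is hypothetical, the 0.5 min value is one of the replacement values named in the abstract, and the lrEM algorithm is not shown.

```python
def multiplicative_replacement(parts, delta, total=None):
    """Replace each zero with delta and shrink the non-zero parts by a common
    factor so that the composition still sums to the original total."""
    if total is None:
        total = sum(parts)
    n_zeros = sum(1 for p in parts if p == 0)
    shrink = 1.0 - n_zeros * delta / total
    return [delta if p == 0 else p * shrink for p in parts]

# Hypothetical 24-h day in minutes: sleep, sedentary, light activity, and
# moderate-to-vigorous activity recorded as zero.
day = [480.0, 720.0, 240.0, 0.0]

replaced = multiplicative_replacement(day, delta=0.5)

# Ratio between two behaviors that had no zeros, before and after replacement.
ratio_before = day[0] / day[1]
ratio_after = replaced[0] / replaced[1]
```

Since all non-zero parts are multiplied by the same factor, the ratios among them are preserved, which is the property the abstract attributes to the multiplicative and lrEM replacements.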

Compositional data analysis for physical activity, sedentary time and sleep research

Statistical Methods in Medical Research ◽

10.1177/0962280217710835 ◽

2017 ◽

Vol 27 (12) ◽

pp. 3726-3738 ◽

Cited By ~ 91

Author(s):

Dorothea Dumuid ◽

Tyman E Stanford ◽

Josep-Antoni Martin-Fernández ◽

Željko Pedišić ◽

Carol A Maher ◽

...

Keyword(s):

Physical Activity ◽

Data Analysis ◽

Health Effects ◽

Daily Activity ◽

Sedentary Time ◽

Compositional Data ◽

Time Budget ◽

International Study ◽

Compositional Data Analysis ◽

Log Ratio

The health effects of daily activity behaviours (physical activity, sedentary time and sleep) are widely studied. While previous research has largely examined activity behaviours in isolation, recent studies have adjusted for multiple behaviours. However, the inclusion of all activity behaviours in traditional multivariate analyses has not been possible due to the perfect multicollinearity of 24-h time budget data. The ensuing lack of adjustment for known effects on the outcome undermines the validity of study findings. We describe a statistical approach that enables the inclusion of all daily activity behaviours, based on the principles of compositional data analysis. Using data from the International Study of Childhood Obesity, Lifestyle and the Environment, we demonstrate the application of compositional multiple linear regression to estimate adiposity from children’s daily activity behaviours expressed as isometric log-ratio coordinates. We present a novel method for predicting change in a continuous outcome based on relative changes within a composition, and for calculating associated confidence intervals to allow for statistical inference. The compositional data analysis presented overcomes the lack of adjustment that has plagued traditional statistical methods in the field, and provides robust and reliable insights into the health effects of daily activity behaviours.
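
Isometric log-ratio (ilr) coordinates turn a D-part composition into D-1 unconstrained values, which is what removes the perfect multicollinearity of a 24-h time budget. The sketch below uses pivot coordinates, one common ilr basis; which basis the authors used is not stated in the abstract, and the example day is hypothetical.

```python
import math

def pivot_ilr(parts):
    """Pivot (ilr) coordinates of a composition with strictly positive parts.
    Coordinate i contrasts part i with the geometric mean of the parts after it."""
    logs = [math.log(p) for p in parts]
    coords = []
    for i in range(len(parts) - 1):
        rest = logs[i + 1:]
        k = len(rest)
        coords.append(math.sqrt(k / (k + 1)) * (logs[i] - sum(rest) / k))
    return coords

def clr(parts):
    """Centered log-ratio, used here only to check the isometry property."""
    logs = [math.log(p) for p in parts]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

# Hypothetical day in minutes: sleep, sedentary, light activity, MVPA.
day_minutes = [540.0, 600.0, 240.0, 60.0]
day_hours = [m / 60.0 for m in day_minutes]

z = pivot_ilr(day_minutes)          # three coordinates for four behaviors
z_hours = pivot_ilr(day_hours)      # same values: units do not matter

# Isometry: squared length in ilr space equals squared length in clr space.
norm_ilr = sum(v * v for v in z)
norm_clr = sum(v * v for v in clr(day_minutes))
```

The D-1 coordinates can then enter an ordinary multiple linear regression as covariates; the first coordinate describes the first behavior relative to all remaining ones.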

Compositional uncertainty should not be ignored in high-throughput sequencing data analysis

Austrian Journal of Statistics ◽

10.17713/ajs.v45i4.122 ◽

2016 ◽

Vol 45 (4) ◽

pp. 73-87 ◽

Cited By ~ 32

Author(s):

Gregory Brian Gloor ◽

Jean M. Macklaim ◽

Michael Vu ◽

Andrew D. Fernandes

Keyword(s):

Data Analysis ◽

High Throughput ◽

False Positive ◽

High Throughput Sequencing ◽

In Vitro Selection ◽

Compositional Data ◽

Dirichlet Distribution ◽

Compositional Data Analysis ◽

Compositional Approach ◽

Log Ratio

High throughput sequencing generates sparse compositional data, yet these datasets are rarely analyzed using a compositional approach. In addition, the variation inherent in these datasets is rarely acknowledged, but ignoring it can result in many false positive inferences. We demonstrate that examination of point estimates of the data can result in false positive results, even with appropriate zero replacement approaches, using an in vitro selection dataset with an outside standard of truth. The variation inherent in real high-throughput sequencing datasets is demonstrated, and we show that this variation can be approximated, and hence accounted for, by Monte-Carlo sampling from the Dirichlet distribution. This approximation is problematic when used by itself, but becomes useful when coupled with a log-ratio approach commonly used in compositional data analysis. Thus, the approach illustrated here that merges Bayesian estimation with principles of compositional data analysis should be generally useful for high-dimensional count compositional data of the type generated by high throughput sequencing.
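
The idea of coupling Dirichlet Monte-Carlo sampling with a log-ratio transformation can be sketched with the standard library alone. The counts, the number of draws and the prior of 0.5 added to each count are assumptions made for illustration; they are not taken from the abstract.

```python
import math
import random

def clr(parts):
    """Centered log-ratio: log of each part relative to the geometric mean."""
    logs = [math.log(p) for p in parts]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

def dirichlet_sample(alphas, rng):
    """One Dirichlet draw, built from independent gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [g / total for g in draws]

def monte_carlo_clr(counts, n_draws, prior=0.5, seed=0):
    """Draw proportions from Dirichlet(counts + prior) and clr-transform each
    draw. The prior value is an assumption of this sketch."""
    rng = random.Random(seed)
    alphas = [c + prior for c in counts]
    return [clr(dirichlet_sample(alphas, rng)) for _ in range(n_draws)]

# Hypothetical counts for four features in one sample, including a zero.
counts = [0, 3, 120, 877]
instances = monte_carlo_clr(counts, n_draws=200)

# Range of clr values per feature across the Monte-Carlo instances.
spread = [max(col) - min(col) for col in zip(*instances)]
```

Every draw is strictly positive, so the zero count needs no separate replacement step, and the spread across instances gives a sense of how uncertain each feature's clr value is: low-count features vary far more than high-count ones.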

Physical Behaviours in Brazilian Office Workers Working from Home during the COVID-19 Pandemic, Compared to before the Pandemic: A Compositional Data Analysis

International Journal of Environmental Research and Public Health ◽

10.3390/ijerph18126278 ◽

2021 ◽

Vol 18 (12) ◽

pp. 6278

Author(s):

Luiz Augusto Brusaca ◽

Dechristian França Barbieri ◽

Svend Erik Mathiassen ◽

Andreas Holtermann ◽

Ana Beatriz Oliveira

Keyword(s):

Data Analysis ◽

Repeated Measures ◽

Compositional Data ◽

Office Workers ◽

Compositional Data Analysis ◽

Working At Home ◽

Work From Home ◽

Time In Bed ◽

Objectively Measured ◽

Log Ratio

Work from home has increased greatly during the COVID-19 pandemic, and concerns have been raised that this would change physical behaviours. In the present study, 11 Brazilian office workers (five women, six men; mean [SD] age 39.3 [9.6] years) wore two triaxial accelerometers fixed on the upper back and right thigh continuously for five days, including a weekend, before COVID-19 (September 2019), and again while working at home during COVID-19 (July 2020). We determined time used in five behaviours: sedentary, standing, light physical activity (LPA), moderate-to-vigorous activity (MVPA), and time-in-bed. Data on these behaviours were processed using Compositional Data Analysis, and behaviours observed pre-COVID19 and during-COVID19 were compared using repeated-measures MANOVA. On workdays during-COVID19, participants spent 667 min sedentary, 176 standing, 74 LPA, 51 MVPA and 472 time-in-bed; corresponding numbers pre-COVID were 689, 180, 81, 72 and 418 min. Tests confirmed that less time was spent in bed pre-COVID19 (log-ratio −0.12 [95% CI −0.19; −0.08]) and more time in MVPA (log-ratio 0.35, [95% CI 0.08; 0.70]). Behaviours during the weekend changed only marginally. While small, this study is the first to report objectively measured physical behaviours during workdays as well as weekends in the same subjects before and during the COVID-19 pandemic.

LOG-RATIO COMPOSITIONAL DATA ANALYSIS IN ARCHAEOMETRY*

Archaeometry ◽

10.1111/j.1475-4754.2006.00270.x ◽

2006 ◽

Vol 48 (3) ◽

pp. 511-531 ◽

Cited By ~ 52

Author(s):

M. J. BAXTER ◽

I. C. FREESTONE

Keyword(s):

Data Analysis ◽

Compositional Data ◽

Compositional Data Analysis ◽

Log Ratio
