Zero problems with compositional data of physical behaviors: a comparison of three zero replacement methods

Author(s):  
Charlotte Lund Rasmussen ◽  
Javier Palarea-Albaladejo ◽  
Melker Staffan Johansson ◽  
Patrick Crowley ◽  
Matthew Leigh Stevens ◽  
...  

Abstract Background Researchers applying compositional data analysis to time-use data (e.g., time spent in physical behaviors) often face the problem of zeros, that is, recordings of zero time spent in any of the studied behaviors. Zeros hinder the application of compositional data analysis because the analysis is based on log-ratios. One way to overcome this challenge is to replace the zeros with sensible small values. The aim of this study was to compare the performance of three existing replacement methods used within physical behavior time-use epidemiology: simple replacement, multiplicative replacement, and the log-ratio expectation-maximization (lrEM) algorithm. Moreover, we assessed the consequence of choosing replacement values higher than the lowest observed value for a given behavior.
Method Using a complete dataset based on accelerometer data from 1310 Danish adults as reference, multiple datasets were simulated across six scenarios of zeros (5–30% zeros in 5% increments). Moreover, four examples were produced based on real data, in which 10% and 20% zeros were imposed and replaced using a replacement value of 0.5 min, 65% of the observation threshold, or an estimated value below the observation threshold. For the simulation study and the examples, the zeros were replaced using the three replacement methods, and the degree of distortion introduced was assessed by comparison with the complete dataset.
Results The lrEM method outperformed the other replacement methods, as it had the smallest influence on the structure of relative variation of the datasets. Both the simple and multiplicative replacements introduced higher distortion, particularly in scenarios with more than 10% zeros, although the latter, like the lrEM, does preserve the ratios between behaviors with no zeros. The examples revealed that replacing zeros with a value higher than the observation threshold severely affected the structure of relative variation.
Conclusions Given our findings, we encourage the use of replacement methods that preserve the relative structure of physical behavior data, as achieved by the multiplicative and lrEM replacements, and recommend avoiding simple replacement. Moreover, we do not recommend replacing zeros with values higher than the lowest observed value for a behavior.
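
As an illustration of the multiplicative replacement mentioned in the abstract above, the following is a minimal sketch, not the authors' implementation. It assumes the standard formulation in which each zero is set to a small value delta and the non-zero parts are shrunk by a common factor so that the total is unchanged. The function name, the example day, and the choice of delta = 0.5 min are illustrative assumptions.

```python
def multiplicative_replacement(parts, delta):
    """Replace zeros with `delta` and rescale non-zero parts multiplicatively.

    Every non-zero part is multiplied by the same factor, so the total is
    preserved and so are the ratios between non-zero parts.
    """
    total = sum(parts)
    n_zeros = sum(1 for p in parts if p == 0)
    shrink = 1.0 - (n_zeros * delta) / total
    if shrink <= 0:
        raise ValueError("delta is too large for this composition")
    return [delta if p == 0 else p * shrink for p in parts]


# Hypothetical 24-h day in minutes: sleep, sedentary, light PA, MVPA (zero).
day = [480.0, 600.0, 360.0, 0.0]
replaced = multiplicative_replacement(day, delta=0.5)

ratio_before = day[0] / day[1]
ratio_after = replaced[0] / replaced[1]
```

In this sketch the sleep-to-sedentary ratio is the same before and after replacement, which is the property the abstract attributes to the multiplicative and lrEM methods.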

Biometrika ◽  
2021 ◽  
Author(s):  
Pixu Shi ◽  
Yuchen Zhou ◽  
Anru R Zhang

Abstract In microbiome and genomic studies, the regression of compositional data has been a crucial tool for identifying microbial taxa or genes that are associated with clinical phenotypes. To account for the variation in sequencing depth, the classic log-contrast model is often used where read counts are normalized into compositions. However, zero read counts and the randomness in covariates remain critical issues. In this article, we introduce a surprisingly simple, interpretable, and efficient method for the estimation of compositional data regression through the lens of a novel high-dimensional log-error-in-variable regression model. The proposed method both provides corrections on sequencing data with possible overdispersion and avoids any subjective imputation of zero read counts. We provide theoretical justifications with matching upper and lower bounds for the estimation error. The merit of the procedure is illustrated through real data analysis and simulation studies.


Nutrients ◽  
2020 ◽  
Vol 12 (8) ◽  
pp. 2280
Author(s):  
Chloe Clifford Astbury ◽  
Louise Foley ◽  
Tarra L. Penney ◽  
Jean Adams

Background: Increased time spent on home food preparation is associated with higher diet quality, but a lack of time is often reported as a barrier to this practice. We compared time use in individuals who do more versus less foodwork (tasks required to feed ourselves and our households, including home food preparation). Methods: Cross-sectional analysis of the UK Time Use Survey 2014–15, participants aged 16+ (N = 6143). Time use over 24 h was attributed to seven compositional parts: personal care; sleep; eating; physical activity; leisure screen time; work (paid and unpaid); and socialising and hobbies. Participants were categorised as doing no, ‘some’ (<70 min), or ‘more’ foodwork (≥70 min). We used compositional data analysis to test whether time-use composition varied between these participant groups, determine which of the parts varied between groups, and test for differences across population subgroups. Results: Participants who spent more time on foodwork spent less time on sleep, eating, and personal care and more time on work. Women who did more foodwork spent less time on personal care, socialising, and hobbies, which was not the case for men. Conclusion: Those who seek to encourage home food preparation should be aware of the associations between foodwork and other activities and design their interventions to guard against unintended consequences.


2020 ◽  
Vol 2 (2) ◽  
Author(s):  
Antoni Susin ◽  
Yiwen Wang ◽  
Kim-Anh Lê Cao ◽  
M Luz Calle

Abstract Though variable selection is one of the most relevant tasks in microbiome analysis, e.g. for the identification of microbial signatures, many studies still rely on methods that ignore the compositional nature of microbiome data. The applicability of compositional data analysis methods has been hampered by the limited availability of software and the difficulty in interpreting their results. This work is focused on three methods for variable selection that acknowledge the compositional structure of microbiome data: selbal, a forward selection approach for the identification of compositional balances, and clr-lasso and coda-lasso, two penalized regression models for compositional data analysis. This study highlights the link between these methods and brings out some limitations of the centered log-ratio transformation for variable selection. In particular, the fact that it is not subcompositionally consistent makes the microbial signatures obtained from clr-lasso not readily transferable. Coda-lasso is computationally efficient and suitable when the focus is the identification of the most associated microbial taxa. Selbal stands out when the goal is to obtain a parsimonious model with optimal prediction performance, but it is computationally greedy. We provide a reproducible vignette for the application of these methods that will enable researchers to fully leverage their potential in microbiome studies.


2020 ◽  
Author(s):  
Luis P.V. Braga ◽  
Dina Feigenbaum

Abstract Background Covid-19 cases data pose an enormous challenge to any analysis. The evaluation of such a global pandemic requires matching reports that follow different procedures and even overcoming some countries’ censorship that restricts publications. Methods This work proposes a methodology that could assist future studies. Compositional Data Analysis (CoDa) is proposed as the proper approach, as Covid-19 cases data are compositional in nature. Under this methodology, three attributes were selected for each country: cumulative number of deaths (D); cumulative number of recovered patients (R); present number of patients (A). Results After the operation called closure, with c=1, a ternary diagram and log-ratio plots, as well as compositional statistics, are presented. Cluster analysis is then applied, splitting the countries into discrete groups. Conclusions This methodology can also be applied to other data sets such as countries, cities, provinces or districts in order to help authorities and governmental agencies to improve their actions to fight against a pandemic.
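
The closure operation with c = 1 referred to in the abstract above can be sketched as follows. This is an illustration under assumptions, not the authors' code: the function name and the (D, R, A) counts are hypothetical.

```python
import math


def closure(parts, c=1.0):
    """Rescale positive parts so that they sum to the constant `c`."""
    total = sum(parts)
    if total <= 0:
        raise ValueError("parts must have a positive total")
    return [c * p / total for p in parts]


# Hypothetical counts for one country: deaths (D), recovered (R), active (A).
counts = [50, 700, 250]
shares = closure(counts, c=1.0)

# A pairwise log-ratio is unaffected by closure, which is why log-ratio
# plots can be drawn from either the raw counts or the closed shares.
log_ratio_raw = math.log(counts[0] / counts[1])
log_ratio_closed = math.log(shares[0] / shares[1])
```

The closed shares are the coordinates that would be plotted in a ternary diagram of D, R and A.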


Author(s):  
Verónica Cabanas-Sánchez ◽  
Irene Esteban-Cornejo ◽  
Esther García-Esquinas ◽  
Rosario Ortolá ◽  
Ignacio Ara ◽  
...  

Abstract Background Most studies on the effects of sleep, sedentary behavior (SB), and physical activity (PA) on mental health did not account for the intrinsically compositional nature of the time spent in several behaviors. Thus, we examined the cross-sectional and prospective associations of device-measured compositional time in sleep, SB, light PA (LPA) and moderate-to-vigorous PA (MVPA) with depression symptoms, loneliness, happiness, and global mental health in older people (≥ 65 years).
Methods Data were taken from the Seniors-ENRICA-2 study, with assessments in 2015–2017 (wave 0) and 2018–2019 (wave 1). Time spent in sleep, SB, LPA and MVPA was assessed by wrist-worn accelerometers. Depression symptoms, loneliness, happiness, and global mental health were self-reported using validated questionnaires. Analyses were performed using a compositional data analysis (CoDA) paradigm and adjusted for potential confounders.
Results In cross-sectional analyses at wave 0 (n = 2489), time-use composition as a whole was associated with depression and happiness (all p < 0.01). The time spent in MVPA relative to other behaviors was beneficially associated with depression (γ = -0.397, p < 0.001), loneliness (γ = -0.124, p = 0.017) and happiness (γ = 0.243, p < 0.001). Hypothetically replacing 30 min of sleep, SB or LPA with MVPA was beneficially related, cross-sectionally, to depression (effect size [ES] ranged from -0.326 to -0.246), loneliness (ES ranged from -0.118 to -0.073), and happiness (ES ranged from 0.152 to 0.172). In prospective analyses (n = 1679), MVPA relative to other behaviors at baseline was associated with favorable changes in global mental health (γ = 0.892, p = 0.049). We observed a beneficial prospective effect on global mental health when 30 min of sleep (ES = 0.521), SB (ES = 0.479) or LPA (ES = 0.755) was theoretically replaced with MVPA.
Conclusions MVPA was cross-sectionally related to reduced depression symptoms and loneliness and an elevated level of happiness, and prospectively related to enhanced global mental health. Compositional isotemporal analyses showed that hypothetically replacing sleep, SB or LPA with MVPA could result in modest but significant improvements in mental health indicators. Our findings add evidence to the emerging body of research on 24-h time-use and health using CoDA and suggest an integrated role of daily behaviors in mental health in older people.


GigaScience ◽  
2019 ◽  
Vol 8 (9) ◽  
Author(s):  
Thomas P Quinn ◽  
Ionas Erb ◽  
Greg Gloor ◽  
Cedric Notredame ◽  
Mark F Richardson ◽  
...  

Abstract Background Next-generation sequencing (NGS) has made it possible to determine the sequence and relative abundance of all nucleotides in a biological or environmental sample. A cornerstone of NGS is the quantification of RNA or DNA presence as counts. However, these counts are not counts per se: their magnitude is determined arbitrarily by the sequencing depth, not by the input material. Consequently, counts must undergo normalization prior to use. Conventional normalization methods require a set of assumptions: they assume that the majority of features are unchanged and that all environments under study have the same carrying capacity for nucleotide synthesis. These assumptions are often untestable and may not hold when heterogeneous samples are compared. Results Methods developed within the field of compositional data analysis offer a general solution that is assumption-free and valid for all data. Herein, we synthesize the extant literature to provide a concise guide on how to apply compositional data analysis to NGS count data. Conclusions In highlighting the limitations of total library size, effective library size, and spike-in normalizations, we propose the log-ratio transformation as a general solution to answer the question, “Relative to some important activity of the cell, what is changing?”


2020 ◽  
Vol 2 (4) ◽  
Author(s):  
Laura Sisk-Hackworth ◽  
Scott T Kelley

Abstract Compositional data analysis (CoDA) methods have increased in popularity as a new framework for analyzing next-generation sequencing (NGS) data. CoDA methods, such as the centered log-ratio (clr) transformation, adjust for the compositional nature of NGS counts, which is not addressed by traditional normalization methods. CoDA has only been sparsely applied to NGS data generated from microbial communities or to multiple ‘omics’ datasets. In this study, we applied CoDA methods to analyze NGS and untargeted metabolomic datasets obtained from bacterial and fungal communities. Specifically, we used clr transformation to reanalyze NGS amplicon and metabolomics data from a study investigating the effects of building material type, moisture and time on microbial and metabolomic diversity. Compared to analysis of untransformed data, analysis of clr-transformed data revealed novel relationships and stronger associations between sample conditions and microbial and metabolic community profiles.
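
For readers unfamiliar with the centered log-ratio (clr) transformation named in the abstract above, here is a minimal sketch using its standard definition: the log of each part minus the mean of the logs. It is not the authors' pipeline. The function name and the example counts are assumptions, and zeros would need to be replaced beforehand because the logarithm of zero is undefined.

```python
import math


def clr(parts):
    """Centered log-ratio: log of each part minus the mean of the logs."""
    if any(p <= 0 for p in parts):
        raise ValueError("clr requires strictly positive parts")
    logs = [math.log(p) for p in parts]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


# Hypothetical read counts for three taxa in one sample.
sample = [1.0, 10.0, 100.0]
transformed = clr(sample)

# The same sample sequenced twice as deeply gives the same clr values,
# which is how the transform removes the arbitrary sequencing depth.
deeper = clr([2.0 * p for p in sample])
```

The clr values of a sample always sum to zero, so the transformed parts describe each feature relative to the geometric mean of the sample.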


Author(s):  
Thomas P. Quinn ◽  
Ionas Erb

Abstract In the health sciences, many data sets produced by next-generation sequencing (NGS) only contain relative information because of biological and technical factors that limit the total number of nucleotides observed for a given sample. Because the components are mutually dependent, it is not possible to interpret any of them in isolation, at least without normalization. The field of compositional data analysis (CoDA) has emerged with alternative methods for relative data based on log-ratio transforms. However, NGS data often contain many more features than samples, and thus require creative new ways to reduce the dimensionality of the data without sacrificing interpretability. The summation of parts, called amalgamation, is a practical way of reducing dimensionality, but can introduce a non-linear distortion to the data. We exploit this non-linearity to propose a powerful yet interpretable dimension reduction method. In this report, we present data-driven amalgamation as a new method and conceptual framework for reducing the dimensionality of compositional data. Unlike expert-driven amalgamation, which requires prior domain knowledge, our data-driven amalgamation method uses a genetic algorithm to answer the question, “What is the best way to amalgamate the data to achieve the user-defined objective?”. We present a user-friendly R package, called amalgam, that can quickly find the optimal amalgamation to (a) preserve the distance between samples, or (b) classify samples as diseased or not. Our benchmark on 13 real data sets confirms that these amalgamations compete with the state-of-the-art unsupervised and supervised dimension reduction methods in terms of performance, but result in new variables that are much easier to understand: they are groups of features added together.
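
The amalgamation step described in the abstract above, summing parts into groups, can be sketched as below. This shows only the summation for a grouping that is already given. It does not reproduce the genetic algorithm or the API of the amalgam R package, and the function name, grouping, and proportions are hypothetical.

```python
def amalgamate(parts, groups):
    """Sum the parts listed under each group name.

    `groups` maps a new variable name to the indices of the parts it absorbs.
    """
    return {name: sum(parts[i] for i in indices)
            for name, indices in groups.items()}


# Hypothetical relative abundances of four features in one sample.
sample = [0.1, 0.2, 0.3, 0.4]

# Hypothetical grouping; the abstract's method would search for this
# assignment rather than take it as given.
grouping = {"group_a": [0, 1], "group_b": [2, 3]}
reduced = amalgamate(sample, grouping)
```

Each new variable is simply a group of features added together, which is what makes the reduced data easy to interpret.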


2018 ◽  
Author(s):  
Thomas P. Quinn ◽  
Ionas Erb ◽  
Greg Gloor ◽  
Cedric Notredame ◽  
Mark F. Richardson ◽  
...  

Abstract Next-generation sequencing (NGS) has made it possible to determine the sequence and relative abundance of all nucleotides in a biological or environmental sample. Today, NGS is routinely used to understand many important topics in biology from human disease to microorganism diversity. A cornerstone of NGS is the quantification of RNA or DNA presence as counts. However, these counts are not counts per se: the magnitude of the counts is determined arbitrarily by the sequencing depth, not by the input material. Consequently, counts must undergo normalization prior to use. Conventional normalization methods require a set of assumptions: they assume that the majority of features are unchanged, and that all environments under study have the same carrying capacity for nucleotide synthesis. These assumptions are often untestable and may not hold when comparing heterogeneous samples (e.g., samples collected across distinct cancers or tissues). Instead, methods developed within the field of compositional data analysis offer a general solution that is assumption-free and valid for all data. In this manuscript, we synthesize the extant literature to provide a concise guide on how to apply compositional data analysis to NGS count data. In doing so, we review zero replacement, differential abundance analysis, and within-group and between-group coordination analysis. We then discuss how this pipeline can accommodate complex study design, facilitate the analysis of vertically and horizontally integrated data, including multiomics data, and further extend to single-cell sequencing data. In highlighting the limitations of total library size, effective library size, and spike-in normalizations, we propose the log-ratio transformation as a general solution to answer the question, “Relative to some important activity of the cell, what is changing?”.
Taken together, this manuscript establishes the first fully comprehensive analysis protocol that is suitable for any and all -omics data.

