Compositional uncertainty should not be ignored in high-throughput sequencing data analysis

2016 ◽  
Vol 45 (4) ◽  
pp. 73-87 ◽  
Author(s):  
Gregory Brian Gloor ◽  
Jean M. Macklaim ◽  
Michael Vu ◽  
Andrew D. Fernandes

High-throughput sequencing generates sparse compositional data, yet these datasets are rarely analyzed using a compositional approach. In addition, the variation inherent in these datasets is rarely acknowledged, but ignoring it can result in many false positive inferences. We demonstrate that examination of point estimates of the data can result in false positive results, even with appropriate zero replacement approaches, using an in vitro selection dataset with an outside standard of truth. The variation inherent in real high-throughput sequencing datasets is demonstrated, and we show that this variation can be approximated, and hence accounted for, by Monte-Carlo sampling from the Dirichlet distribution. This approximation is problematic when used by itself, but becomes useful when coupled with a log-ratio approach commonly used in compositional data analysis. Thus, the approach illustrated here, which merges Bayesian estimation with principles of compositional data analysis, should be generally useful for high-dimensional count compositional data of the type generated by high-throughput sequencing.
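The idea of coupling Dirichlet Monte-Carlo sampling with a log-ratio transformation can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: the counts are made up, and the prior of 0.5, the 128 instances, and the function names are assumptions chosen for the example.

```python
import math
import random


def clr(proportions):
    """Centered log-ratio: log of each part minus the mean log of all parts."""
    logs = [math.log(p) for p in proportions]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


def dirichlet_clr_instances(counts, n_instances=128, prior=0.5, seed=0):
    """Draw Monte-Carlo instances from Dirichlet(counts + prior), then clr each.

    A Dirichlet draw is obtained by normalizing independent Gamma(alpha_i, 1)
    variates. The prior (an assumed value) keeps zero counts strictly positive.
    """
    rng = random.Random(seed)
    instances = []
    for _ in range(n_instances):
        gammas = [rng.gammavariate(c + prior, 1.0) for c in counts]
        total = sum(gammas)
        instances.append(clr([g / total for g in gammas]))
    return instances


# Made-up counts for one sample, including a zero.
counts = [0, 3, 250, 1200, 47]
instances = dirichlet_clr_instances(counts)

# Range of clr values per feature across the Monte-Carlo instances.
for i, c in enumerate(counts):
    values = [inst[i] for inst in instances]
    print(c, round(min(values), 2), round(max(values), 2))
```

Comparing the printed ranges shows how a single point estimate hides the uncertainty attached to low-count features, which is the variation the abstract argues should be carried into downstream inference.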

2016 ◽  
Vol 62 (8) ◽  
pp. 692-703 ◽  
Author(s):  
Gregory B. Gloor ◽  
Gregor Reid

A workshop held at the 2015 annual meeting of the Canadian Society of Microbiologists highlighted compositional data analysis methods and the importance of exploratory data analysis for the analysis of microbiome data sets generated by high-throughput DNA sequencing. A summary of the content of that workshop, a review of new methods of analysis, and information on the importance of careful analyses are presented herein. The workshop focussed on explaining the rationale behind the use of compositional data analysis, and a demonstration of these methods for the examination of 2 microbiome data sets. A clear understanding of bioinformatics methodologies and the type of data being analyzed is essential, given the growing number of studies uncovering the critical role of the microbiome in health and disease and the need to understand alterations to its composition and function following intervention with fecal transplant, probiotics, diet, and pharmaceutical agents.


2020 ◽  
Author(s):  
Jacob Bien ◽  
Xiaohan Yan ◽  
Léo Simpson ◽  
Christian L. Müller

Abstract Modern high-throughput sequencing technologies provide low-cost microbiome survey data across all habitats of life at unprecedented scale. At the most granular level, the primary data consist of sparse counts of amplicon sequence variants or operational taxonomic units that are associated with taxonomic and phylogenetic group information. In this contribution, we leverage the hierarchical structure of amplicon data and propose a data-driven, parameter-free, and scalable tree-guided aggregation framework to associate microbial subcompositions with response variables of interest. The excess number of zero or low count measurements at the read level forces traditional microbiome data analysis workflows to remove rare sequencing variants or group them by a fixed taxonomic rank, such as genus or phylum, or by phylogenetic similarity. By contrast, our framework, which we call trac (tree-aggregation of compositional data), learns data-adaptive taxon aggregation levels for predictive modeling, making user-defined aggregation obsolete while simultaneously integrating seamlessly into the compositional data analysis framework. We illustrate the versatility of our framework in the context of large-scale regression problems in human-gut, soil, and marine microbial ecosystems. We posit that the inferred aggregation levels provide highly interpretable taxon groupings that can help microbial ecologists gain insights into the structure and functioning of the underlying ecosystem of interest.


2019 ◽  
Vol 76 (Suppl 1) ◽  
pp. A42.3-A43
Author(s):  
Suzanne Merkus ◽  
Lars-Kristian Lunde ◽  
Markus Koch ◽  
Stein Knardahl ◽  
Kaj Bo Veiersted

Purpose: To use a compositional data analysis approach and objective exposure assessments to study the association between the duration of arm elevation and the course of neck and shoulder pain (NSP) during a 2-year follow-up in physically demanding occupations. Methods: Construction (n=59) and healthcare (n=59) employees wore accelerometers on the dominant upper arm during a full working day at baseline. Objective assessments using accelerometers address biases found in previous studies that estimated duration of arm elevation from self-reports. At baseline and every 6 months for two years, participants reported on NSP (scale 0–3). Duration of arm elevation within predefined ranges (<30°; 30–60°; ≥60°) formed the parts of the composition. Compositional data analysis is a new statistical analysis method within occupational health, and it is the correct way of analysing data with a compositional nature. The associations between the relative importance of the duration within the levels of arm elevation and the course of NSP during the 2-year follow-up were estimated with compositional linear mixed models, adjusted for confounders. Results: In non-adjusted analyses, only duration of arm elevation <30° was associated with NSP at baseline (β = 0.37; p=0.015). Duration of arm elevation <30° tended to be associated with an improvement in NSP over the 2-year follow-up (<30°*time (β=-0.07; p=0.089)). Neither duration at 30–60° nor at ≥60° was associated with the course of NSP during follow-up. After adjusting for confounders, none of the durations of arm elevation were associated with the course of NSP over the 2-year period (<30° and NSP (β = 0.20; p=0.126); <30°*time (β=-0.06; p=0.097)). Conclusion: Among construction and healthcare personnel, duration of working in awkward arm elevation postures was not associated with the course of NSP over a 2-year period. Arm elevation alone, without considering force exertion, may not be sufficient to influence the course of NSP.


2020 ◽  
Vol 2 (2) ◽  
Author(s):  
Antoni Susin ◽  
Yiwen Wang ◽  
Kim-Anh Lê Cao ◽  
M Luz Calle

Abstract Though variable selection is one of the most relevant tasks in microbiome analysis, e.g. for the identification of microbial signatures, many studies still rely on methods that ignore the compositional nature of microbiome data. The applicability of compositional data analysis methods has been hampered by the availability of software and the difficulty in interpreting their results. This work is focused on three methods for variable selection that acknowledge the compositional structure of microbiome data: selbal, a forward selection approach for the identification of compositional balances, and clr-lasso and coda-lasso, two penalized regression models for compositional data analysis. This study highlights the link between these methods and brings out some limitations of the centered log-ratio transformation for variable selection. In particular, the fact that it is not subcompositionally consistent makes the microbial signatures obtained from clr-lasso not readily transferable. Coda-lasso is computationally efficient and suitable when the focus is the identification of the most associated microbial taxa. Selbal stands out when the goal is to obtain a parsimonious model with optimal prediction performance, but it is computationally greedy. We provide a reproducible vignette for the application of these methods that will enable researchers to fully leverage their potential in microbiome studies.
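The remark that the centered log-ratio (clr) transformation is not subcompositionally consistent can be made concrete with a small numerical check. This is only an illustrative sketch with made-up abundances and a hypothetical four-taxon sample; it does not use selbal, clr-lasso, or coda-lasso.

```python
import math


def clr(parts):
    """Centered log-ratio of strictly positive parts (scale-invariant)."""
    logs = [math.log(p) for p in parts]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


# Made-up relative abundances for four hypothetical taxa (A, B, C, D).
full = [0.50, 0.30, 0.15, 0.05]
# The same sample with taxon D left out, i.e. the subcomposition (A, B, C).
# No re-closure is needed because clr does not depend on the total.
sub = full[:3]

clr_full = clr(full)
clr_sub = clr(sub)

# The clr value of taxon A depends on which other taxa are included...
print(round(clr_full[0], 3), round(clr_sub[0], 3))

# ...whereas the log-ratio between A and B is the same in both cases.
print(round(math.log(full[0] / full[1]), 3), round(math.log(sub[0] / sub[1]), 3))
```

Because each clr coordinate uses the geometric mean of all included parts as its reference, a coefficient attached to a clr variable is tied to the particular set of taxa in the study, which is one way to read the abstract's caution about transferring clr-lasso signatures.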


2021 ◽  
Author(s):  
Lukáš Rubín ◽  
Aleš Gába ◽  
Jana Pelclová ◽  
Nikola Štefelová ◽  
Lukáš Jakubec ◽  
...  

Abstract Background: To date, no longitudinal study using a compositional approach has examined sedentary behavior (SB) patterns in relation to adiposity in the pediatric population. Therefore, our aims were to (1) investigate the changes in SB patterns and adiposity from childhood to adolescence, (2) analyze the prospective compositional associations between changes in SB patterns and adiposity, and (3) estimate the changes in adiposity associated with substituting SB with physical activity (PA) of different intensities. Methods: The study presents a longitudinal design with a 5-year follow-up. A total of 88 participants (61% girls) were included in the analysis. PA and SB were monitored for seven consecutive days using a hip-worn accelerometer. Adiposity markers (fat mass percentage [FM%], fat mass index [FMI], and visceral adipose tissue [VAT]) were assessed using multi-frequency bioimpedance analysis. The prospective associations were examined using compositional data analysis. Results: Over the follow-up period, the proportion of time spent in total SB increased by 154.8 min/day (p < 0.001). The increase in total SB was caused mainly by an increase in middle and long sedentary bouts, as these SB periods increased by 79.8 min/day and 62 min/day (p < 0.001 for both), respectively. FM%, FMI, and VAT increased by 2.4 percentage points, 1.0 kg/m², and 31.5 cm² (p < 0.001 for all), respectively. Relative to the remaining movement behaviors, the increase in time spent in middle sedentary bouts was significantly associated with higher FM% (βilr1 = 0.27, 95% confidence interval [CI]: 0.02 to 0.53) at follow-up. Lower VAT by 3.3% (95% CI: 0.8 to 5.7), 3.8% (95% CI: 0.03 to 7.4), 3.9% (95% CI: 0.8 to 6.9), and 3.8% (95% CI: 0.7 to 6.9) was associated with substituting 15 min/week spent in total SB and in short, middle, and long sedentary bouts, respectively, with an equivalent amount of time spent in vigorous PA. Conclusions: This study showed unfavorable changes in SB patterns and adiposity status in the transition from childhood to adolescence. Incorporating high-intensity PA at the expense of SB appears to be an appropriate approach to reduce the risk of excess adiposity in the pediatric population.
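The two ingredients behind results of this kind, an isometric log-ratio (ilr) pivot coordinate and a time reallocation between behaviors, can be sketched as follows. This is a hedged illustration only: the minutes are hypothetical, the function names are invented for the example, and no regression model from the study is reproduced.

```python
import math


def pivot_coordinate(parts, index=0):
    """First ilr pivot coordinate of parts[index] relative to the other parts.

    z = sqrt((D - 1) / D) * ln(x_index / geometric_mean(other parts))
    """
    d = len(parts)
    others = [p for i, p in enumerate(parts) if i != index]
    mean_log_others = sum(math.log(p) for p in others) / len(others)
    return math.sqrt((d - 1) / d) * (math.log(parts[index]) - mean_log_others)


def reallocate(parts, source, target, minutes):
    """Move `minutes` from one part to another, keeping the total fixed."""
    moved = list(parts)
    moved[source] -= minutes
    moved[target] += minutes
    return moved


# Hypothetical composition in minutes: [SB, light PA, moderate PA, vigorous PA].
day = [540.0, 300.0, 50.0, 10.0]
# Hypothetical substitution of 5 minutes of SB with vigorous PA.
swapped = reallocate(day, source=0, target=3, minutes=5.0)

# Pivot coordinate for SB before and after the substitution.
print(round(pivot_coordinate(day, 0), 3), round(pivot_coordinate(swapped, 0), 3))
# Pivot coordinate for vigorous PA before and after the substitution.
print(round(pivot_coordinate(day, 3), 3), round(pivot_coordinate(swapped, 3), 3))
```

In a substitution analysis, a fitted compositional regression would be evaluated at both the original and the reallocated composition, and the difference in predictions reported as the estimated change in the outcome.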


PeerJ ◽  
2018 ◽  
Vol 6 ◽  
pp. e5643 ◽  
Author(s):  
Fiona Chong ◽  
Matthew Spencer

Ecologists often analyze relative abundances, which are an example of compositional data. However, they have made surprisingly little use of recent advances in the field of compositional data analysis. Compositions form a vector space in which addition and scalar multiplication are replaced by operations known as perturbation and powering. This algebraic structure makes it easy to understand how relative abundances change along environmental gradients. We illustrate this with an analysis of changes in hard-substrate marine communities along a depth gradient. We fit a quadratic multivariate regression model with multinomial observations to point count data obtained from video transects. As well as being an appropriate observation model in this case, the multinomial deals with the problem of zeros, which often makes compositional data analysis difficult. We show how the algebra of compositions can be used to understand patterns in dissimilarity. We use the calculus of simplex-valued functions to estimate rates of change, and to summarize the structure of the community over a vertical slice. We discuss the benefits of the compositional approach in the interpretation and visualization of relative abundance data.
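Perturbation and powering, the operations that replace addition and scalar multiplication on the simplex, are short to write down. The sketch below uses made-up abundances and a made-up per-metre change to illustrate how a composition could be moved along a gradient; it is not the regression model fitted in the paper.

```python
def closure(parts):
    """Rescale positive parts so that they sum to 1."""
    total = sum(parts)
    return [p / total for p in parts]


def perturb(x, y):
    """Perturbation: the simplex analogue of vector addition."""
    return closure([a * b for a, b in zip(x, y)])


def power(x, alpha):
    """Powering: the simplex analogue of scalar multiplication."""
    return closure([a ** alpha for a in x])


# Made-up relative abundances of three taxa at a shallow site.
shallow = closure([0.6, 0.3, 0.1])
# Made-up change per metre of depth, itself expressed as a composition.
change_per_metre = closure([0.9, 1.0, 1.2])

# Composition after 10 m: shallow "plus" 10 "times" the per-metre change.
deep = perturb(shallow, power(change_per_metre, 10))
print([round(p, 3) for p in deep])
```

The uniform composition acts as the zero vector under perturbation, and powering by -1 gives the inverse, so a linear trend along a gradient has the same form as an ordinary straight line once these operations are substituted in.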


2020 ◽  
Author(s):  
Luis P.V. Braga ◽  
Dina Feigenbaum

Abstract Background: Covid-19 cases data pose an enormous challenge to any analysis. The evaluation of such a global pandemic requires matching reports that follow different procedures and even overcoming some countries’ censorship that restricts publications. Methods: This work proposes a methodology that could assist future studies. Compositional Data Analysis (CoDa) is proposed as the proper approach, as Covid-19 cases data is compositional in nature. Under this methodology, three attributes were selected for each country: cumulative number of deaths (D); cumulative number of recovered patients (R); present number of patients (A). Results: After the operation called closure, with c=1, a ternary diagram and log-ratio plots, as well as compositional statistics, are presented. Cluster analysis is then applied, splitting the countries into discrete groups. Conclusions: This methodology can also be applied to other data sets such as countries, cities, provinces or districts in order to help authorities and governmental agencies to improve their actions to fight against a pandemic.
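As a rough sketch of the closure step with c=1 for a (D, R, A) composition, the example below closes two sets of made-up figures and computes the Aitchison distance between them. The figures and country labels are hypothetical, and the use of the Aitchison distance as the dissimilarity for clustering is an assumption of this example rather than something stated in the abstract.

```python
import math


def closure(parts, c=1.0):
    """Rescale positive parts so that they sum to the constant c."""
    total = sum(parts)
    return [c * p / total for p in parts]


def clr(parts):
    """Centered log-ratio of strictly positive parts."""
    logs = [math.log(p) for p in parts]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


def aitchison_distance(x, y):
    """Euclidean distance between the clr vectors of two compositions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(clr(x), clr(y))))


# Made-up figures for two hypothetical countries: (deaths D, recovered R, active A).
country_x = (1200, 45000, 8000)
country_y = (300, 2000, 9000)

closed_x = closure(country_x)  # parts now sum to c = 1
closed_y = closure(country_y)

print([round(p, 4) for p in closed_x])
print([round(p, 4) for p in closed_y])
print(round(aitchison_distance(closed_x, closed_y), 3))
```

Closure removes the effect of the total number of reported cases, so two countries with very different case loads but the same D:R:A ratios would sit at distance zero from each other.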


GigaScience ◽  
2019 ◽  
Vol 8 (9) ◽  
Author(s):  
Thomas P Quinn ◽  
Ionas Erb ◽  
Greg Gloor ◽  
Cedric Notredame ◽  
Mark F Richardson ◽  
...  

Abstract Background: Next-generation sequencing (NGS) has made it possible to determine the sequence and relative abundance of all nucleotides in a biological or environmental sample. A cornerstone of NGS is the quantification of RNA or DNA presence as counts. However, these counts are not counts per se: their magnitude is determined arbitrarily by the sequencing depth, not by the input material. Consequently, counts must undergo normalization prior to use. Conventional normalization methods require a set of assumptions: they assume that the majority of features are unchanged and that all environments under study have the same carrying capacity for nucleotide synthesis. These assumptions are often untestable and may not hold when heterogeneous samples are compared. Results: Methods developed within the field of compositional data analysis offer a general solution that is assumption-free and valid for all data. Herein, we synthesize the extant literature to provide a concise guide on how to apply compositional data analysis to NGS count data. Conclusions: In highlighting the limitations of total library size, effective library size, and spike-in normalizations, we propose the log-ratio transformation as a general solution to answer the question, “Relative to some important activity of the cell, what is changing?”
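The point that count magnitude reflects sequencing depth rather than input material can be checked numerically: a log-ratio transformation gives the same values whether a library is sequenced shallowly or deeply. The sketch below uses the centered log-ratio as one example of a log-ratio transformation, with made-up counts that are assumed to be strictly positive (zeros would need separate handling).

```python
import math


def clr(counts):
    """Centered log-ratio: log of each count minus the mean log count."""
    logs = [math.log(c) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


# Made-up counts for the same hypothetical library at two sequencing depths.
shallow_run = [20, 100, 380]
deep_run = [c * 10 for c in shallow_run]

# Raw counts differ tenfold, but the log-ratio coordinates coincide.
print([round(v, 3) for v in clr(shallow_run)])
print([round(v, 3) for v in clr(deep_run)])
```

The reference in this sketch is the geometric mean of all features; choosing a different reference changes what "relative to" means in the interpretation, which is the question the abstract closes on.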


2020 ◽  
Vol 2 (4) ◽  
Author(s):  
Laura Sisk-Hackworth ◽  
Scott T Kelley

Abstract Compositional data analysis (CoDA) methods have increased in popularity as a new framework for analyzing next-generation sequencing (NGS) data. CoDA methods, such as the centered log-ratio (clr) transformation, adjust for the compositional nature of NGS counts, which is not addressed by traditional normalization methods. CoDA has only been sparsely applied to NGS data generated from microbial communities or to multiple ‘omics’ datasets. In this study, we applied CoDA methods to analyze NGS and untargeted metabolomic datasets obtained from bacterial and fungal communities. Specifically, we used clr transformation to reanalyze NGS amplicon and metabolomics data from a study investigating the effects of building material type, moisture and time on microbial and metabolomic diversity. Compared to analysis of untransformed data, analysis of clr-transformed data revealed novel relationships and stronger associations between sample conditions and microbial and metabolic community profiles.


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Jacob Bien ◽  
Xiaohan Yan ◽  
Léo Simpson ◽  
Christian L. Müller

Abstract Modern high-throughput sequencing technologies provide low-cost microbiome survey data across all habitats of life at unprecedented scale. At the most granular level, the primary data consist of sparse counts of amplicon sequence variants or operational taxonomic units that are associated with taxonomic and phylogenetic group information. In this contribution, we leverage the hierarchical structure of amplicon data and propose a data-driven and scalable tree-guided aggregation framework to associate microbial subcompositions with response variables of interest. The excess number of zero or low count measurements at the read level forces traditional microbiome data analysis workflows to remove rare sequencing variants or group them by a fixed taxonomic rank, such as genus or phylum, or by phylogenetic similarity. By contrast, our framework, which we call trac (tree-aggregation of compositional data), learns data-adaptive taxon aggregation levels for predictive modeling, greatly reducing the need for user-defined aggregation in preprocessing while simultaneously integrating seamlessly into the compositional data analysis framework. We illustrate the versatility of our framework in the context of large-scale regression problems in human gut, soil, and marine microbial ecosystems. We posit that the inferred aggregation levels provide highly interpretable taxon groupings that can help microbiome researchers gain insights into the structure and functioning of the underlying ecosystem of interest.

