Amalgams: data-driven amalgamation for the dimensionality reduction of compositional data

Abstract Many next-generation sequencing datasets contain only relative information because of biological and technical factors that limit the total number of transcripts observed for a given sample. It is not possible to interpret any one component in isolation. The field of compositional data analysis has emerged with alternative methods for relative data based on log-ratio transforms. However, these data often contain many more features than samples, and thus require creative new ways to reduce the dimensionality of the data. The summation of parts, called amalgamation, is a practical way of reducing dimensionality, but can introduce a non-linear distortion to the data. We exploit this non-linearity to propose a powerful yet interpretable dimension reduction method called data-driven amalgamation. Our new method, implemented in the user-friendly R package amalgam, can reduce the dimensionality of compositional data by finding amalgamations that optimally (i) preserve the distance between samples, or (ii) classify samples as diseased or not. Our benchmark on 13 real datasets confirms that these amalgamations compete with state-of-the-art methods in terms of performance, but result in new features that are easily understood: they are groups of parts added together.


Amalgams: data-driven amalgamation for the reference-free dimensionality reduction of zero-laden compositional data

10.1101/2020.02.27.968677 ◽

2020 ◽

Cited By ~ 1

Author(s):

Thomas P. Quinn ◽

Ionas Erb

Keyword(s):

Dimension Reduction ◽

Domain Knowledge ◽

Compositional Data ◽

Real Data ◽

R Package ◽

Data Driven ◽

Alternative Methods ◽

Compositional Data Analysis ◽

Data Sets ◽

Log Ratio

Abstract In the health sciences, many data sets produced by next-generation sequencing (NGS) only contain relative information because of biological and technical factors that limit the total number of nucleotides observed for a given sample. Because the components are mutually dependent, it is not possible to interpret any one of them in isolation, at least without normalization. The field of compositional data analysis (CoDA) has emerged with alternative methods for relative data based on log-ratio transforms. However, NGS data often contain many more features than samples, and thus require creative new ways to reduce the dimensionality of the data without sacrificing interpretability. The summation of parts, called amalgamation, is a practical way of reducing dimensionality, but can introduce a non-linear distortion to the data. We exploit this non-linearity to propose a powerful yet interpretable dimension reduction method. In this report, we present data-driven amalgamation as a new method and conceptual framework for reducing the dimensionality of compositional data. Unlike expert-driven amalgamation, which requires prior domain knowledge, our data-driven amalgamation method uses a genetic algorithm to answer the question, “What is the best way to amalgamate the data to achieve the user-defined objective?” We present a user-friendly R package, called amalgam, that can quickly find the optimal amalgamation to (a) preserve the distance between samples, or (b) classify samples as diseased or not. Our benchmark on 13 real data sets confirms that these amalgamations compete with the state-of-the-art unsupervised and supervised dimension reduction methods in terms of performance, but result in new variables that are much easier to understand: they are groups of features added together.
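As a rough illustration of what amalgamation does, the following Python sketch sums the parts of a toy composition according to a fixed grouping. It is not the amalgam package's interface: the data, the grouping and the helper names `closure` and `amalgamate` are assumptions made for this example. It also shows the non-linearity mentioned in the abstract, since the log of a sum of parts is not the sum of their logs.

```python
import math

def closure(parts):
    """Rescale positive parts so they sum to 1."""
    total = sum(parts)
    return [p / total for p in parts]

def amalgamate(parts, groups):
    """Sum parts into amalgams; groups[i] is the amalgam index of part i."""
    n_amalgams = max(groups) + 1
    out = [0.0] * n_amalgams
    for value, g in zip(parts, groups):
        out[g] += value
    return out

# Hypothetical 5-part composition and a hypothetical 5 -> 2 grouping.
x = closure([10, 20, 30, 25, 15])
groups = [0, 0, 1, 1, 1]
a = amalgamate(x, groups)

# The amalgams still form a composition: they sum to 1.
total = sum(a)

# Non-linearity: log(x1 + x2) differs from log(x1) + log(x2),
# so amalgamation is not a linear map on log-transformed parts.
log_of_sum = math.log(x[0] + x[1])
sum_of_logs = math.log(x[0]) + math.log(x[1])
```

A data-driven search, such as the genetic algorithm the abstract describes, would score many candidate `groups` vectors against the chosen objective; that search is not shown here.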


Individual categorisation of glucose profiles using compositional data analysis

Statistical Methods in Medical Research ◽

10.1177/0962280218808819 ◽

2018 ◽

Vol 28 (12) ◽

pp. 3550-3567 ◽

Cited By ~ 2

Author(s):

Lyvia Biagi ◽

Arthur Bertachi ◽

Marga Giménez ◽

Ignacio Conget ◽

Jorge Bondia ◽

...

Keyword(s):

Data Analysis ◽

Compositional Data ◽

Glucose Monitoring ◽

Glucose Variability ◽

Compositional Data Analysis ◽

Relative Information ◽

Glucose Profiles ◽

Patient Groups ◽

The Times

The aim of this study was to apply a methodology based on compositional data analysis (CoDA) to categorise glucose profiles obtained from continuous glucose monitoring systems. The methodology proposed considers complete daily glucose profiles obtained from six patients with type 1 diabetes (T1D) who had their glucose monitored for eight weeks. The glucose profiles were distributed into the time spent in six different ranges. The time in one day is finite and limited to 24 h, and the times spent in each of these different ranges are co-dependent and carry only relative information; therefore, CoDA is applied to these profiles. A K-means algorithm was applied to the coordinates obtained from the CoDA to obtain different patterns of days for each patient. Groups of days with relatively high time in the hypo and/or hyperglycaemic ranges and with different glucose variability were observed. Using CoDA of time in different ranges, individual glucose profiles were categorised into groups of days, which can be used by physicians to detect the different conditions of patients and personalise each patient's insulin therapy according to each group. This approach can be useful to assist physicians and patients in managing the day-to-day variability that hinders glycaemic control.
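The step from time-in-range profiles to coordinates that a clustering algorithm can use might look like the Python sketch below. It is only an illustration under stated assumptions: the hours are invented, every range is assumed to have non-zero time, and the centered log-ratio (clr) is used as one possible choice of log-ratio coordinates, which may differ from the coordinates used in the study. The Euclidean distance between clr coordinates is the Aitchison distance, which is what a K-means step would operate on.

```python
import math

def clr(parts):
    """Centered log-ratio: log of each part over the geometric mean."""
    logs = [math.log(p) for p in parts]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

def aitchison_distance(x, y):
    """Euclidean distance between the clr coordinates of x and y."""
    return math.dist(clr(x), clr(y))

# Hypothetical hours spent in six glucose ranges on two days (each sums to 24).
day_a = [0.5, 1.5, 14.0, 5.0, 2.0, 1.0]
day_b = [2.0, 3.0, 10.0, 4.0, 3.0, 2.0]

coords_a = clr(day_a)
d_ab = aitchison_distance(day_a, day_b)

# Scale invariance: hours and fractions of the day give the same distance.
frac_a = [h / 24 for h in day_a]
frac_b = [h / 24 for h in day_b]
d_frac = aitchison_distance(frac_a, frac_b)
```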


Variable selection in microbiome compositional data analysis

NAR Genomics and Bioinformatics ◽

10.1093/nargab/lqaa029 ◽

2020 ◽

Vol 2 (2) ◽

Cited By ~ 2

Author(s):

Antoni Susin ◽

Yiwen Wang ◽

Kim-Anh Lê Cao ◽

M Luz Calle

Keyword(s):

Data Analysis ◽

Variable Selection ◽

Compositional Data ◽

Penalized Regression ◽

Compositional Data Analysis ◽

Forward Selection ◽

Computationally Efficient ◽

Parsimonious Model ◽

Microbiome Data ◽

Log Ratio

Abstract Though variable selection is one of the most relevant tasks in microbiome analysis, e.g. for the identification of microbial signatures, many studies still rely on methods that ignore the compositional nature of microbiome data. The applicability of compositional data analysis methods has been hampered by the availability of software and the difficulty in interpreting their results. This work is focused on three methods for variable selection that acknowledge the compositional structure of microbiome data: selbal, a forward selection approach for the identification of compositional balances, and clr-lasso and coda-lasso, two penalized regression models for compositional data analysis. This study highlights the link between these methods and brings out some limitations of the centered log-ratio transformation for variable selection. In particular, the fact that it is not subcompositionally consistent makes the microbial signatures obtained from clr-lasso not readily transferable. Coda-lasso is computationally efficient and suitable when the focus is the identification of the most associated microbial taxa. Selbal stands out when the goal is to obtain a parsimonious model with optimal prediction performance, but it is computationally greedy. We provide a reproducible vignette for the application of these methods that will enable researchers to fully leverage their potential in microbiome studies.
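The remark that the centered log-ratio (clr) transformation is not subcompositionally consistent can be checked on a toy example. In the Python sketch below, the abundances are invented and the `clr` helper is written out for illustration rather than taken from any of the packages named above. Dropping one taxon changes the clr value of the remaining taxa, while a pairwise log-ratio between two taxa is unaffected.

```python
import math

def clr(parts):
    """Centered log-ratio: log of each part over the geometric mean."""
    logs = [math.log(p) for p in parts]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

# Hypothetical abundances of four taxa in one sample.
full = [40.0, 30.0, 20.0, 10.0]
sub = full[:3]  # the same sample with the last taxon left out

clr_full = clr(full)
clr_sub = clr(sub)

# The clr value of the first taxon depends on which taxa are included...
shift = clr_sub[0] - clr_full[0]

# ...whereas the log-ratio between two taxa does not.
lr_full = math.log(full[0] / full[1])
lr_sub = math.log(sub[0] / sub[1])
```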


Counts: an outstanding challenge for log-ratio analysis of compositional data in the molecular biosciences

NAR Genomics and Bioinformatics ◽

10.1093/nargab/lqaa040 ◽

2020 ◽

Vol 2 (2) ◽

Cited By ~ 1

Author(s):

David R Lovell ◽

Xin-Yi Chua ◽

Annette McGrath

Keyword(s):

Count Data ◽

Compositional Data ◽

Compositional Data Analysis ◽

Ratio Analysis ◽

Sequencing Technology ◽

Scale Invariant ◽

Measurement And Analysis ◽

Discrete Nature ◽

The Impact ◽

Log Ratio

Abstract Thanks to sequencing technology, modern molecular bioscience datasets are often compositions of counts, e.g. counts of amplicons, mRNAs, etc. While there is growing appreciation that compositional data need special analysis and interpretation, less well understood is the discrete nature of these count compositions (or, as we call them, lattice compositions) and the impact this has on statistical analysis, particularly log-ratio analysis (LRA) of pairwise association. While LRA methods are scale-invariant, count compositional data are not; consequently, the conclusions we draw from LRA of lattice compositions depend on the scale of counts involved. We know that additive variation affects the relative abundance of small counts more than large counts; here we show that additive (quantization) variation comes from the discrete nature of count data itself, as well as (biological) variation in the system under study and (technical) variation from measurement and analysis processes. Variation due to quantization is inevitable, but its impact on conclusions depends on the underlying scale and distribution of counts. We illustrate the different distributions of real molecular bioscience data from different experimental settings to show why it is vital to understand the distributional characteristics of count data before applying and drawing conclusions from compositional data analysis methods.
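The claim that additive variation affects small counts more than large counts can be made concrete with a small calculation. The Python sketch below uses invented counts and a hypothetical helper, `log_ratio_shift`, to compare how much a log-ratio moves when a single extra read is added to a small count and to a large one.

```python
import math

def log_ratio_shift(count, reference, delta=1):
    """Change in log(count / reference) when `count` gains `delta` reads."""
    return math.log((count + delta) / reference) - math.log(count / reference)

reference = 1000  # hypothetical reference count

# One extra read on a count of 2 versus a count of 2000.
small_shift = log_ratio_shift(2, reference)     # equals log(3 / 2)
large_shift = log_ratio_shift(2000, reference)  # equals log(2001 / 2000)
```

Here the same one-read change moves the log-ratio of the small count several hundred times further than that of the large count, which is why the scale of the counts matters for conclusions drawn from log-ratio analysis.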


Performance Assessment in Water Polo Using Compositional Data Analysis

Journal of Human Kinetics ◽

10.1515/hukin-2016-0043 ◽

2016 ◽

Vol 54 (1) ◽

pp. 143-151 ◽

Cited By ~ 4

Author(s):

Enrique García Ordóñez ◽

María del Carmen Iglesias Pérez ◽

Carlos Touriño González

Keyword(s):

Performance Indicators ◽

Cross Validation ◽

Compositional Data ◽

Compositional Data Analysis ◽

Original Sample ◽

Water Polo ◽

Discriminant Analyses ◽

Multivariate Discriminant ◽

Log Ratio ◽

Match Score

Abstract The aim of the present study was to identify the groups of offensive performance indicators that best discriminated between match scores (favourable, balanced or unfavourable) in water polo. The sample comprised 88 regular season games (2011-2014) from the Spanish Professional Water Polo League. The offensive performance indicators were clustered in five groups: Attacks in relation to the different playing situations; Shots in relation to the different playing situations; Attacks outcome; Origin of shots; Technical execution of shots. The variables of each group had a constant sum which equalled 100%. Because the data were compositional, the variables were transformed by means of the additive log-ratio (alr) transformation. Multivariate discriminant analyses to compare the match scores were calculated using the transformed variables. With regard to the percentage of correct classification, the results showed that the group that discriminated the most between the match scores was “Attacks outcome” (60.4% for the original sample and 52.2% for cross-validation). The performance indicators that discriminated the most between the match scores in games with penalties were goals (structure coefficient (SC) = .761), counterattack shots (SC = .541) and counterattacks (SC = .481). In matches without penalties, goals were the primary discriminating factor (SC = .576). This approach provides a new tool to compare the importance of the offensive performance groups and their effect on the match score discrimination.
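The additive log-ratio (alr) transformation mentioned above takes the log of each part relative to one reference part, so a composition of D parts yields D - 1 coordinates. The Python sketch below is a minimal illustration: the percentages are invented, the choice of the last part as reference is an assumption, and all parts are assumed to be strictly positive.

```python
import math

def alr(parts, ref_index=-1):
    """Additive log-ratio: log of each part over a chosen reference part."""
    ref = parts[ref_index]
    ref_pos = ref_index % len(parts)  # normalise negative indices
    return [math.log(p / ref) for i, p in enumerate(parts) if i != ref_pos]

# Hypothetical percentages for one group of indicators in one match (sum to 100).
outcome = [20.0, 50.0, 30.0]
coords = alr(outcome)  # two coordinates for three parts

# The result does not depend on whether parts are given as percentages or proportions.
coords_prop = alr([0.2, 0.5, 0.3])
```

The resulting coordinates are unconstrained real numbers, which is what allows standard multivariate methods such as discriminant analysis to be applied to them.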


Assessing Global Covid-19 Cases Data through Compositional Data Analysis (CoDa)

10.1101/2020.12.17.20248424 ◽

2020 ◽

Author(s):

Luis P.V. Braga ◽

Dina Feigenbaum

Keyword(s):

Data Analysis ◽

Compositional Data ◽

Compositional Data Analysis ◽

Discrete Groups ◽

Data Sets ◽

Cumulative Number ◽

Governmental Agencies ◽

Global Pandemic ◽

Number Of Patients ◽

Log Ratio

Abstract Background Covid-19 cases data pose an enormous challenge to any analysis. The evaluation of such a global pandemic requires matching reports that follow different procedures and even overcoming some countries’ censorship that restricts publications. Methods This work proposes a methodology that could assist future studies. Compositional Data Analysis (CoDa) is proposed as the proper approach, as Covid-19 cases data are compositional in nature. Under this methodology, three attributes were selected for each country: cumulative number of deaths (D); cumulative number of recovered patients (R); present number of patients (A). Results After the operation called closure, with c=1, a ternary diagram, log-ratio plots and compositional statistics are presented. Cluster analysis is then applied, splitting the countries into discrete groups. Conclusions This methodology can also be applied to other data sets, such as those for countries, cities, provinces or districts, in order to help authorities and governmental agencies improve their actions against a pandemic.
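The closure operation with c=1 rescales the three attributes of each country so that they sum to 1, which is the form needed to place a country on a ternary diagram. A minimal Python sketch follows; the counts are invented for illustration and are not taken from the study.

```python
def closure(parts, c=1.0):
    """Rescale positive parts so that they sum to the constant c."""
    total = sum(parts)
    return [c * p / total for p in parts]

# Hypothetical counts for one country: deaths (D), recovered (R), active (A).
d, r, a = 500, 7500, 2000
composition = closure([d, r, a], c=1.0)  # proportions of D, R and A
```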


A field guide for the compositional analysis of any-omics data

GigaScience ◽

10.1093/gigascience/giz107 ◽

2019 ◽

Vol 8 (9) ◽

Cited By ~ 22

Author(s):

Thomas P Quinn ◽

Ionas Erb ◽

Greg Gloor ◽

Cedric Notredame ◽

Mark F Richardson ◽

...

Keyword(s):

Data Analysis ◽

General Solution ◽

Compositional Data ◽

Compositional Analysis ◽

Compositional Data Analysis ◽

Nucleotide Synthesis ◽

Library Size ◽

Next Generation Sequencing Ngs ◽

Concise Guide ◽

Log Ratio

Abstract Background Next-generation sequencing (NGS) has made it possible to determine the sequence and relative abundance of all nucleotides in a biological or environmental sample. A cornerstone of NGS is the quantification of RNA or DNA presence as counts. However, these counts are not counts per se: their magnitude is determined arbitrarily by the sequencing depth, not by the input material. Consequently, counts must undergo normalization prior to use. Conventional normalization methods require a set of assumptions: they assume that the majority of features are unchanged and that all environments under study have the same carrying capacity for nucleotide synthesis. These assumptions are often untestable and may not hold when heterogeneous samples are compared. Results Methods developed within the field of compositional data analysis offer a general solution that is assumption-free and valid for all data. Herein, we synthesize the extant literature to provide a concise guide on how to apply compositional data analysis to NGS count data. Conclusions In highlighting the limitations of total library size, effective library size, and spike-in normalizations, we propose the log-ratio transformation as a general solution to answer the question, “Relative to some important activity of the cell, what is changing?”
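The point that count magnitudes are set by sequencing depth, while log-ratios are not, can be illustrated with a short Python sketch. The counts, the feature names and the helper `log_ratios_to_reference` are assumptions made for this example; "ref_gene" simply stands in for whatever reference the analyst chooses.

```python
import math

def log_ratios_to_reference(counts, reference):
    """Log-ratio of every feature to a named reference feature."""
    ref = counts[reference]
    return {
        name: math.log(c / ref)
        for name, c in counts.items()
        if name != reference
    }

# Hypothetical counts for one sample at two sequencing depths.
shallow = {"gene_a": 120, "gene_b": 30, "ref_gene": 600}
deep = {name: 10 * c for name, c in shallow.items()}  # 10x the depth

lr_shallow = log_ratios_to_reference(shallow, "ref_gene")
lr_deep = log_ratios_to_reference(deep, "ref_gene")
```

The raw counts differ tenfold between the two runs, but the log-ratios to the reference are the same, so they describe the sample rather than the depth at which it was sequenced.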


An application of compositional data analysis to multiomic time-series data

NAR Genomics and Bioinformatics ◽

10.1093/nargab/lqaa079 ◽

2020 ◽

Vol 2 (4) ◽

Cited By ~ 1

Author(s):

Laura Sisk-Hackworth ◽

Scott T Kelley

Keyword(s):

Data Analysis ◽

Time Series Data ◽

Compositional Data ◽

Series Data ◽

Compositional Data Analysis ◽

Metabolomics Data ◽

Normalization Methods ◽

Next Generation Sequencing Ngs ◽

Ngs Data ◽

Log Ratio

Abstract Compositional data analysis (CoDA) methods have increased in popularity as a new framework for analyzing next-generation sequencing (NGS) data. CoDA methods, such as the centered log-ratio (clr) transformation, adjust for the compositional nature of NGS counts, which is not addressed by traditional normalization methods. CoDA has only been sparsely applied to NGS data generated from microbial communities or to multiple ‘omics’ datasets. In this study, we applied CoDA methods to analyze NGS and untargeted metabolomic datasets obtained from bacterial and fungal communities. Specifically, we used clr transformation to reanalyze NGS amplicon and metabolomics data from a study investigating the effects of building material type, moisture and time on microbial and metabolomic diversity. Compared to analysis of untransformed data, analysis of clr-transformed data revealed novel relationships and stronger associations between sample conditions and microbial and metabolic community profiles.


Analysing body composition as compositional data: An exploration of the relationship between body composition, body mass and bone strength

Statistical Methods in Medical Research ◽

10.1177/0962280220955221 ◽

2020 ◽

pp. 096228022095522

Author(s):

D Dumuid ◽

JA Martín-Fernández ◽

S Ellul ◽

RS Kenett ◽

M Wake ◽

...

Keyword(s):

Body Composition ◽

Data Analysis ◽

Health Outcomes ◽

Bone Strength ◽

Body Mass ◽

Statistical Approach ◽

Compositional Data ◽

Fat Free Mass ◽

Compositional Data Analysis ◽

Relative Information

Human body composition is made up of mutually exclusive and exhaustive parts (e.g. %truncal fat, %non-truncal fat and %fat-free mass) which are constrained to sum to the same total (100%). In statistical analyses, individual parts of body composition (e.g. %truncal fat or %fat-free mass) have traditionally been used as proxies for body composition, and have been linked with a range of health outcomes. But analysis of individual parts omits information about the other parts, which are intrinsically co-dependent because of the constant sum constraint of 100%. Further, body mass may be associated with health outcomes. We describe a statistical approach for body composition based on compositional data analysis. The body composition data are expressed as logratios to allow relative information about all the compositional parts to be explored simultaneously in relation to health outcomes. We describe a recent extension to the logratio approach to compositional data analysis which allows absolute information about the total of the compositional parts (body mass) to be considered alongside relative information about body composition. The statistical approach is illustrated by an example that explores the relationships between adults’ body composition, body mass and bone strength.


Impact of Covariates in Compositional Models and Simplicial Derivatives

Austrian Journal of Statistics ◽

10.17713/ajs.v50i2.1069 ◽

2021 ◽

Vol 50 (2) ◽

pp. 1-15 ◽

Cited By ~ 1

Author(s):

Joanna Morais ◽

Christine Thomas-Agnan

Keyword(s):

Data Analysis ◽

Regression Models ◽

Compositional Data ◽

Explanatory Variable ◽

Regression Equation ◽

Compositional Data Analysis ◽

Linear Regression Models ◽

Explanatory Variables ◽

Relative Information ◽

Single Side

In the framework of Compositional Data Analysis, vectors carrying relative information, also called compositional vectors, can appear in regression models either as dependent or as explanatory variables. In some situations, they can be on both sides of the regression equation. Measuring the marginal impacts of covariates in these types of models is not straightforward since a change in one component of a closed composition automatically affects the rest of the composition. Previous work by the authors has shown how to measure, compute and interpret these marginal impacts in the case of linear regression models with compositions on both sides of the equation. The resulting natural interpretation is in terms of an elasticity, a quantity commonly used in econometrics and marketing applications. They also demonstrate the link between these elasticities and simplicial derivatives. The aim of this contribution is to extend these results to other situations, namely when the compositional vector is on a single side of the regression equation. In these cases, the marginal impact is related to a semi-elasticity and also linked to some simplicial derivative. Moreover we consider the possibility that a total variable is used as an explanatory variable, with several possible interpretations of this total and we derive the elasticity formulas in that case.
