Compositional Data Analysis

2021
Vol 8 (1)
pp. 271-299
Author(s):  
Michael Greenacre

Compositional data are nonnegative data carrying relative, rather than absolute, information—these are often data with a constant-sum constraint on the sample values, for example, proportions or percentages summing to 1 or 100%, respectively. Ratios between components of a composition are important since they are unaffected by the particular set of components chosen. Logarithms of ratios (logratios) are the fundamental transformation in the ratio approach to compositional data analysis—all data thus need to be strictly positive, so that zero values present a major problem. Components that group together based on domain knowledge can be amalgamated (i.e., summed) to create new components, and this can alleviate the problem of data zeros. Once compositional data are transformed to logratios, regular univariate and multivariate statistical analyses can be performed, such as dimension reduction and clustering, as well as modeling. Alternative methodologies that come close to the ideals of the logratio approach are also considered, especially those that avoid the problem of data zeros, which is particularly acute in large bioinformatic data sets.
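The following sketch illustrates, on a small hypothetical data set, the core operations this abstract describes: closure, a pairwise logratio, and the centered logratio (CLR) transform. The numbers and variable names are illustrative only, not from the paper.

```python
import numpy as np

# Two hypothetical compositions (rows); all parts strictly positive.
x = np.array([[2.0, 3.0, 5.0],
              [1.0, 4.0, 5.0]])

# Closure: rescale each row to sum to 1. Only relative information
# is kept, and ratios between components are unchanged.
comp = x / x.sum(axis=1, keepdims=True)

# A pairwise logratio such as log(part 1 / part 2) is identical on the
# raw and the closed data, which is why ratios are the natural currency.
assert np.allclose(np.log(x[:, 0] / x[:, 1]),
                   np.log(comp[:, 0] / comp[:, 1]))

# Centered logratio (CLR): each part relative to the row's geometric
# mean. Standard multivariate methods can then be applied to `clr`.
clr = np.log(comp) - np.log(comp).mean(axis=1, keepdims=True)
```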

2016
Vol 27 (6)
pp. 1878-1891
Author(s):  
Mehmet C Mert
Peter Filzmoser
Gottfried Endel
Ingrid Wilbacher

Compositional data analysis refers to analyzing relative information, based on ratios between the variables in a data set. Data from epidemiology are usually treated as absolute information in an analysis. We outline the differences between the two approaches for univariate and multivariate statistical analyses, using illustrative data sets from Austrian districts. Not only can the results of the analyses differ; in particular, the interpretation differs. It is demonstrated that the compositional data analysis approach leads to new and interesting insights.
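As a toy illustration of why the two approaches can disagree (the numbers below are made up, not the Austrian district data): a category can grow in absolute terms while shrinking as a share of the whole.

```python
import numpy as np

# Hypothetical counts for two health-care categories in two years.
counts = np.array([[100.0, 300.0],   # year 1
                   [150.0, 600.0]])  # year 2

# Absolute view: category 1 grew from 100 to 150 cases (+50%).
# Compositional view: its share of all cases fell from 0.25 to 0.20.
shares = counts / counts.sum(axis=1, keepdims=True)
print(counts[:, 0])  # [100. 150.]
print(shares[:, 0])  # [0.25 0.2 ]
```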


2021
Vol 50 (2)
pp. 38-55
Author(s):  
Carolina Navarro
Silvia Gonzalez-Morcillo
Carles Mulet-Forteza
Salvador Linares-Mustaros

This study presents a comprehensive bibliometric analysis of the paper published by John Aitchison in the Journal of the Royal Statistical Society. Series B (Methodological) in 1982. Having recently reached the milestone of 35 years since its publication, this pioneering paper was the first to illustrate the use of the methodology "Compositional Data Analysis", or "CoDA". By October 2019, this paper had received over 780 citations, making it the most widely cited and influential article among those using said methodology. The bibliometric approach used in this study encompasses a wide range of techniques, including a specific analysis of the main authors and institutions to have cited Aitchison's paper. The VOSviewer software was also used to develop network maps for the publication; specifically, the techniques used were co-citation and bibliographic coupling. The results clearly show the significant impact the paper has had on scientific research, having been cited by authors and institutions from all around the world.


2020
Author(s):  
Luis P.V. Braga
Dina Feigenbaum

Background: Covid-19 cases data pose an enormous challenge to any analysis. The evaluation of such a global pandemic requires matching reports that follow different procedures and even overcoming some countries' censorship that restricts publications.
Methods: This work proposes a methodology that could assist future studies. Compositional Data Analysis (CoDa) is proposed as the proper approach, as Covid-19 cases data are compositional in nature. Under this methodology, three attributes were selected for each country: cumulative number of deaths (D), cumulative number of recovered patients (R), and present number of patients (A).
Results: After the operation called closure, with c = 1, a ternary diagram and log-ratio plots, as well as compositional statistics, are presented. Cluster analysis is then applied, splitting the countries into discrete groups.
Conclusions: This methodology can also be applied to other data sets, such as countries, cities, provinces, or districts, in order to help authorities and governmental agencies improve their actions to fight against a pandemic.
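A minimal sketch of the closure step described in the Results, with invented case counts standing in for the real country data:

```python
import numpy as np

# Hypothetical country rows; columns are cumulative deaths (D),
# cumulative recovered patients (R), and present patients (A).
cases = np.array([[5000.0, 80000.0, 15000.0],
                  [1200.0, 30000.0,  8800.0]])

# Closure with c = 1: each (D, R, A) triple is rescaled to sum to 1,
# so every country becomes a point in the ternary diagram.
closed = cases / cases.sum(axis=1, keepdims=True)

# Log-ratios such as log(D/R) feed the log-ratio plots and the
# compositional statistics mentioned above.
log_dr = np.log(closed[:, 0] / closed[:, 1])
```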


2016
Vol 62 (8)
pp. 692-703
Author(s):  
Gregory B. Gloor
Gregor Reid

A workshop held at the 2015 annual meeting of the Canadian Society of Microbiologists highlighted compositional data analysis methods and the importance of exploratory data analysis for microbiome data sets generated by high-throughput DNA sequencing. A summary of the content of that workshop, a review of new methods of analysis, and information on the importance of careful analyses are presented herein. The workshop focussed on explaining the rationale behind the use of compositional data analysis and on demonstrating these methods on two microbiome data sets. A clear understanding of bioinformatics methodologies and of the type of data being analyzed is essential, given the growing number of studies uncovering the critical role of the microbiome in health and disease and the need to understand alterations to its composition and function following intervention with fecal transplant, probiotics, diet, and pharmaceutical agents.
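The workshop's own materials are not reproduced here, but the sketch below shows one common compositional treatment of sequencing counts (a hypothetical taxon table; the 0.5 pseudocount is just one of several ways to handle zero counts before taking logs):

```python
import numpy as np

# Hypothetical count table: rows = samples, columns = taxa.
counts = np.array([[120, 0, 35, 845],
                   [ 60, 4, 10, 926]], dtype=float)

# Replace zeros with a simple pseudocount so logarithms are defined.
pseudo = counts + 0.5

# Close the rows to proportions, then apply the centered logratio
# (CLR) transform; exploratory tools such as ordination or clustering
# can then operate on `clr`.
comp = pseudo / pseudo.sum(axis=1, keepdims=True)
clr = np.log(comp) - np.log(comp).mean(axis=1, keepdims=True)
```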


Author(s):  
Thomas P. Quinn
Ionas Erb

In the health sciences, many data sets produced by next-generation sequencing (NGS) contain only relative information because of biological and technical factors that limit the total number of nucleotides observed for a given sample. Because the components are mutually dependent, it is not possible to interpret any component in isolation, at least without normalization. The field of compositional data analysis (CoDA) has emerged with alternative methods for relative data based on log-ratio transforms. However, NGS data often contain many more features than samples, and thus require creative new ways to reduce the dimensionality of the data without sacrificing interpretability. The summation of parts, called amalgamation, is a practical way of reducing dimensionality, but can introduce a non-linear distortion to the data. We exploit this non-linearity to propose a powerful yet interpretable dimension reduction method. In this report, we present data-driven amalgamation as a new method and conceptual framework for reducing the dimensionality of compositional data. Unlike expert-driven amalgamation, which requires prior domain knowledge, our data-driven amalgamation method uses a genetic algorithm to answer the question, "What is the best way to amalgamate the data to achieve the user-defined objective?" We present a user-friendly R package, called amalgam, that can quickly find the optimal amalgamation to (a) preserve the distance between samples, or (b) classify samples as diseased or not. Our benchmark on 13 real data sets confirms that these amalgamations compete with state-of-the-art unsupervised and supervised dimension reduction methods in terms of performance, but result in new variables that are much easier to understand: they are groups of features added together.
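Amalgamation itself is just a grouped sum. The sketch below fixes the grouping by hand to show the operation; the amalgam package instead searches over groupings with a genetic algorithm, and its actual API is not shown here:

```python
import numpy as np

# Hypothetical 4-part compositions (rows sum to 1).
comp = np.array([[0.10, 0.25, 0.05, 0.60],
                 [0.30, 0.15, 0.35, 0.20]])

# Amalgamate parts {0, 1} and parts {2, 3} by summing within groups.
groups = [[0, 1], [2, 3]]
amalg = np.column_stack([comp[:, g].sum(axis=1) for g in groups])

# The result is again a composition, now with two directly
# interpretable parts; in logratio geometry the operation is
# non-linear, which the method above exploits.
assert np.allclose(amalg.sum(axis=1), 1.0)
```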


Author(s):  
Vera Pawlowsky-Glahn
Ricardo A. Olea

In Chapter 6 we introduce additional aspects of the geostatistical approach presented in the preceding chapters that were not necessary for its theoretical development, but that are essential for the practical application of the method to compositional data. We discuss how to treat zeros in compositional data sets; how to model the required cross-covariances; how to compute expected values and estimation variances for the original, constrained variables; and how to build and interpret confidence intervals for estimated values. As mentioned in Section 2.1, data sets with many zeros are as troublesome in compositional analysis as they are in standard multivariate analysis. In our approach, the additional restriction for compositional data is that zero values are not admissible for modeling. The justification for this restriction is arithmetic: a transformation that uses logarithms cannot be performed on zero values. This is the case for the logratio transformation that leads to the definition of the additive logistic normal distribution, as introduced by Aitchison (1986, p. 113). It is also the case for the additive logistic skew-normal distribution defined in Mateu-Figueras et al. (1998), following previous results by Azzalini and Dalla Valle (1996). The centered logratio transformation and the family of multivariate Box-Cox transformations discussed in Andrews et al. (1971), Rayens and Srinivasan (1991), and Barceló-Vidal (1996) likewise require that zero values be excluded. This restriction is certainly a wellspring of discussion, albeit surprisingly so, as nobody would complain about eliminating zeros, either by simple suppression of samples or by substitution with reasonable values, when dealing with a sample from a lognormal distribution in the univariate case. Recall that the logarithm of zero is undefined and the sample space of the lognormal distribution is the positive real line, excluding the origin. In order to present our position on how to deal with zeros as clearly as possible, let us assume that only one of our components has zeros in some of the samples. Those cases where more than one variable is affected can be analyzed by the methods described below.
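For the single zero-bearing component discussed above, one simple substitution strategy (a sketch, not the chapter's full treatment) replaces each zero with a small value delta and rescales the remaining parts so their mutual ratios are preserved:

```python
import numpy as np

# Hypothetical closed compositions; the first row has one zero.
comp = np.array([[0.0, 0.3, 0.7],
                 [0.1, 0.4, 0.5]])
delta = 0.005  # assumed small replacement value, e.g. a detection limit

# Multiplicative-style replacement: zeros become delta, non-zero parts
# are shrunk so each row still sums to 1 and their ratios are unchanged.
is_zero = comp == 0
n_zero = is_zero.sum(axis=1, keepdims=True)
replaced = np.where(is_zero, delta, comp * (1.0 - delta * n_zero))

assert np.allclose(replaced.sum(axis=1), 1.0)
```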


2018
Author(s):  
Ricardo A. Valls

Compositional data arise naturally in several branches of science, including geology. In geochemistry, for example, these constrained data typically occur when one normalizes raw data or when one obtains the output from a constrained estimation procedure: parts per one, percentages, ppm, ppb, molar concentrations, etc. Compositional data have proved difficult to handle statistically because of the awkward constraint that the components of each vector must sum to unity. This special property (the fact that the determinations on each specimen sum to a constant) means that the variables in a study occupy a constrained space defined by the simplex, a restricted part of real space. Pearson was the first to point out the dangers that may befall the analyst who attempts to interpret correlations between ratios whose numerators and denominators contain common parts. More recently, Aitchison, Pawlowsky-Glahn, S. Thió, and other statisticians have developed the concept of Compositional Data Analysis, pointing out the dangers of misinterpreting closed data treated with "normal" statistical methods. It is important for geochemists and geologists in general to be aware that the usual multivariate statistical techniques are not applicable to constrained data. It is also important for us to have access to appropriate techniques as they become available. This is the principal aim of this book. Using a hypothetical model of a copper mineralization associated with a felsic intrusive, with specific relationships between certain elements, I will show how "normal" correlation methods fail to identify some of those embedded relationships and how spurious correlations can arise. From there, I will test the same model after transforming the data using the CLR, ALR, and ILR transformations with the aid of the CoDaPack software. Since this publication is addressed to geologists and geoscientists in general, I have kept the mathematical formulae to a minimum and have not included any theoretical demonstration. The "mathematically curious geologist", if such a category exists, can find all of those in the recommended sources listed in the reference section. So let us start by introducing the model of mineralization that we will be testing.
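The closure effect the book tests can be reproduced in a few lines: independent positive variables acquire negative correlations once each sample is forced to sum to a constant. The data below are simulated, not the book's copper-mineralization model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Three independent, strictly positive "element concentrations".
raw = rng.lognormal(mean=0.0, sigma=0.5, size=(500, 3))

# Closure to a constant sum induces spurious negative correlation.
closed = raw / raw.sum(axis=1, keepdims=True)
print(np.corrcoef(raw[:, 0], raw[:, 1])[0, 1])      # near 0
print(np.corrcoef(closed[:, 0], closed[:, 1])[0, 1])  # clearly negative

# Additive logratio (ALR) with the last part as divisor; the CLR and
# ILR transforms mentioned above are alternatives with different
# geometric properties.
alr = np.log(closed[:, :2] / closed[:, 2:3])
```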

