scholarly journals Amalgams: data-driven amalgamation for the reference-free dimensionality reduction of zero-laden compositional data

Author(s):  
Thomas P. Quinn ◽  
Ionas Erb

AbstractIn the health sciences, many data sets produced by next-generation sequencing (NGS) only contain relative information because of biological and technical factors that limit the total number of nucleotides observed for a given sample. As mutually dependent elements, it is not possible to interpret any component in isolation, at least without normalization. The field of compositional data analysis (CoDA) has emerged with alternative methods for relative data based on log-ratio transforms. However, NGS data often contain many more features than samples, and thus require creative new ways to reduce the dimensionality of the data without sacrificing interpretability. The summation of parts, called amalgamation, is a practical way of reducing dimensionality, but can introduce a non-linear distortion to the data. We exploit this non-linearity to propose a powerful yet interpretable dimension reduction method. In this report, we present data-driven amalgamation as a new method and conceptual framework for reducing the dimensionality of compositional data. Unlike expert-driven amalgamation which requires prior domain knowledge, our data-driven amalgamation method uses a genetic algorithm to answer the question, “What is the best way to amalgamate the data to achieve the user-defined objective?”. We present a user-friendly R package, called amalgam, that can quickly find the optimal amalgamation to (a) preserve the distance between samples, or (b) classify samples as diseased or not. Our benchmark on 13 real data sets confirm that these amalgamations compete with the state-of-the-art unsupervised and supervised dimension reduction methods in terms of performance, but result in new variables that are much easier to understand: they are groups of features added together.

2020 ◽  
Vol 2 (4) ◽  
Author(s):  
Thomas P Quinn ◽  
Ionas Erb

Abstract Many next-generation sequencing datasets contain only relative information because of biological and technical factors that limit the total number of transcripts observed for a given sample. It is not possible to interpret any one component in isolation. The field of compositional data analysis has emerged with alternative methods for relative data based on log-ratio transforms. However, these data often contain many more features than samples, and thus require creative new ways to reduce the dimensionality of the data. The summation of parts, called amalgamation, is a practical way of reducing dimensionality, but can introduce a non-linear distortion to the data. We exploit this non-linearity to propose a powerful yet interpretable dimension method called data-driven amalgamation. Our new method, implemented in the user-friendly R package amalgam, can reduce the dimensionality of compositional data by finding amalgamations that optimally (i) preserve the distance between samples, or (ii) classify samples as diseased or not. Our benchmark on 13 real datasets confirm that these amalgamations compete with state-of-the-art methods in terms of performance, but result in new features that are easily understood: they are groups of parts added together.


2020 ◽  
Author(s):  
Luis P.V. Braga ◽  
Dina Feigenbaum

AbstractBackgroundCovid-19 cases data pose an enormous challenge to any analysis. The evaluation of such a global pandemic requires matching reports that follow different procedures and even overcoming some countries’ censorship that restricts publications.MethodsThis work proposes a methodology that could assist future studies. Compositional Data Analysis (CoDa) is proposed as the proper approach as Covid-19 cases data is compositional in nature. Under this methodology, for each country three attributes were selected: cumulative number of deaths (D); cumulative number of recovered patients(R); present number of patients (A).ResultsAfter the operation called closure, with c=1, a ternary diagram and Log-Ratio plots, as well as, compositional statistics are presented. Cluster analysis is then applied, splitting the countries into discrete groups.ConclusionsThis methodology can also be applied to other data sets such as countries, cities, provinces or districts in order to help authorities and governmental agencies to improve their actions to fight against a pandemic.


Author(s):  
Charlotte Lund Rasmussen ◽  
Javier Palarea-Albaladejo ◽  
Melker Staffan Johansson ◽  
Patrick Crowley ◽  
Matthew Leigh Stevens ◽  
...  

Abstract Background Researchers applying compositional data analysis to time-use data (e.g., time spent in physical behaviors) often face the problem of zeros, that is, recordings of zero time spent in any of the studied behaviors. Zeros hinder the application of compositional data analysis because the analysis is based on log-ratios. One way to overcome this challenge is to replace the zeros with sensible small values. The aim of this study was to compare the performance of three existing replacement methods used within physical behavior time-use epidemiology: simple replacement, multiplicative replacement, and log-ratio expectation-maximization (lrEM) algorithm. Moreover, we assessed the consequence of choosing replacement values higher than the lowest observed value for a given behavior. Method Using a complete dataset based on accelerometer data from 1310 Danish adults as reference, multiple datasets were simulated across six scenarios of zeros (5–30% zeros in 5% increments). Moreover, four examples were produced based on real data, in which, 10 and 20% zeros were imposed and replaced using a replacement value of 0.5 min, 65% of the observation threshold, or an estimated value below the observation threshold. For the simulation study and the examples, the zeros were replaced using the three replacement methods and the degree of distortion introduced was assessed by comparison with the complete dataset. Results The lrEM method outperformed the other replacement methods as it had the smallest influence on the structure of relative variation of the datasets. Both the simple and multiplicative replacements introduced higher distortion, particularly in scenarios with more than 10% zeros; although the latter, like the lrEM, does preserve the ratios between behaviors with no zeros. The examples revealed that replacing zeros with a value higher than the observation threshold severely affected the structure of relative variation. Conclusions Given our findings, we encourage the use of replacement methods that preserve the relative structure of physical behavior data, as achieved by the multiplicative and lrEM replacements, and to avoid simple replacement. Moreover, we do not recommend replacing zeros with values higher than the lowest observed value for a behavior.


2021 ◽  
Vol 8 (1) ◽  
pp. 271-299
Author(s):  
Michael Greenacre

Compositional data are nonnegative data carrying relative, rather than absolute, information—these are often data with a constant-sum constraint on the sample values, for example, proportions or percentages summing to 1% or 100%, respectively. Ratios between components of a composition are important since they are unaffected by the particular set of components chosen. Logarithms of ratios (logratios) are the fundamental transformation in the ratio approach to compositional data analysis—all data thus need to be strictly positive, so that zero values present a major problem. Components that group together based on domain knowledge can be amalgamated (i.e., summed) to create new components, and this can alleviate the problem of data zeros. Once compositional data are transformed to logratios, regular univariate and multivariate statistical analysis can be performed, such as dimension reduction and clustering, as well as modeling. Alternative methodologies that come close to the ideals of the logratio approach are also considered, especially those that avoid the problem of data zeros, which is particularly acute in large bioinformatic data sets.


Biometrika ◽  
2021 ◽  
Author(s):  
Pixu Shi ◽  
Yuchen Zhou ◽  
Anru R Zhang

Abstract In microbiome and genomic studies, the regression of compositional data has been a crucial tool for identifying microbial taxa or genes that are associated with clinical phenotypes. To account for the variation in sequencing depth, the classic log-contrast model is often used where read counts are normalized into compositions. However, zero read counts and the randomness in covariates remain critical issues. In this article, we introduce a surprisingly simple, interpretable, and efficient method for the estimation of compositional data regression through the lens of a novel high-dimensional log-error-in-variable regression model. The proposed method provides both corrections on sequencing data with possible overdispersion and simultaneously avoids any subjective imputation of zero read counts. We provide theoretical justifications with matching upper and lower bounds for the estimation error. The merit of the procedure is illustrated through real data analysis and simulation studies.


2020 ◽  
Vol 2 (2) ◽  
Author(s):  
Antoni Susin ◽  
Yiwen Wang ◽  
Kim-Anh Lê Cao ◽  
M Luz Calle

Abstract Though variable selection is one of the most relevant tasks in microbiome analysis, e.g. for the identification of microbial signatures, many studies still rely on methods that ignore the compositional nature of microbiome data. The applicability of compositional data analysis methods has been hampered by the availability of software and the difficulty in interpreting their results. This work is focused on three methods for variable selection that acknowledge the compositional structure of microbiome data: selbal, a forward selection approach for the identification of compositional balances, and clr-lasso and coda-lasso, two penalized regression models for compositional data analysis. This study highlights the link between these methods and brings out some limitations of the centered log-ratio transformation for variable selection. In particular, the fact that it is not subcompositionally consistent makes the microbial signatures obtained from clr-lasso not readily transferable. Coda-lasso is computationally efficient and suitable when the focus is the identification of the most associated microbial taxa. Selbal stands out when the goal is to obtain a parsimonious model with optimal prediction performance, but it is computationally greedy. We provide a reproducible vignette for the application of these methods that will enable researchers to fully leverage their potential in microbiome studies.


2020 ◽  
Vol 2 (2) ◽  
Author(s):  
David R Lovell ◽  
Xin-Yi Chua ◽  
Annette McGrath

Abstract Thanks to sequencing technology, modern molecular bioscience datasets are often compositions of counts, e.g. counts of amplicons, mRNAs, etc. While there is growing appreciation that compositional data need special analysis and interpretation, less well understood is the discrete nature of these count compositions (or, as we call them, lattice compositions) and the impact this has on statistical analysis, particularly log-ratio analysis (LRA) of pairwise association. While LRA methods are scale-invariant, count compositional data are not; consequently, the conclusions we draw from LRA of lattice compositions depend on the scale of counts involved. We know that additive variation affects the relative abundance of small counts more than large counts; here we show that additive (quantization) variation comes from the discrete nature of count data itself, as well as (biological) variation in the system under study and (technical) variation from measurement and analysis processes. Variation due to quantization is inevitable, but its impact on conclusions depends on the underlying scale and distribution of counts. We illustrate the different distributions of real molecular bioscience data from different experimental settings to show why it is vital to understand the distributional characteristics of count data before applying and drawing conclusions from compositional data analysis methods.


2016 ◽  
Vol 54 (1) ◽  
pp. 143-151 ◽  
Author(s):  
Enrique García Ordóñez ◽  
María del Carmen Iglesias Pérez ◽  
Carlos Touriño González

AbstractThe aim of the present study was to identify groups of offensive performance indicators which best discriminated between a match score (favourable, balanced or unfavourable) in water polo. The sample comprised 88 regular season games (2011-2014) from the Spanish Professional Water Polo League. The offensive performance indicators were clustered in five groups: Attacks in relation to the different playing situations; Shots in relation to the different playing situations; Attacks outcome; Origin of shots; Technical execution of shots. The variables of each group had a constant sum which equalled 100%. The data were compositional data, therefore the variables were changed by means of the additive log-ratio (alr) transformation. Multivariate discriminant analyses to compare the match scores were calculated using the transformed variables. With regard to the percentage of right classification, the results showed the group that discriminated the most between the match scores was “Attacks outcome” (60.4% for the original sample and 52.2% for cross-validation). The performance indicators that discriminated the most between the match scores in games with penalties were goals (structure coefficient (SC) = .761), counterattack shots (SC = .541) and counterattacks (SC = .481). In matches without penalties, goals were the primary discriminating factor (SC = .576). This approach provides a new tool to compare the importance of the offensive performance groups and their effect on the match score discrimination.


Mathematics ◽  
2021 ◽  
Vol 9 (19) ◽  
pp. 2477
Author(s):  
Seitebaleng Makgai ◽  
Andriette Bekker ◽  
Mohammad Arashi

The Dirichlet distribution is a well-known candidate in modeling compositional data sets. However, in the presence of outliers, the Dirichlet distribution fails to model such data sets, making other model extensions necessary. In this paper, the Kummer–Dirichlet distribution and the gamma distribution are coupled, using the beta-generating technique. This development results in the proposal of the Kummer–Dirichlet gamma distribution, which presents greater flexibility in modeling compositional data sets. Some general properties, such as the probability density functions and the moments are presented for this new candidate. The method of maximum likelihood is applied in the estimation of the parameters. The usefulness of this model is demonstrated through the application of synthetic and real data sets, where outliers are present.


GigaScience ◽  
2019 ◽  
Vol 8 (9) ◽  
Author(s):  
Thomas P Quinn ◽  
Ionas Erb ◽  
Greg Gloor ◽  
Cedric Notredame ◽  
Mark F Richardson ◽  
...  

Abstract Background Next-generation sequencing (NGS) has made it possible to determine the sequence and relative abundance of all nucleotides in a biological or environmental sample. A cornerstone of NGS is the quantification of RNA or DNA presence as counts. However, these counts are not counts per se: their magnitude is determined arbitrarily by the sequencing depth, not by the input material. Consequently, counts must undergo normalization prior to use. Conventional normalization methods require a set of assumptions: they assume that the majority of features are unchanged and that all environments under study have the same carrying capacity for nucleotide synthesis. These assumptions are often untestable and may not hold when heterogeneous samples are compared. Results Methods developed within the field of compositional data analysis offer a general solution that is assumption-free and valid for all data. Herein, we synthesize the extant literature to provide a concise guide on how to apply compositional data analysis to NGS count data. Conclusions In highlighting the limitations of total library size, effective library size, and spike-in normalizations, we propose the log-ratio transformation as a general solution to answer the question, “Relative to some important activity of the cell, what is changing?”


Sign in / Sign up

Export Citation Format

Share Document