Another look at the constant sum problem in geochemistry

1992 ◽  
Vol 56 (385) ◽  
pp. 469-475 ◽  
Author(s):  
H. R. Rollinson

Abstract Compositional data—that is, data where concentrations are expressed as proportions of a whole, such as percentages or parts per million—have a number of peculiar mathematical properties which make standard statistical tests unworkable. In particular, correlation analysis can produce geologically meaningless results. Aitchison (1986) proposed a log-ratio transformation of compositional data which allows inter-element relationships to be investigated. This method was applied to two sets of geochemical data—basalts from Kilauea Iki lava lake and granitic gneisses from the Limpopo Belt—and geologically 'sensible' results were obtained. Geochemists are encouraged to adopt the Aitchison method of data analysis in preference to the traditional but invalid approach which uses the raw compositional data.
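As a minimal illustration of the log-ratio idea (a Python sketch, not Aitchison's or Rollinson's own code; the composition below is hypothetical), the centred log-ratio (clr) member of Aitchison's transformation family divides each part by the sample's geometric mean before taking logs:

```python
import numpy as np

def clr(x):
    """Centred log-ratio transform of one composition.

    clr(x)_i = ln(x_i / g(x)), with g(x) the geometric mean of the parts.
    Statistics computed on clr scores avoid the spurious negative
    correlations that the constant-sum constraint forces on raw percentages.
    """
    logx = np.log(np.asarray(x, dtype=float))
    return logx - logx.mean()

# Hypothetical major-element analysis (wt.%), for illustration only.
basalt = np.array([49.5, 13.2, 11.8, 9.7, 10.1])  # SiO2, Al2O3, FeO, MgO, CaO
print(clr(basalt))  # scores sum to zero by construction
```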

2021 ◽  
Author(s):  
Pertti Sarala ◽  
Solveig Pospiech ◽  
Maarit Middleton ◽  
Anne Taivalkoski ◽  
Helena Hulkki ◽  
...  

Vulnerable nature in northernmost Europe requires the development of new, environmentally friendly sampling and analysis techniques for mineral exploration. These areas are typically covered by transported glaciogenic sediments in which glacial till is dominant. To offer an alternative to conventional basal till and bedrock sampling with heavy machines, quick and cost-effective surface geochemical sampling media and techniques have been actively applied during the last decade. In particular, the development of selective and weak leach techniques for the geochemistry of the upper soil (Ah and B) horizons has been intensive, but their reliability needs to be improved and testing is required in different glaciogenic environments.

In this research, carried out under the project New Exploration Technologies (NEXT), funded by the European Union's Horizon 2020 research and innovation programme under grant agreement No 776804, we used a stratified random sampling strategy for choosing sampling locations and developed novel compositional statistical data analysis for the interpretation of geochemical data obtained by surface geochemical techniques. The test area is located in the Rajapalot area, Ylitornio, northern Finland, where an active Au-Co exploration project is carried out by Mawson Oy. The thickness of the till cover varies from a few metres up to 5 m, and the glacial morphology is composed of ribbed moraine ridges with peatlands in between. The sampling network for the Ah and B horizon samples comprised 89 routine samples and 10 field replicates collected from mineral podzol-type soils. The chemical analysis methods were the Ultratrace 1:1:1 Aqua Regia leach and a 0.1 M sodium pyrophosphate leach for the Ah horizon samples, and the Ionic leach and Super Trace Aqua Regia leach methods for the B horizon samples. The laboratory analyses were supported by portable X-ray fluorescence (pXRF) analyses done directly in the field. The statistical analysis was based on log-ratio transformations of the geochemical compositions to avoid spurious results. In addition, response ratios were calculated to measure the degree of enrichment in each element per sample.

The preliminary results of the soil geochemistry show a significant response for many elements (e.g. Au, Co, Cu, Mo, Sc, Te and W) over known mineralized bedrock targets observed in the drill core data. The elemental distributions also reflect the lithological variations of the rock units in the bedrock. Based on the results, it is evident that (a) there is good or moderate correlation between the surface geochemical data and the underlying bedrock for several elements, and (b) soil analysis using an appropriate sampling procedure and selective extraction is an effective, environmentally friendly geochemical exploration technique in glaciated terrains.
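One plausible reading of the response-ratio step, sketched in Python: divide each sample's concentration by a robust per-element background estimate (the survey median is assumed here; the abstract does not specify the project's formula), so values well above 1 flag enrichment.

```python
import numpy as np

def response_ratios(X, background=None):
    """Per-sample enrichment: concentration divided by an element background.

    X is an (n_samples, n_elements) concentration table. The survey median
    serves as the background by default (an assumption for this sketch).
    """
    X = np.asarray(X, dtype=float)
    if background is None:
        background = np.median(X, axis=0)
    return X / background

# Hypothetical Ah-horizon concentrations (ppb): 3 samples x 3 elements.
X = np.array([[1.2,  8.0, 15.0],    # background sample
              [0.4,  6.5, 12.0],    # background sample
              [5.6, 30.0, 90.0]])   # sample over a mineralized target
print(response_ratios(X))           # third row stands out for all elements
```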


2020 ◽  
Vol 10 (2) ◽  
pp. 76-94
Author(s):  
M. A. Thomas

In the early 1900s, physics was the archetypical science and measurement was equated with mathematization to real numbers. To enable the use of mathematics to draw empirical conclusions about psychological data, which was often ordinal, Stevens redefined measurement as “the assignment of numerals to objects and events according to a rule.” He defined four scales of measurement (nominal, ordinal, interval, and ratio) and set out criteria for the permissible statistical tests to be used with each. Stevens' scales of measurement are still widely used in data analysis in the social sciences. They were revolutionary but flawed, leading to ongoing debate about the permissibility of the use of different statistical tests on different scales of data. Stevens implicitly assumed measurement involved mapping to real numbers. Rather than rely on Stevens' scales, researchers should demonstrate the mathematical properties of their data and map to analogous number sets, making claims regarding mathematization explicit, defending them with evidence, and using only those operations that are defined for that set.


2020 ◽  
Vol 2 (2) ◽  
Author(s):  
Antoni Susin ◽  
Yiwen Wang ◽  
Kim-Anh Lê Cao ◽  
M Luz Calle

Abstract Though variable selection is one of the most relevant tasks in microbiome analysis, e.g. for the identification of microbial signatures, many studies still rely on methods that ignore the compositional nature of microbiome data. The applicability of compositional data analysis methods has been hampered by the limited availability of software and the difficulty in interpreting their results. This work is focused on three methods for variable selection that acknowledge the compositional structure of microbiome data: selbal, a forward selection approach for the identification of compositional balances, and clr-lasso and coda-lasso, two penalized regression models for compositional data analysis. This study highlights the link between these methods and brings out some limitations of the centered log-ratio transformation for variable selection. In particular, the fact that it is not subcompositionally consistent makes the microbial signatures obtained from clr-lasso not readily transferable. Coda-lasso is computationally efficient and suitable when the focus is the identification of the most associated microbial taxa. Selbal stands out when the goal is to obtain a parsimonious model with optimal prediction performance, but it is computationally greedy. We provide a reproducible vignette for the application of these methods that will enable researchers to fully leverage their potential in microbiome studies.
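A compact sketch of the clr-lasso idea the abstract describes (clr-transform the counts, then fit an L1-penalized model); it uses scikit-learn with toy data and a simple pseudocount for zeros, and is not the authors' packaged implementation:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def clr(counts, pseudocount=0.5):
    """clr transform after a simple pseudocount replacement of zeros."""
    x = np.asarray(counts, dtype=float) + pseudocount
    logx = np.log(x)
    return logx - logx.mean(axis=1, keepdims=True)

rng = np.random.default_rng(0)
counts = rng.poisson(20, size=(40, 25))   # toy taxa-count table
y = rng.integers(0, 2, size=40)           # toy case/control labels

# The L1 penalty shrinks most clr coefficients to exactly zero,
# leaving a sparse microbial signature.
model = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
model.fit(clr(counts), y)
print("selected taxa:", np.flatnonzero(model.coef_[0]))
```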


2020 ◽  
Author(s):  
Luis P.V. Braga ◽  
Dina Feigenbaum

Abstract Background: Covid-19 case data pose an enormous challenge to any analysis. The evaluation of such a global pandemic requires matching reports that follow different procedures and even overcoming some countries' censorship that restricts publications. Methods: This work proposes a methodology that could assist future studies. Compositional Data Analysis (CoDa) is proposed as the proper approach, as Covid-19 case data are compositional in nature. Under this methodology, three attributes were selected for each country: cumulative number of deaths (D); cumulative number of recovered patients (R); present number of patients (A). Results: After the operation called closure, with c = 1, a ternary diagram and log-ratio plots, as well as compositional statistics, are presented. Cluster analysis is then applied, splitting the countries into discrete groups. Conclusions: This methodology can also be applied to other data sets, such as cities, provinces or districts, in order to help authorities and governmental agencies improve their actions against a pandemic.
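A minimal sketch of the closure step mentioned above: each country's (D, R, A) triple is rescaled to sum to c = 1, which places it on the ternary diagram. The counts are invented for illustration.

```python
import numpy as np

def closure(X, c=1.0):
    """Rescale each row to sum to the constant c (the closure operation)."""
    X = np.asarray(X, dtype=float)
    return c * X / X.sum(axis=1, keepdims=True)

# Hypothetical per-country counts: deaths (D), recovered (R), active (A).
cases = np.array([[1000, 20000, 4000],
                  [  50,  3000, 1500]])
comp = closure(cases)          # each row now sums to 1
print(comp)                    # coordinates for a ternary diagram
```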


GigaScience ◽  
2019 ◽  
Vol 8 (9) ◽  
Author(s):  
Thomas P Quinn ◽  
Ionas Erb ◽  
Greg Gloor ◽  
Cedric Notredame ◽  
Mark F Richardson ◽  
...  

Abstract Background: Next-generation sequencing (NGS) has made it possible to determine the sequence and relative abundance of all nucleotides in a biological or environmental sample. A cornerstone of NGS is the quantification of RNA or DNA presence as counts. However, these counts are not counts per se: their magnitude is determined arbitrarily by the sequencing depth, not by the input material. Consequently, counts must undergo normalization prior to use. Conventional normalization methods require a set of assumptions: they assume that the majority of features are unchanged and that all environments under study have the same carrying capacity for nucleotide synthesis. These assumptions are often untestable and may not hold when heterogeneous samples are compared. Results: Methods developed within the field of compositional data analysis offer a general solution that is assumption-free and valid for all data. Herein, we synthesize the extant literature to provide a concise guide on how to apply compositional data analysis to NGS count data. Conclusions: In highlighting the limitations of total library size, effective library size, and spike-in normalizations, we propose the log-ratio transformation as a general solution to answer the question, “Relative to some important activity of the cell, what is changing?”
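A sketch contrasting library-size scaling with the log-ratio alternative the authors advocate, on toy counts (the pseudocount for zeros is an assumption of this sketch):

```python
import numpy as np

rng = np.random.default_rng(1)
counts = rng.poisson([50, 200, 5, 30], size=(6, 4))  # toy feature counts

# Total-library-size normalization: assumes all samples share the same
# carrying capacity for nucleotide synthesis.
cpm = counts / counts.sum(axis=1, keepdims=True) * 1e6

# clr transform: expresses each feature relative to the sample's
# geometric mean, asking "what changes relative to the whole?"
x = counts + 0.5                                  # simple zero offset
logx = np.log(x)
clr_scores = logx - logx.mean(axis=1, keepdims=True)
print(clr_scores)
```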


2013 ◽  
Vol 1 (1) ◽  
pp. 1 ◽  
Author(s):  
Kostalena Michelaki ◽  
Michael J. Hughes ◽  
Ronald G.V. Hancock

Since the 1970s, archaeologists have increasingly depended on archaeometric rather than strictly stylistic data to explore questions of ceramic provenance and technology, and, by extension, trade, exchange, social networks and even identity. It is accepted as obvious by some archaeometrists and statisticians that the results of the analyses of compositional data may be dependent on the format of the data used, on the data exploration method employed and, in the case of multivariate analyses, even on the number of elements considered. However, this is rarely articulated clearly in publications, making it less obvious to archaeologists. In this short paper, we re-examine compositional data from a collection of bricks, tiles and ceramics from Hill Hall, near Epping in Essex, England, as a case study to show how the method of data exploration used and the number of elements considered in multivariate analyses of compositional data can affect the sorting of ceramic samples into chemical groups. We compare bivariate data splitting (BDS) with principal component analysis (PCA) and centered log-ratio principal component analysis (CLR-PCA) of different unstandardized data formats (original concentration data and logarithmically transformed, i.e. log10, data), using different numbers of elements. We confirm that PCA, in its various forms, is quite sensitive to the numbers and types of elements used in data analysis.
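A sketch of the CLR-PCA pipeline compared in the paper, using scikit-learn and invented element concentrations; dropping or adding columns before refitting reproduces the sensitivity the authors describe:

```python
import numpy as np
from sklearn.decomposition import PCA

def clr(X):
    logx = np.log(np.asarray(X, dtype=float))
    return logx - logx.mean(axis=1, keepdims=True)

rng = np.random.default_rng(2)
ppm = rng.lognormal(mean=3.0, sigma=0.5, size=(30, 8))  # toy element table

scores_pca = PCA(n_components=2).fit_transform(ppm)        # plain PCA
scores_clr = PCA(n_components=2).fit_transform(clr(ppm))   # CLR-PCA
# Chemical groupings read off these two score plots need not agree,
# and both can shift when the element set changes.
```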


2020 ◽  
Vol 2 (4) ◽  
Author(s):  
Laura Sisk-Hackworth ◽  
Scott T Kelley

Abstract Compositional data analysis (CoDA) methods have increased in popularity as a new framework for analyzing next-generation sequencing (NGS) data. CoDA methods, such as the centered log-ratio (clr) transformation, adjust for the compositional nature of NGS counts, which is not addressed by traditional normalization methods. CoDA has only been sparsely applied to NGS data generated from microbial communities or to multiple ‘omics’ datasets. In this study, we applied CoDA methods to analyze NGS and untargeted metabolomic datasets obtained from bacterial and fungal communities. Specifically, we used clr transformation to reanalyze NGS amplicon and metabolomics data from a study investigating the effects of building material type, moisture and time on microbial and metabolomic diversity. Compared to analysis of untransformed data, analysis of clr-transformed data revealed novel relationships and stronger associations between sample conditions and microbial and metabolic community profiles.
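The reanalysis pattern described above reduces to applying the same clr transform to both data types before computing associations; a toy sketch (the pseudocount and data are assumptions):

```python
import numpy as np

def clr(X, pseudocount=0.5):
    x = np.asarray(X, dtype=float) + pseudocount
    logx = np.log(x)
    return logx - logx.mean(axis=1, keepdims=True)

rng = np.random.default_rng(3)
amplicon    = rng.poisson(15, size=(12, 40))   # toy amplicon count table
metabolites = rng.lognormal(size=(12, 60))     # toy untargeted intensities

A, M = clr(amplicon), clr(metabolites)
# Downstream tests (ordination, correlation with material type or
# moisture) are run on A and M rather than on the raw constrained values.
```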


2018 ◽  
Author(s):  
Thomas P. Quinn ◽  
Ionas Erb ◽  
Greg Gloor ◽  
Cedric Notredame ◽  
Mark F. Richardson ◽  
...  

Abstract Next-generation sequencing (NGS) has made it possible to determine the sequence and relative abundance of all nucleotides in a biological or environmental sample. Today, NGS is routinely used to understand many important topics in biology from human disease to microorganism diversity. A cornerstone of NGS is the quantification of RNA or DNA presence as counts. However, these counts are not counts per se: the magnitude of the counts is determined arbitrarily by the sequencing depth, not by the input material. Consequently, counts must undergo normalization prior to use. Conventional normalization methods require a set of assumptions: they assume that the majority of features are unchanged, and that all environments under study have the same carrying capacity for nucleotide synthesis. These assumptions are often untestable and may not hold when comparing heterogeneous samples (e.g., samples collected across distinct cancers or tissues). Instead, methods developed within the field of compositional data analysis offer a general solution that is assumption-free and valid for all data. In this manuscript, we synthesize the extant literature to provide a concise guide on how to apply compositional data analysis to NGS count data. In doing so, we review zero replacement, differential abundance analysis, and within-group and between-group coordination analysis. We then discuss how this pipeline can accommodate complex study design, facilitate the analysis of vertically and horizontally integrated data, including multiomics data, and further extend to single-cell sequencing data. In highlighting the limitations of total library size, effective library size, and spike-in normalizations, we propose the log-ratio transformation as a general solution to answer the question, “Relative to some important activity of the cell, what is changing?”. Taken together, this manuscript establishes the first fully comprehensive analysis protocol that is suitable for any and all -omics data.


2021 ◽  
Author(s):  
Solveig Pospiech ◽  
Anne Taivalkoski ◽  
Yann Lahaye ◽  
Pertti Sarala ◽  
Janne Kinnunen ◽  
...  

Modern mineral exploration is required to be conducted in a sustainable, environmentally friendly and socially acceptable way. Especially for geochemical exploration in ecologically sensitive areas this poses a challenge, because any heavy machinery or invasive method might cause long-lasting damage to nature. One way of reducing the impact of mineral exploration on the environment during the early stages of exploration is to use surface sampling media, such as upper soil horizons, water, plants and, at high latitudes, also snow. Of these options, snow has several advantages: sampling and analysing snow is fast and low-cost, it has no impact on the environment, and in wintertime it is ubiquitous and available independently of the ecosystem.

In the “New Exploration Technologies (NEXT)” project*, snow samples were collected in March-April 2019 to evaluate the use of snow as a sampling material for mineral exploration. The test site was the Rajapalot Au-Co prospect in northern Finland, located 60 km west of Rovaniemi and operated by Mawson Oy. A stratified random sampling strategy was applied to place the sampling stations on the test site. The sampling comprised 94 snow samples and 12 field replicates. The samples were analysed at the GTK Research Laboratory using a Nu AttoM single-collector inductively coupled plasma mass spectrometer (SC-ICP-MS), which returned analytical results for 52 elements at the ppt level. After applying quality control to the data, the elements Ba, Ca, Cd, Cr, Cs, Ga, Li, Mg, Rb, Sr, Tl and V showed good quality and were used in the final data analysis.

Geochemical data from drill cores were used to train a model to predict bedrock geochemistry based on the 12 available element concentrations from the snow analysis. Prior to applying statistical methods, all geochemical data were transformed to log-ratio scores in order to ensure that results are independent of the selection of elements and to avoid spurious correlations (compositional data approach). Results show that snow data provide reasonable predictions of bedrock geochemistry for elements such as Ca, Cr, Li and Mg, but also for elements not included in the snow data, such as Mn and Na. This suggests that snow can serve as a lithogeochemical mapping tool for potential geological domains. For the ore-related elements Au, Ag, Co and U, the model provided predictions with higher uncertainty. Yet the pattern of the predicted values of the ore-related elements shows that snow can also be used to delineate prospective areas for continuing exploration with more sensitive methods.

*) This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 776804.
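A sketch of the modelling step outlined above, with toy data: both snow and drill-core compositions are clr-transformed, then a regression links them (the project's actual model is not specified in the abstract, so plain linear regression is assumed here):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def clr(X):
    logx = np.log(np.asarray(X, dtype=float))
    return logx - logx.mean(axis=1, keepdims=True)

rng = np.random.default_rng(4)
snow    = rng.lognormal(size=(94, 12))  # 94 stations x 12 snow elements
bedrock = rng.lognormal(size=(94, 6))   # toy matched bedrock compositions

# Fitting on clr scores keeps the model independent of which elements
# happen to be included (the compositional-data rationale above).
model = LinearRegression().fit(clr(snow), clr(bedrock))
pred = model.predict(clr(snow))         # predicted bedrock clr scores
```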


Author(s):  
Charlotte Lund Rasmussen ◽  
Javier Palarea-Albaladejo ◽  
Melker Staffan Johansson ◽  
Patrick Crowley ◽  
Matthew Leigh Stevens ◽  
...  

Abstract Background: Researchers applying compositional data analysis to time-use data (e.g., time spent in physical behaviors) often face the problem of zeros, that is, recordings of zero time spent in any of the studied behaviors. Zeros hinder the application of compositional data analysis because the analysis is based on log-ratios. One way to overcome this challenge is to replace the zeros with sensible small values. The aim of this study was to compare the performance of three existing replacement methods used within physical behavior time-use epidemiology: simple replacement, multiplicative replacement, and the log-ratio expectation-maximization (lrEM) algorithm. Moreover, we assessed the consequence of choosing replacement values higher than the lowest observed value for a given behavior. Method: Using a complete dataset based on accelerometer data from 1310 Danish adults as reference, multiple datasets were simulated across six scenarios of zeros (5–30% zeros in 5% increments). Moreover, four examples were produced based on real data, in which 10% and 20% zeros were imposed and replaced using a replacement value of 0.5 min, 65% of the observation threshold, or an estimated value below the observation threshold. For the simulation study and the examples, the zeros were replaced using the three replacement methods, and the degree of distortion introduced was assessed by comparison with the complete dataset. Results: The lrEM method outperformed the other replacement methods, as it had the smallest influence on the structure of relative variation of the datasets. Both the simple and multiplicative replacements introduced higher distortion, particularly in scenarios with more than 10% zeros; although the latter, like the lrEM, does preserve the ratios between behaviors with no zeros. The examples revealed that replacing zeros with a value higher than the observation threshold severely affected the structure of relative variation. Conclusions: Given our findings, we encourage the use of replacement methods that preserve the relative structure of physical behavior data, as achieved by the multiplicative and lrEM replacements, and to avoid simple replacement. Moreover, we do not recommend replacing zeros with values higher than the lowest observed value for a behavior.
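A sketch of the two simpler strategies compared in the study (the lrEM algorithm needs an iterative EM fit and is omitted); the minutes and the replacement value delta are invented:

```python
import numpy as np

def simple_replace(X, delta):
    """Set zeros to delta, leaving other parts untouched: ratios among
    the non-zero behaviors are distorted."""
    X = np.asarray(X, dtype=float).copy()
    X[X == 0] = delta
    return X

def multiplicative_replace(X, delta):
    """Set zeros to delta and shrink the non-zero parts so each row keeps
    its original total: ratios among non-zero behaviors are preserved."""
    X = np.asarray(X, dtype=float).copy()
    for row in X:
        zero = row == 0
        total = row.sum()
        row[~zero] *= (total - delta * zero.sum()) / total
        row[zero] = delta
    return X

# One day of minutes in four behaviors, with one recorded zero.
day = np.array([[480.0, 0.0, 600.0, 360.0]])
print(simple_replace(day, 0.5))          # sums to 1440.5
print(multiplicative_replace(day, 0.5))  # still sums to 1440
```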

