Preprocessing and Pretreatment of Metabolomics Data for Statistical Analysis

Author(s): Ibrahim Karaman

Metabolomics, 2015, Vol 11 (6), pp. 1492-1513
Author(s): Sheng Ren, Anna A. Hinzman, Emily L. Kang, Rhonda D. Szczesniak, Long Jason Lu

Metabolites, 2018, Vol 8 (3), pp. 47
Author(s): Helena Zacharias, Michael Altenbuchinger, Wolfram Gronwald

In this review, we summarize established and recent bioinformatic and statistical methods for the analysis of NMR-based metabolomics. Data analysis of NMR metabolic fingerprints exhibits several challenges, including unwanted biases, high dimensionality, and typically low sample numbers. Common analysis tasks comprise the identification of differential metabolites and the classification of specimens. However, analysis results strongly depend on the preprocessing of the data, and there is no consensus yet on how to remove unwanted biases and experimental variance prior to statistical analysis. Here, we first review established and new preprocessing protocols and illustrate their pros and cons, including different data normalizations and transformations. Second, we give a brief overview of state-of-the-art statistical analysis in NMR-based metabolomics. Finally, we discuss a recent development in statistical data analysis, where data normalization becomes obsolete. This method, called zero-sum regression, builds metabolite signatures whose estimation as well as predictions are independent of prior normalization.
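The normalization independence claimed for zero-sum regression can be illustrated with a small sketch. It assumes the model is linear in log-transformed intensities and that a missing normalization acts as one multiplicative factor per sample; the coefficients, intercept, and intensity values below are hypothetical and are not taken from the review. Only the prediction step is shown, not the estimation of the coefficients.

```python
import math

def predict(intercept, coefs, intensities):
    """Linear model evaluated on log2-transformed intensities."""
    return intercept + sum(b * math.log2(x) for b, x in zip(coefs, intensities))

# Hypothetical coefficient vectors: the first sums to zero, the second does not.
zero_sum_coefs = [0.75, -0.5, -0.25]
free_coefs = [0.75, -0.5, 0.25]

# Hypothetical spectrum, and the same spectrum scaled by a sample-wide factor
# (for example a dilution that no normalization step has corrected).
sample = [120.0, 45.0, 300.0]
diluted = [0.5 * x for x in sample]

# Change in the prediction caused purely by the scaling factor.
shift_zero_sum = predict(1.0, zero_sum_coefs, diluted) - predict(1.0, zero_sum_coefs, sample)
shift_free = predict(1.0, free_coefs, diluted) - predict(1.0, free_coefs, sample)

print("zero-sum shift:", shift_zero_sum)
print("unconstrained shift:", shift_free)
```

Scaling a sample by a factor c adds log2(c) to every log-feature, so the prediction changes by log2(c) times the sum of the coefficients. That product vanishes (up to floating-point rounding) when the coefficients sum to zero, while the unconstrained model shifts with the dilution.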


2019
Author(s): Paola G. Ferrario

Abstract: In metabolomics, the investigation of an association between many metabolites and one trait (such as age in humans or cultivar in foods) is a central research question. On this topic, we present a complete statistical analysis that combines selected R packages in a new workflow, which we share in full in line with modern standards and research reproducibility requirements. We demonstrate the workflow using a large-scale study with public data available in repositories. Hence, the workflow can be directly reused on quite different metabolomics data when searching for an association with one covariate of interest.


2021
Author(s): Miao Yu, Georgia Dolios, Lauren Petrick

Unknown features in untargeted metabolomics and non-targeted analysis (NTA) are identified using fragment ions from MS/MS spectra to predict the structures of the unknown compounds. Precursor ions for fragmentation are commonly selected using data-dependent acquisition (DDA) strategies or, following statistical analysis, using targeted MS/MS approaches. However, the precursor ions selected by DDA cover only a biased subset of the peaks or features found in full scan data. In addition, different statistical analyses can select different precursor ions for MS/MS analysis, which makes post hoc validation of ions selected by new statistical methods impossible when only the precursor ions selected by the original statistical method were fragmented. Here we propose an automated, exhaustive, statistical model-free workflow, paired mass distance-dependent analysis (PMDDA), for untargeted mass spectrometry identification of unknown compounds. By removing redundant peaks and performing pseudo-targeted MS/MS analysis on independent peaks, we can comprehensively cover unknown compounds found in full scan analysis using a “one peak for one compound” workflow without a priori redundant peak information. We show that, compared to DDA, PMDDA is more comprehensive and more robust against sample matrix effects. Further, more compounds were identified by database annotation using PMDDA than with CAMERA and RAMClustR. Finally, compounds with signals in both positive and negative modes can be identified by the PMDDA workflow to further reduce redundancies. The whole workflow is fully reproducible as a Docker image, xcmsrocker, with both the original data and the data processing template.


2019, Vol 20 (1)
Author(s): Marietta Kokla, Jyrki Virtanen, Marjukka Kolehmainen, Jussi Paananen, Kati Hanhineva

Abstract

Background: LC-MS technology makes it possible to measure the relative abundance of numerous molecular features of a sample in a single analysis. However, non-targeted metabolite profiling approaches in particular generate vast arrays of data that are prone to aberrations such as missing values. Whatever the reason for the missing values, a coherent and complete data matrix is always a prerequisite for accurate and reliable statistical analysis. Therefore, there is a need for proper imputation strategies that account for the missingness and reduce the bias in the statistical analysis.

Results: Here we present our results after evaluating nine imputation methods at four different percentages of missing values of different origins. The performance of each imputation method was assessed by the Normalized Root Mean Squared Error (NRMSE). We demonstrated that random forest (RF) had the lowest NRMSE in the estimation of missing values for Missing at Random (MAR) and Missing Completely at Random (MCAR). For values Missing Not at Random (MNAR), the left-truncated data was best imputed with minimum value imputation. We also tested the different imputation methods on datasets containing missing data of various origins, and RF was the most accurate method in all cases. The results were obtained by repeating the evaluation process 100 times on metabolomics datasets in which missing values were introduced to represent absent data of different origins.

Conclusion: The type and rate of missingness affect the performance and suitability of imputation methods. RF-based imputation performs best in most of the tested scenarios, including combinations of different types and rates of missingness. Therefore, we recommend using random forest-based imputation for missing metabolomics data, especially in situations where the types of missingness are not known in advance.
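The NRMSE-based evaluation described above can be sketched on a toy example. The abstract does not give the exact NRMSE formula, so the definition below (root of the mean squared error over the masked entries divided by the variance of their true values) is an assumption, as are the helper names, the detection limit, and the data. The sketch compares only minimum-value and mean imputation on a left-truncated (MNAR-like) feature; it does not reproduce the study's nine methods or its random forest imputation.

```python
import math
import statistics

def nrmse(true_vals, imputed_vals):
    """Assumed definition: sqrt(mean squared error / variance of the true values)."""
    sq_err = [(t - i) ** 2 for t, i in zip(true_vals, imputed_vals)]
    return math.sqrt(statistics.fmean(sq_err) / statistics.pvariance(true_vals))

def left_truncate(values, detection_limit):
    """Mark values below a hypothetical detection limit as missing (None)."""
    return [v if v >= detection_limit else None for v in values]

def impute_constant(values, fill):
    """Replace every missing entry with the same fill value."""
    return [fill if v is None else v for v in values]

# Hypothetical complete feature, then the same feature with its low values removed.
complete = [2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0]
with_gaps = left_truncate(complete, detection_limit=4.5)

observed = [v for v in with_gaps if v is not None]
missing_idx = [k for k, v in enumerate(with_gaps) if v is None]
truth = [complete[k] for k in missing_idx]

def score(fill):
    """NRMSE of a constant-fill imputation, evaluated on the masked entries only."""
    imputed = impute_constant(with_gaps, fill)
    return nrmse(truth, [imputed[k] for k in missing_idx])

nrmse_min = score(min(observed))                 # minimum value imputation
nrmse_mean = score(statistics.fmean(observed))   # mean imputation, for contrast

print("NRMSE, minimum value imputation:", nrmse_min)
print("NRMSE, mean imputation:", nrmse_mean)
```

In this toy setting the truncated values all lie below the observed minimum, so filling with the minimum lands closer to the truth than filling with the mean. This matches the direction of the MNAR result reported in the abstract but is not evidence for it.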

