A New Approach of Outlier-robust Missing Value Imputation for Metabolomics Data Analysis

Background: Metabolomics data generation and quantification differ from those of other types of molecular “omics” data in bioinformatics. Mass spectrometry (MS) based metabolomics data, such as gas chromatography-mass spectrometry (GC-MS) and liquid chromatography-mass spectrometry (LC-MS) data, frequently contain missing values that complicate quantitative analysis. Metabolomics datasets typically contain 10% to 20% missing values, which arise from analytical, computational, and biological causes. Imputing these missing values is therefore an important step before further metabolomics data analysis. Objective: This paper introduces a new algorithm for missing value imputation in the presence of outliers for metabolomics data analysis. Method: Currently, the best-known missing value imputation techniques for metabolomics data are k-nearest neighbours (kNN), random forest (RF), and zero imputation. However, these techniques are sensitive to outliers. In this paper, we propose an outlier-robust missing value imputation technique that imputes missing values in metabolomics data by minimizing a two-way empirical mean absolute error (MAE) loss function. Results: We investigated the performance of the proposed technique in comparison with the traditional imputation techniques, using both simulated and real data in the absence and presence of outliers. Conclusion: Results of both the simulated and real data analyses show that the proposed outlier-robust technique performs better than the traditional missing value imputation methods in both the absence and presence of outliers.
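The robustness argument behind an absolute-error loss can be illustrated in one dimension: the constant that minimizes the sum of absolute errors is the median, while the constant that minimizes the sum of squared errors is the mean. The sketch below is not the two-way algorithm proposed in the paper; it only contrasts the two fill values on hypothetical intensities for a single metabolite, one of which is an injected outlier. The function name `impute` and the data are assumptions made for illustration.

```python
from statistics import mean, median


def impute(values, estimator):
    """Replace None entries with estimator(observed values)."""
    observed = [v for v in values if v is not None]
    fill = estimator(observed)
    return [fill if v is None else v for v in values]


# Hypothetical intensities for one metabolite; 95.0 is an injected outlier.
intensities = [5.1, 4.9, None, 5.0, 95.0, 5.2]

mean_filled = impute(intensities, mean)      # squared-error minimizer
median_filled = impute(intensities, median)  # absolute-error minimizer

print("mean fill:  ", mean_filled[2])
print("median fill:", median_filled[2])
```

The mean-based fill is pulled toward the outlier, whereas the median-based fill stays with the bulk of the observed values.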


Metabolomic Biomarker Identification in Presence of Outliers and Missing Values

BioMed Research International ◽

10.1155/2017/2437608 ◽

2017 ◽

Vol 2017 ◽

pp. 1-11 ◽

Cited By ~ 9

Author(s):

Nishith Kumar ◽

Md. Aminul Hoque ◽

Md. Shahjaman ◽

S. M. Shahinul Islam ◽

Md. Nurul Haque Mollah

Keyword(s):

Data Analysis ◽

High Throughput ◽

Missing Values ◽

Real Data ◽

Data Matrix ◽

Metabolomics Data ◽

Missing Value ◽

Biomarker Identification ◽

Missing Value Imputation ◽

High Throughput Technology

Metabolomics is a sophisticated, high-throughput technology based on the entire set of metabolites, which is known as the connector between genotypes and phenotypes. For any phenotypic change, identifying potential metabolites (biomarkers) is very important because it provides diagnostic as well as prognostic markers and can help to develop new biomolecular therapies. Biomarker identification from metabolomics data is hampered by the fact that high-throughput technology produces a high-dimensional data matrix that contains missing values as well as outliers. Missing value imputation and outlier handling techniques therefore play an important role in identifying biomarkers correctly. Although several missing value imputation techniques are available, outliers degrade the accuracy of imputation as well as the accuracy of biomarker identification. Therefore, in this paper we propose a new biomarker identification technique that combines groupwise robust singular value decomposition, the t-test, and the fold-change approach, and that can identify biomarkers more accurately from metabolomics datasets. We also compare the performance of the proposed technique with that of other traditional techniques for biomarker identification, using both simulated and real data in the absence and presence of outliers. Applying the proposed method to a hepatocellular carcinoma (HCC) dataset, we identified four upregulated and two downregulated metabolites as potential metabolomic biomarkers for HCC.
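The combination of a t-test with a fold-change filter can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the groupwise robust singular value decomposition step is omitted, a cutoff on the absolute Welch t statistic stands in for a p-value threshold so that the example needs only the standard library, and the cutoffs, metabolite names, and intensities are all hypothetical.

```python
from math import log2, sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two independent samples."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


def log2_fold_change(case, control):
    """log2 ratio of group means (assumes positive intensities)."""
    return log2(mean(case) / mean(control))


def flag_candidates(data, t_cut=3.0, lfc_cut=1.0):
    """Keep metabolites that pass both the t and the fold-change cutoff.

    data maps a metabolite name to (case_values, control_values).
    """
    hits = {}
    for name, (case, control) in data.items():
        t = welch_t(case, control)
        lfc = log2_fold_change(case, control)
        if abs(t) >= t_cut and abs(lfc) >= lfc_cut:
            hits[name] = "up" if lfc > 0 else "down"
    return hits


# Hypothetical intensities: (case group, control group).
data = {
    "m1": ([8.0, 8.4, 7.9, 8.2], [2.0, 2.1, 1.9, 2.2]),
    "m2": ([1.0, 1.1, 0.9, 1.0], [4.0, 4.2, 3.9, 4.1]),
    "m3": ([3.0, 3.1, 2.9, 3.2], [3.0, 2.9, 3.1, 3.0]),
}
hits = flag_candidates(data)
print(hits)
```

Requiring both criteria filters out metabolites whose difference is statistically detectable but small in magnitude, as well as those with a large but noisy fold change.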


Missing Value Imputation Approach for Mass Spectrometry-based Metabolomics Data

10.1101/171967 ◽

2017 ◽

Cited By ~ 1

Author(s):

Runmin Wei ◽

Jingye Wang ◽

Mingming Su ◽

Erik Jia ◽

Tianlu Chen ◽

...

Keyword(s):

Mass Spectrometry ◽

Missing Values ◽

Pearson Correlation ◽

Imputation Accuracy ◽

Metabolomics Data ◽

Missing Value ◽

Sample Distribution ◽

Imputation Methods ◽

Missing Value Imputation ◽

Squared Error

Introduction: Missing values exist widely in mass spectrometry (MS) based metabolomics data. Various methods have been applied for handling missing values, but the choice of method can significantly affect subsequent data analyses and interpretations. By definition, there are three types of missing values: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). Objectives: The aim of this study was to comprehensively compare common imputation methods for different types of missing values using two separate metabolomics data sets (977 and 198 serum samples, respectively) and to propose a strategy for dealing with missing values in metabolomics studies. Methods: Imputation methods included zero, half minimum (HM), mean, median, random forest (RF), singular value decomposition (SVD), k-nearest neighbors (kNN), and quantile regression imputation of left-censored data (QRILC). Normalized root mean squared error (NRMSE) and NRMSE-based sum of ranks (SOR) were applied to evaluate the imputation accuracy for MCAR/MAR and MNAR, respectively. Principal component analysis (PCA)/partial least squares (PLS)-Procrustes sum of squared error was used to evaluate the overall sample distribution. Student’s t-test followed by Pearson correlation analysis was conducted to evaluate the effect of imputation on univariate statistical analysis. Results: Our findings demonstrated that RF imputation performed the best for MCAR/MAR and that QRILC was the favored method for MNAR. Conclusion: Combining these findings with the “modified 80% rule”, we proposed a comprehensive strategy and developed a publicly accessible web tool for missing value imputation in metabolomics data.
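An evaluation of this kind can be sketched by hiding known values, imputing them, and scoring the result. The example below covers only the three simplest methods named in the abstract (half minimum, mean, median). The NRMSE shown is one common definition, the root of the mean squared error over the hidden entries divided by the variance of the true values, and it is an assumption that this matches the study's exact formula. The data are hypothetical, and the lowest value is hidden to mimic left-censored (MNAR-like) missingness.

```python
from math import sqrt
from statistics import mean, median, pvariance


def half_minimum(observed):
    """Half of the smallest observed value."""
    return min(observed) / 2


def impute(values, fill_fn):
    """Replace None entries with fill_fn(observed values)."""
    observed = [v for v in values if v is not None]
    fill = fill_fn(observed)
    return [fill if v is None else v for v in values]


def nrmse(truth, imputed, masked):
    """sqrt(MSE over masked indices / population variance of truth)."""
    mse = mean([(truth[i] - imputed[i]) ** 2 for i in masked])
    return sqrt(mse / pvariance(truth))


# Hypothetical intensities for one metabolite; index 0 (the lowest) is hidden.
truth = [1.5, 4.0, 6.0, 8.0, 10.0]
masked = [0]
with_gaps = [None if i in masked else v for i, v in enumerate(truth)]

methods = {"half_minimum": half_minimum, "mean": mean, "median": median}
scores = {
    name: nrmse(truth, impute(with_gaps, fn), masked)
    for name, fn in methods.items()
}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

On this toy example the half-minimum fill scores better than the mean or median because the hidden value lies below everything that was observed, which is the situation left-censoring methods are designed for.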


Web Server for Peak Detection, Baseline Correction, and Alignment in Two-Dimensional Gas Chromatography Mass Spectrometry-Based Metabolomics Data

Analytical Chemistry ◽

10.1021/acs.analchem.6b00755 ◽

2016 ◽

Vol 88 (21) ◽

pp. 10395-10403 ◽

Cited By ~ 11

Author(s):

Tze-Feng Tian ◽

San-Yuan Wang ◽

Tien-Chueh Kuo ◽

Cheng-En Tan ◽

Guan-Yuan Chen ◽

...

Keyword(s):

Mass Spectrometry ◽

Gas Chromatography ◽

Web Server ◽

Peak Detection ◽

Gas Chromatography Mass Spectrometry ◽

Baseline Correction ◽

Two Dimensional ◽

Metabolomics Data ◽

Chromatography Mass Spectrometry


ADAP-GC 4.0: Application of Clustering-Assisted Multivariate Curve Resolution to Spectral Deconvolution of Gas Chromatography–Mass Spectrometry Metabolomics Data

Analytical Chemistry ◽

10.1021/acs.analchem.9b01424 ◽

2019 ◽

Vol 91 (14) ◽

pp. 9069-9077 ◽

Cited By ~ 6

Author(s):

Aleksandr Smirnov ◽

Yunping Qiu ◽

Wei Jia ◽

Douglas I. Walker ◽

Dean P. Jones ◽

...

Keyword(s):

Mass Spectrometry ◽

Gas Chromatography ◽

Multivariate Curve Resolution ◽

Gas Chromatography Mass Spectrometry ◽

Spectral Deconvolution ◽

Metabolomics Data ◽

Curve Resolution ◽

Chromatography Mass Spectrometry


Validated and Predictive Processing of Gas Chromatography-Mass Spectrometry Based Metabolomics Data for Large Scale Screening Studies, Diagnostics and Metabolite Pattern Verification

Metabolites ◽

10.3390/metabo2040796 ◽

2012 ◽

Vol 2 (4) ◽

pp. 796-817 ◽

Cited By ~ 6

Author(s):

Elin Thysell ◽

Elin Chorell ◽

Michael Svensson ◽

Pär Jonsson ◽

Henrik Antti

Keyword(s):

Mass Spectrometry ◽

Gas Chromatography ◽

Large Scale ◽

Predictive Processing ◽

Gas Chromatography Mass Spectrometry ◽

Metabolomics Data ◽

Chromatography Mass Spectrometry ◽

Metabolite Pattern ◽

Large Scale Screening


Comparison of Algorithms for Clustering Incomplete Data

Foundations of Computing and Decision Sciences ◽

10.2478/fcds-2014-0007 ◽

2014 ◽

Vol 39 (2) ◽

pp. 107-127 ◽

Cited By ~ 6

Author(s):

Artur Matyja ◽

Krzysztof Siminski

Keyword(s):

Data Analysis ◽

Incomplete Data ◽

Missing Values ◽

Real Data ◽

Complete Data ◽

The Other ◽

Data Sets ◽

Missing Value ◽

Comparison Of Algorithms ◽

New Algorithms

Missing values are not uncommon in real data sets. Algorithms and methods designed for the analysis of complete data sets cannot always be applied to data with missing values. To use existing methods for complete data, data sets with missing values are preprocessed. Another solution to this problem is the creation of new algorithms dedicated to data sets with missing values. The objective of our research is to compare preprocessing techniques and specialised algorithms and to find where each is most advantageous.
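The two routes can be contrasted on the distance computation that most clustering algorithms rely on. The abstract does not name the algorithms compared, so the choices below are assumptions: column-mean imputation stands in for the preprocessing route, and a partial distance (squared Euclidean distance over jointly observed features, rescaled to the full dimension) stands in for a specialised approach. The data are hypothetical.

```python
from statistics import mean


def partial_distance(a, b):
    """Squared distance over jointly observed features, rescaled to len(a)."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        raise ValueError("no jointly observed features")
    return len(a) / len(shared) * sum((x - y) ** 2 for x, y in shared)


def mean_impute(rows):
    """Fill each missing entry with the mean of its column's observed values."""
    col_means = [
        mean([v for v in col if v is not None]) for col in zip(*rows)
    ]
    return [
        [col_means[j] if v is None else v for j, v in enumerate(row)]
        for row in rows
    ]


def squared_distance(a, b):
    """Plain squared Euclidean distance for complete vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Hypothetical incomplete data: three samples, three features.
rows = [
    [1.0, None, 3.0],
    [4.0, 2.0, None],
    [1.0, 4.0, 5.0],
]

filled = mean_impute(rows)
print("partial distance:", partial_distance(rows[0], rows[1]))
print("after imputation:", squared_distance(filled[0], filled[1]))
```

The two routes give different distances for the same pair of samples, which is why the choice between preprocessing and a dedicated algorithm can change the resulting clusters.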
