Effective Missing Value Imputation Methods for Building Monitoring Data

Abstract

Introduction: Missing values exist widely in mass-spectrometry (MS) based metabolomics data. Various methods have been applied for handling missing values, but the choice of method can significantly affect subsequent data analyses and interpretations. Missing values fall into three types: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR).

Objectives: The aim of this study was to comprehensively compare common imputation methods for different types of missing values using two separate metabolomics data sets (977 and 198 serum samples, respectively) and to propose a strategy for dealing with missing values in metabolomics studies.

Methods: Imputation methods included zero, half minimum (HM), mean, median, random forest (RF), singular value decomposition (SVD), k-nearest neighbors (kNN), and quantile regression imputation of left-censored data (QRILC). Normalized root mean squared error (NRMSE) and the NRMSE-based sum of ranks (SOR) were applied to evaluate imputation accuracy for MCAR/MAR and MNAR, respectively. The principal component analysis (PCA)/partial least squares (PLS)-Procrustes sum of squared error was used to evaluate the overall sample distribution. Student’s t-test followed by Pearson correlation analysis was conducted to evaluate the effect of imputation on univariate statistical analysis.

Results: Our findings demonstrated that RF imputation performed best for MCAR/MAR and that QRILC was the favored method for MNAR.

Conclusion: Combining these findings with the “modified 80% rule”, we proposed a comprehensive strategy and developed a publicly accessible web tool for missing value imputation in metabolomics data.
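To make the evaluation idea concrete, the following is a minimal sketch of how the simple single-value methods named in the abstract (zero, half minimum, mean, median) could be scored with NRMSE under MCAR missingness. It is not the authors' pipeline: the function names `impute_columns` and `nrmse`, the synthetic log-normal data, the 10% missing rate, and the specific NRMSE normalization (variance of the true values at the masked entries) are all assumptions made for illustration.

```python
import numpy as np


def impute_columns(X, method="mean"):
    """Fill NaNs column by column with a simple per-column constant.

    Assumes every column has at least one observed (non-NaN) value.
    """
    X = np.asarray(X, dtype=float)
    out = X.copy()
    for j in range(X.shape[1]):
        missing = np.isnan(X[:, j])
        observed = X[~missing, j]
        if method == "zero":
            fill = 0.0
        elif method == "half_min":
            fill = observed.min() / 2.0
        elif method == "mean":
            fill = observed.mean()
        elif method == "median":
            fill = np.median(observed)
        else:
            raise ValueError(f"unknown method: {method}")
        out[missing, j] = fill
    return out


def nrmse(X_true, X_imputed, mask):
    """NRMSE over masked entries, normalized by the variance of the
    true values at those entries (an assumed, commonly used definition)."""
    diff = X_imputed[mask] - X_true[mask]
    return float(np.sqrt(np.mean(diff ** 2) / np.var(X_true[mask])))


# Hypothetical demo: synthetic log-normal "intensities", 10% MCAR missingness.
rng = np.random.default_rng(0)
X_true = rng.lognormal(mean=2.0, sigma=0.5, size=(50, 4))
mask = rng.random(X_true.shape) < 0.10  # each entry dropped independently
X_missing = X_true.copy()
X_missing[mask] = np.nan

for method in ("zero", "half_min", "mean", "median"):
    score = nrmse(X_true, impute_columns(X_missing, method), mask)
    print(f"{method:>8}: NRMSE = {score:.3f}")
```

Lower NRMSE means the imputed values sit closer to the held-out true values. Model-based methods such as RF, SVD, kNN, and QRILC would be scored by passing their imputed matrix to the same `nrmse` function.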

Evaluation of missing value imputation methods for wireless soil datasets

Personal and Ubiquitous Computing ◽ 10.1007/s00779-016-0978-9 ◽ 2016 ◽ Vol 21 (1) ◽ pp. 113-123 ◽ Cited By ~ 5

Author(s): Jia Shao ◽ Wei Meng ◽ Guodong Sun

Keyword(s): Missing Value ◽ Imputation Methods ◽ Missing Value Imputation

Missing value imputation methods for gene-sample-time microarray data analysis

2010 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology ◽ 10.1109/cibcb.2010.5510349 ◽ 2010 ◽ Cited By ~ 2

Author(s): Yifeng Li ◽ Alioune Ngom ◽ Luis Rueda

Keyword(s): Data Analysis ◽ Microarray Data ◽ Microarray Data Analysis ◽ Missing Value ◽ Imputation Methods ◽ Missing Value Imputation ◽ Sample Time

Missing value imputation methods for TCM medical data and its effect in the classifier accuracy

2017 IEEE 19th International Conference on e-Health Networking, Applications and Services (Healthcom) ◽ 10.1109/healthcom.2017.8210844 ◽ 2017 ◽ Cited By ~ 1

Author(s): Dan Zeng ◽ Dan Xie ◽ Ran Liu ◽ Xiaodong Li

Keyword(s): Medical Data ◽ Missing Value ◽ Imputation Methods ◽ Missing Value Imputation

GSimp: A Gibbs sampler based left-censored missing value imputation approach for metabolomics studies

10.1101/177410 ◽ 2017

Author(s): Runmin Wei ◽ Jingye Wang ◽ Erik Jia ◽ Tianlu Chen ◽ Yan Ni ◽ ...

Keyword(s): Gibbs Sampler ◽ Real World ◽ Missing Values ◽ Statistical Analyses ◽ Targeted Metabolomics ◽ Missing Not At Random ◽ Missing Value ◽ Imputation Methods ◽ Missing Value Imputation ◽ Imputation Approach

Abstract

Left-censored missing values commonly exist in targeted metabolomics datasets and can be considered missing not at random (MNAR). Improper processing of missing values adversely affects subsequent statistical analyses. However, few imputation methods have been developed for and applied to MNAR in the field of metabolomics, so a practical left-censored missing value imputation method is urgently needed. We have developed an iterative Gibbs sampler based left-censored missing value imputation approach (GSimp). We compared GSimp with three other imputation methods on two real-world targeted metabolomics datasets and one simulated dataset using our imputation evaluation pipeline. The results show that GSimp outperforms the other imputation methods in terms of imputation accuracy, observation distribution, univariate and multivariate analyses, and statistical sensitivity. The R code for GSimp, the evaluation pipeline, a vignette, and the real-world and simulated targeted metabolomics datasets are available at: https://github.com/WandeRum/GSimp.

Author summary

Missing values caused by the limit of detection/quantification (LOD/LOQ) are widely observed in mass spectrometry (MS)-based targeted metabolomics studies and can be recognized as missing not at random (MNAR). MNAR leads to biased parameter estimates and jeopardizes subsequent statistical analyses in several ways, such as distorting the sample distribution and impairing statistical power. Although a wide range of missing value imputation methods has been developed for -omics studies, only a limited number are currently designed appropriately for MNAR. To alleviate the problems caused by MNAR and facilitate targeted metabolomics studies, we developed a Gibbs sampler based missing value imputation approach, called GSimp, which is publicly accessible on GitHub. We compared our method with existing approaches using an imputation evaluation pipeline on real-world and simulated metabolomics datasets to demonstrate the superiority of our method from different perspectives.
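The left-censoring mechanism described above can be illustrated with a short simulation. This is only a sketch under stated assumptions and does not implement GSimp: the helper names `left_censor` and `fill_missing`, the log-normal data, and the choice of the 20th percentile as a stand-in LOD are hypothetical. It shows why filling LOD-censored values with the observed mean overestimates the true mean, since every censored value lies below the LOD while the observed mean lies above it.

```python
import numpy as np


def left_censor(x, lod):
    """Replace values below the limit of detection (LOD) with NaN."""
    x = np.asarray(x, dtype=float)
    out = x.copy()
    out[x < lod] = np.nan
    return out


def fill_missing(x, value):
    """Return a copy of x with every NaN replaced by a single constant."""
    out = x.copy()
    out[np.isnan(out)] = value
    return out


# Hypothetical data: one metabolite measured in 1000 samples.
rng = np.random.default_rng(1)
true = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

# Assumed LOD: the 20th percentile, so about 20% of values are censored (MNAR).
lod = np.quantile(true, 0.2)
censored = left_censor(true, lod)
observed = censored[~np.isnan(censored)]

mean_filled = fill_missing(censored, observed.mean())
half_min_filled = fill_missing(censored, observed.min() / 2.0)

print(f"true mean:              {true.mean():.3f}")
print(f"mean imputation:        {mean_filled.mean():.3f}")
print(f"half-minimum imputation: {half_min_filled.mean():.3f}")
```

Methods built for left-censored data, such as QRILC or GSimp, instead draw imputed values from a distribution restricted to the region below the LOD rather than using one constant.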
