GSimp: A Gibbs sampler based left-censored missing value imputation approach for metabolomics studies

Abstract

Left-censored missing values are common in targeted metabolomics datasets and can be considered missing not at random (MNAR). Improper handling of missing values adversely affects subsequent statistical analyses. However, few imputation methods have been developed for and applied to MNAR in metabolomics, so a practical left-censored missing value imputation method is urgently needed. We have developed an iterative Gibbs sampler based left-censored missing value imputation approach (GSimp). We compared GSimp with three other imputation methods on two real-world targeted metabolomics datasets and one simulated dataset using our imputation evaluation pipeline. The results show that GSimp outperforms the other imputation methods in terms of imputation accuracy, observation distribution, univariate and multivariate analyses, and statistical sensitivity. The R code for GSimp, the evaluation pipeline, a vignette, and the real-world and simulated targeted metabolomics datasets are available at https://github.com/WandeRum/GSimp.

Author summary

Missing values caused by the limit of detection/quantification (LOD/LOQ) are widely observed in mass spectrometry (MS)-based targeted metabolomics studies and can be recognized as missing not at random (MNAR). MNAR leads to biased parameter estimates and jeopardizes subsequent statistical analyses in several ways, such as distorting sample distributions and impairing statistical power. Although a wide range of missing value imputation methods has been developed for -omics studies, only a limited number are designed appropriately for MNAR. To alleviate the problems caused by MNAR and facilitate targeted metabolomics studies, we developed a Gibbs sampler based missing value imputation approach, called GSimp, which is publicly accessible on GitHub. We compared our method with existing approaches using an imputation evaluation pipeline on real-world and simulated metabolomics datasets to demonstrate its superiority from different perspectives.
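To make the idea of left-censored imputation concrete, the following is a minimal Python sketch, not the GSimp algorithm itself (whose details are not given in this abstract and whose reference implementation is in R). It assumes a single variable, a known detection limit `lod`, and a normal model; it repeatedly re-estimates the mean and standard deviation and redraws each censored value from the normal distribution truncated above at `lod`. The function name, the half-LOD initialisation, and the plug-in moment estimates are all illustrative assumptions.

```python
import random
from statistics import NormalDist, mean, stdev


def impute_left_censored(values, lod, n_iter=50, seed=0):
    """Fill None entries (values below the detection limit) with draws
    from a normal distribution truncated to (-inf, lod].

    Simplified illustration only: the mean and standard deviation are
    plug-in estimates from the current filled data, not posterior draws.
    """
    rng = random.Random(seed)
    missing_idx = [i for i, v in enumerate(values) if v is None]
    filled = list(values)

    # Assumed starting point: half the detection limit.
    for i in missing_idx:
        filled[i] = lod / 2.0

    for _ in range(n_iter):
        mu = mean(filled)
        sigma = max(stdev(filled), 1e-9)  # guard against zero spread
        dist = NormalDist(mu, sigma)
        # Probability mass below the detection limit, kept inside (0, 1).
        upper = min(max(dist.cdf(lod), 1e-12), 1.0 - 1e-12)
        for i in missing_idx:
            # Inverse-CDF sampling restricted to the region below lod.
            u = upper * (1.0 - rng.random())  # lies in (0, upper]
            filled[i] = min(lod, dist.inv_cdf(u))
    return filled


if __name__ == "__main__":
    data = [5.1, None, 4.8, 6.0, None, 5.5]
    print(impute_left_censored(data, lod=4.0))
```

Observed values are left untouched, and every imputed value is constrained to lie at or below the detection limit, which is the defining property of a left-censored imputation.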


A data-driven missing value imputation approach for longitudinal datasets

Artificial Intelligence Review ◽

10.1007/s10462-021-09963-5 ◽

2021 ◽

Author(s):

Caio Ribeiro ◽

Alex A. Freitas

Keyword(s):

Missing Data ◽

Longitudinal Data ◽

Missing Values ◽

Error Rates ◽

Imputation Method ◽

Data Driven ◽

Missing Value ◽

Missing Value Imputation ◽

Human Ageing ◽

Imputation Approach

Abstract

Longitudinal datasets of human ageing studies usually have a high volume of missing data, and one way to handle missing values in a dataset is to replace them with estimations. However, there are many methods to estimate missing values, and no single method is the best for all datasets. In this article, we propose a data-driven missing value imputation approach that performs a feature-wise selection of the best imputation method, using known information in the dataset to rank the five methods we selected, based on their estimation error rates. We evaluated the proposed approach in two sets of experiments: a classifier-independent scenario, where we compared the applicabilities and error rates of each imputation method; and a classifier-dependent scenario, where we compared the predictive accuracy of Random Forest classifiers generated with datasets prepared using each imputation method and a baseline approach of doing no imputation (letting the classification algorithm handle the missing values internally). Based on our results from both sets of experiments, we concluded that the proposed data-driven missing value imputation approach generally resulted in models with more accurate estimations for missing data and better performing classifiers, in longitudinal datasets of human ageing. We also observed that imputation methods devised specifically for longitudinal data had very accurate estimations. This reinforces the idea that using the temporal information intrinsic to longitudinal data is a worthwhile endeavour for machine learning applications, and that can be achieved through the proposed data-driven approach.
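The feature-wise selection described in this abstract can be sketched roughly as follows. This is a hedged illustration in Python, not the authors' implementation: the abstract does not name the five candidate methods, so the three used here (mean, median, and last observation carried forward) are assumptions, as are the function names and the leave-one-out error estimate. Each feature is treated as a list of values ordered by wave, with `None` marking a missing entry.

```python
from statistics import mean, median


def impute_mean(col):
    m = mean(v for v in col if v is not None)
    return [m if v is None else v for v in col]


def impute_median(col):
    m = median(v for v in col if v is not None)
    return [m if v is None else v for v in col]


def impute_locf(col):
    """Last observation carried forward; leading gaps take the first
    observed value."""
    last = next(v for v in col if v is not None)
    out = []
    for v in col:
        if v is not None:
            last = v
        out.append(last)
    return out


def select_method(col, candidates):
    """Hide each known value in turn, re-estimate it with every candidate,
    and return the candidate with the lowest total absolute error."""
    known = [i for i, v in enumerate(col) if v is not None]
    if len(known) < 2:
        raise ValueError("need at least two known values to rank methods")
    errors = {name: 0.0 for name in candidates}
    for hidden in known:
        masked = list(col)
        masked[hidden] = None
        for name, fn in candidates.items():
            errors[name] += abs(fn(masked)[hidden] - col[hidden])
    best = min(errors, key=errors.get)
    return best, errors


if __name__ == "__main__":
    candidates = {"mean": impute_mean, "median": impute_median, "locf": impute_locf}
    feature = [1.0, 2.0, None, 4.0, 5.0, 6.0, None, 8.0]
    best, errors = select_method(feature, candidates)
    print(best, errors)
    print(candidates[best](feature))
```

Running the selection once per feature lets different features end up with different imputation methods, which is the sense in which the approach is data-driven.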


Missing Value Imputation Approach for Mass Spectrometry-based Metabolomics Data

Scientific Reports ◽

10.1038/s41598-017-19120-0 ◽

2018 ◽

Vol 8 (1) ◽

Cited By ~ 93

Author(s):

Runmin Wei ◽

Jingye Wang ◽

Mingming Su ◽

Erik Jia ◽

Shaoqiu Chen ◽

...

Keyword(s):

Mass Spectrometry ◽

Metabolomics Data ◽

Missing Value ◽

Missing Value Imputation ◽

Imputation Approach


APPLICATION OF AN EMPIRICAL GROWTH MODEL AND MULTIPLE IMPUTATION IN HARD DISK DRIVE FIELD RETURN PREDICTION

International Journal of Reliability Quality and Safety Engineering ◽

10.1142/s0218539310003950 ◽

2010 ◽

Vol 17 (06) ◽

pp. 565-577 ◽

Cited By ~ 6

Author(s):

SHAOANG ZHANG ◽

FENG-BIN SUN ◽

ROSS GOUGH

Keyword(s):

Hard Disk Drive ◽

Growth Model ◽

Survival Data ◽

Early Stage ◽

Hard Disk ◽

Disk Drive ◽

Model Estimation ◽

Missing Value ◽

Missing Value Imputation ◽

Imputation Approach

Appropriate model assumptions and robust model estimation are critical for accurate hard disk drive (HDD) field return prediction. Parametric distribution models that are conventionally used in HDD reliability analyses do not describe HDD field survival well. A biological growth model, the Bertalanffy-Richards growth model, is introduced in this paper to model HDD field survival within the warranty period. A nonparametric multiple missing value imputation approach is used to improve model estimation when not enough observations of field returns are available at an early stage of product life. Tests of the model and the missing value imputation approach using actual HDD field survival data suggest that they perform well in describing and predicting HDD warranty field returns.
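For orientation, one common parameterization of the Richards (generalized logistic) growth curve is sketched below in Python. The abstract does not state the exact functional form used in the paper, so the formula, the parameter names (`upper`, `rate`, `q`, `nu`), and the example values are assumptions chosen for illustration only.

```python
import math


def richards_curve(t, upper, rate, q, nu):
    """Generalized logistic (Richards) curve with lower asymptote 0:

        y(t) = upper / (1 + q * exp(-rate * t)) ** (1 / nu)

    upper: upper asymptote, rate: growth rate,
    q: shifts the curve along t, nu: controls asymmetry (nu > 0).
    """
    return upper / (1.0 + q * math.exp(-rate * t)) ** (1.0 / nu)


if __name__ == "__main__":
    # Hypothetical cumulative return fraction over months in service.
    for month in range(0, 37, 6):
        print(month, richards_curve(month, upper=0.05, rate=0.2, q=20.0, nu=0.8))
```

With `nu = 1` and `q = 1` the curve reduces to the ordinary logistic function, which is a quick sanity check on any implementation.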


rMisbeta: A Robust Missing Value Imputation Approach in Transcriptomics and Metabolomics Data

Computers in Biology and Medicine ◽

10.1016/j.compbiomed.2021.104911 ◽

2021 ◽

pp. 104911

Author(s):

Md. Shahjaman ◽

Md. Rezanur Rahman ◽

Tania Islam ◽

Md. Rabiul Auwul ◽

Mohammad Ali Moni ◽

...

Keyword(s):

Metabolomics Data ◽

Missing Value ◽

Missing Value Imputation ◽

Imputation Approach


Missing Value Imputation Approach Using Cosine Similarity Measure

Advances in Intelligent Systems and Computing - International Conference on Innovative Computing and Communications ◽

10.1007/978-981-15-5113-0_44 ◽

2020 ◽

pp. 557-565

Author(s):

Wajeeha Rashid ◽

Sakshi Arora ◽

Manoj Kumar Gupta

Keyword(s):

Similarity Measure ◽

Cosine Similarity ◽

Missing Value ◽

Missing Value Imputation ◽

Cosine Similarity Measure ◽

Imputation Approach


Enriching Integrated Statistical Open City Data by Combining Equational Knowledge and Missing Value Imputation

SSRN Electronic Journal ◽

10.2139/ssrn.3199313 ◽

2018 ◽

Author(s):

Stefan Bischof ◽

Andreas Harth ◽

Benedikt Kämpgen ◽

Axel Polleres ◽

Patrik Schneider

Keyword(s):

Missing Value ◽

Missing Value Imputation


Effective Missing Value Imputation Methods for Building Monitoring Data

2020 IEEE International Conference on Big Data (Big Data) ◽

10.1109/bigdata50022.2020.9378230 ◽

2020 ◽

Author(s):

Brian Cho ◽

Teresa Dayrit ◽

Yuan Gao ◽

Zhe Wang ◽

Tianzhen Hong ◽

...

Keyword(s):

Monitoring Data ◽

Missing Value ◽

Imputation Methods ◽

Missing Value Imputation ◽

Building Monitoring


IFGAN: Missing Value Imputation using Feature-specific Generative Adversarial Networks

2020 IEEE International Conference on Big Data (Big Data) ◽

10.1109/bigdata50022.2020.9378240 ◽

2020 ◽

Author(s):

Wei Qiu ◽

Yangsibo Huang ◽

Quanzheng Li

Keyword(s):

Generative Adversarial Networks ◽

Missing Value ◽

Missing Value Imputation ◽

Adversarial Networks


A systematic review of machine learning-based missing value imputation techniques

Data Technologies and Applications ◽

10.1108/dta-12-2020-0298 ◽

2021 ◽

Vol ahead-of-print (ahead-of-print) ◽

Author(s):

Tressy Thomas ◽

Enayat Rajabi

Keyword(s):

Machine Learning ◽

Selection Process ◽

Evaluation Metrics ◽

Correct Prediction ◽

Data Sets ◽

Data Set ◽

Missing Value ◽

Content Type ◽

Missing Value Imputation ◽

Literature Reviews

Purpose

The primary aim of this study is to review studies along several dimensions, including the type of methods, experimentation setup and evaluation metrics used in the novel approaches proposed for data imputation, particularly in the machine learning (ML) area. This ultimately provides an understanding of how well the proposed frameworks are evaluated and what types and ratios of missingness are addressed in the proposals. The review questions in this study are: (1) What ML-based imputation methods were studied and proposed during 2010–2020? (2) How are the experimentation setup, characteristics of data sets and missingness employed in these studies? (3) What metrics were used for the evaluation of the imputation methods?

Design/methodology/approach

The review process went through the standard identification, screening and selection process. The initial search on electronic databases for missing value imputation (MVI) based on ML algorithms returned a large number of papers, totaling 2,883. Most of the papers at this stage did not describe an MVI technique relevant to this study. The papers were first scanned by title for relevance, and 306 were identified as appropriate. Upon reviewing the abstract text, 151 papers that were not eligible for this study were dropped. This resulted in 155 research papers suitable for full-text review. From these, 117 papers were used in the assessment of the review questions.

Findings

This study shows that clustering- and instance-based algorithms are the most proposed MVI methods. Percentage of correct prediction (PCP) and root mean square error (RMSE) are the most used evaluation metrics in these studies. For experimentation, the majority of the studies sourced their data sets from publicly available data set repositories. A common approach is to use the complete data set as the baseline for evaluating the effectiveness of imputation on test data sets with artificially induced missingness. The data set size and missingness ratio varied across the experiments, while the missing data type and mechanism pertain to the capability of the imputation method. Computational expense is a concern, and experimentation using large data sets appears to be a challenge.

Originality/value

The review shows that there is no single universal solution to the missing data problem. Variants of ML approaches work well with missingness depending on the characteristics of the data set. Most of the methods reviewed lack generalization with regard to applicability. Another concern related to applicability is the complexity of the formulation and implementation of the algorithm. Imputation methods based on k-nearest neighbors (kNN) and clustering algorithms are simple and easy to implement, which makes them popular across various domains.
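The evaluation setup mentioned in the findings (a complete data set as baseline, artificially induced missingness, and RMSE as the metric) can be illustrated with a small Python sketch around a basic kNN imputer. This is an assumption-laden example rather than a method taken from any reviewed paper: the distance definition, the choice of `k`, the masking scheme (cells hidden completely at random), and the function names are all hypothetical.

```python
import math
import random


def knn_impute(rows, k=2):
    """Fill None cells with the mean of that column over the k nearest
    rows, using Euclidean distance on the columns both rows observe."""

    def distance(a, b):
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            donors = sorted(
                (distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            )
            nearest = [val for _, val in donors[:k]]
            if not nearest:
                raise ValueError(f"no donor rows for column {j}")
            filled[i][j] = sum(nearest) / len(nearest)
    return filled


def rmse_after_masking(complete, ratio=0.2, k=2, seed=0):
    """Hide a random share of cells, impute them, and score against the
    known true values from the complete data set."""
    rng = random.Random(seed)
    cells = [(i, j) for i, r in enumerate(complete) for j in range(len(r))]
    hidden = rng.sample(cells, max(1, int(ratio * len(cells))))
    masked = [list(r) for r in complete]
    for i, j in hidden:
        masked[i][j] = None
    imputed = knn_impute(masked, k)
    squared = sum((imputed[i][j] - complete[i][j]) ** 2 for i, j in hidden)
    return math.sqrt(squared / len(hidden))


if __name__ == "__main__":
    complete = [
        [1.0, 2.0, 3.0], [1.1, 2.1, 3.2], [0.9, 1.9, 2.8],
        [5.0, 6.0, 7.0], [5.2, 6.1, 7.1], [4.9, 5.8, 6.9],
    ]
    print(rmse_after_masking(complete, ratio=0.2, k=2))
```

Because the hidden cells have known true values, the same harness can compare several imputers or missingness ratios on equal footing.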
