A data-driven missing value imputation approach for longitudinal datasets

Author(s):  
Caio Ribeiro ◽  
Alex A. Freitas

Longitudinal datasets of human ageing studies usually have a high volume of missing data, and one way to handle missing values in a dataset is to replace them with estimated values. However, there are many methods for estimating missing values, and no single method is best for all datasets. In this article, we propose a data-driven missing value imputation approach that performs a feature-wise selection of the best imputation method, using known information in the dataset to rank the five methods we selected based on their estimation error rates. We evaluated the proposed approach in two sets of experiments: a classifier-independent scenario, in which we compared the applicability and error rates of each imputation method; and a classifier-dependent scenario, in which we compared the predictive accuracy of Random Forest classifiers generated from datasets prepared with each imputation method and with a baseline approach of doing no imputation (letting the classification algorithm handle the missing values internally). Based on the results of both sets of experiments, we conclude that the proposed data-driven missing value imputation approach generally results in models with more accurate estimates for missing data and better-performing classifiers on longitudinal datasets of human ageing. We also observed that imputation methods devised specifically for longitudinal data produced very accurate estimates. This reinforces the idea that exploiting the temporal information intrinsic to longitudinal data is a worthwhile endeavour for machine learning applications, and that this can be achieved through the proposed data-driven approach.
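For illustration, the following Python sketch captures the feature-wise, data-driven selection idea: for each feature with missing entries, hide a fraction of its observed values, apply several candidate imputers, and keep the one with the lowest estimation error on the hidden cells. The candidate set (mean, median, k-NN via scikit-learn), the masking fraction, and the function name are assumptions for this example; the paper's actual five candidate methods include longitudinal-aware imputers not reproduced here.

```python
# Minimal sketch: per-feature selection of an imputation method by estimation error.
# Assumes every feature has at least some observed values, so no columns are dropped
# by the imputers during fit_transform.
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

def pick_imputer_per_feature(X, mask_frac=0.2, random_state=0):
    rng = np.random.default_rng(random_state)
    candidates = {
        "mean": SimpleImputer(strategy="mean"),
        "median": SimpleImputer(strategy="median"),
        "knn": KNNImputer(n_neighbors=5),
    }
    best = {}
    for j in range(X.shape[1]):
        observed = np.flatnonzero(~np.isnan(X[:, j]))
        if observed.size < 10:               # too little information to rank methods
            best[j] = "mean"
            continue
        held_out = rng.choice(observed, size=max(1, int(mask_frac * observed.size)),
                              replace=False)
        X_masked = X.copy()
        X_masked[held_out, j] = np.nan       # hide known cells to score each method
        errors = {}
        for name, imputer in candidates.items():
            X_hat = imputer.fit_transform(X_masked)
            errors[name] = np.mean(np.abs(X_hat[held_out, j] - X[held_out, j]))
        best[j] = min(errors, key=errors.get)  # lowest estimation error wins
    return best
```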

2017 ◽  
Author(s):  
Runmin Wei ◽  
Jingye Wang ◽  
Erik Jia ◽  
Tianlu Chen ◽  
Yan Ni ◽  
...  

Left-censored missing values commonly exist in targeted metabolomics datasets and can be considered missing not at random (MNAR). Improper processing of missing values adversely affects subsequent statistical analyses. However, few imputation methods have been developed for and applied to the MNAR situation in metabolomics, so a practical left-censored missing value imputation method is urgently needed. We have developed an iterative Gibbs sampler based left-censored missing value imputation approach (GSimp). We compared GSimp with three other imputation methods on two real-world targeted metabolomics datasets and one simulated dataset using our imputation evaluation pipeline. The results show that GSimp outperforms the other imputation methods in terms of imputation accuracy, observation distribution, univariate and multivariate analyses, and statistical sensitivity. The R code for GSimp, the evaluation pipeline, a vignette, and the real-world and simulated targeted metabolomics datasets are available at: https://github.com/WandeRum/GSimp.

Author summary: Missing values caused by the limit of detection/quantification (LOD/LOQ) are widely observed in mass spectrometry (MS)-based targeted metabolomics studies and can be regarded as missing not at random (MNAR). MNAR leads to biased parameter estimates and jeopardizes subsequent statistical analyses in several ways, such as distorting the sample distribution and impairing statistical power. Although a wide range of missing value imputation methods has been developed for omics studies, few are currently designed for the MNAR situation. To alleviate the problems caused by MNAR and facilitate targeted metabolomics studies, we developed a Gibbs sampler based missing value imputation approach, called GSimp, which is publicly accessible on GitHub. We compared our method with existing approaches using an imputation evaluation pipeline on real-world and simulated metabolomics datasets to demonstrate its superiority from different perspectives.
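As a rough illustration of left-censored imputation (not the GSimp algorithm itself, which iterates Gibbs-sampler draws conditioned on the other variables; see the authors' R implementation on GitHub), the sketch below fills each missing cell with a draw from a normal distribution fitted to the column's observed values and truncated above at the observed minimum, used here as a stand-in for the LOD. The function name and thresholds are illustrative assumptions.

```python
# Sketch: truncated-normal draws for left-censored (MNAR) cells, column by column.
import numpy as np
from scipy.stats import norm

def impute_left_censored(X, random_state=0):
    """X: (samples x metabolites) array, with NaN marking values below the LOD."""
    rng = np.random.default_rng(random_state)
    X = X.copy()
    for j in range(X.shape[1]):
        col = X[:, j]
        miss = np.isnan(col)
        obs = col[~miss]
        if not miss.any() or obs.size < 3:   # assume a few observed values per column
            continue
        mu, sigma = obs.mean(), max(obs.std(ddof=1), 1e-6)
        lod = obs.min()                      # observed minimum as a proxy for the LOD
        # Inverse-CDF sampling from a normal truncated to (-inf, lod]
        p_upper = norm.cdf((lod - mu) / sigma)
        u = rng.uniform(1e-12, p_upper, size=miss.sum())
        X[miss, j] = mu + sigma * norm.ppf(u)
    return X
```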


2021 ◽  
Vol 14 (11) ◽  
pp. 2533-2545
Author(s):  
Parikshit Bansal ◽  
Prathamesh Deshpande ◽  
Sunita Sarawagi

We present DeepMVI, a deep learning method for missing value imputation in multidimensional time-series datasets. Missing values are commonplace in decision support platforms that aggregate data over long time stretches from disparate sources, whereas reliable data analytics calls for careful handling of missing data. One strategy is imputing the missing values, and a wide variety of algorithms exist, spanning simple interpolation, matrix factorization methods like SVD, statistical models like Kalman filters, and recent deep learning methods. We show that these often provide worse results on aggregate analytics than simply excluding the missing data. DeepMVI expresses the distribution of each missing value conditioned on coarse and fine-grained signals along a time series, and on signals from correlated series at the same time. Instead of resorting to the linearity assumptions of conventional matrix factorization methods, DeepMVI harnesses a flexible deep network to extract and combine these signals in an end-to-end manner. To prevent over-fitting with high-capacity neural networks, we design a robust training procedure that uses labeled data created from synthetic missing blocks around available indices. Our neural network uses a modular design with a novel temporal transformer with convolutional features, and kernel regression with learned embeddings. Experiments across ten real datasets and five different missing-data scenarios, comparing seven conventional and three deep learning methods, show that DeepMVI is significantly more accurate, reducing error by more than 50% in more than half the cases compared with the best existing method. Although DeepMVI is slower than simpler matrix factorization methods, the increased time overhead is justified by significantly more accurate imputation, which ultimately improves the quality of downstream analytics.
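The self-supervision step described above (creating synthetic missing blocks around available indices so the model can be trained with known targets) can be sketched as follows. Block length, block count, and the masking scheme are illustrative assumptions; the transformer and kernel-regression components of DeepMVI are not reproduced here.

```python
# Sketch: carve synthetic missing blocks out of observed stretches of each series.
import numpy as np

def make_synthetic_blocks(X, block_len=10, blocks_per_series=3, random_state=0):
    """X: (n_series, n_time) array with NaN marking genuinely missing values."""
    rng = np.random.default_rng(random_state)
    X_masked = X.copy()
    target_mask = np.zeros_like(X, dtype=bool)   # True where a synthetic gap was made
    for i in range(X.shape[0]):
        observed = np.flatnonzero(~np.isnan(X[i]))
        if observed.size <= block_len:
            continue
        for _ in range(blocks_per_series):
            start = rng.choice(observed[:-block_len])
            end = min(start + block_len, X.shape[1])
            block = np.arange(start, end)
            block = block[~np.isnan(X[i, block])]  # only hide values we actually know
            X_masked[i, block] = np.nan
            target_mask[i, block] = True
    # Training objective: predict X[target_mask] from X_masked.
    return X_masked, target_mask
```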


2015 ◽  
Vol 2015 ◽  
pp. 1-8 ◽  
Author(s):  
Xiaobo Yan ◽  
Weiqing Xiong ◽  
Liang Hu ◽  
Feng Wang ◽  
Kuo Zhao

This paper addresses missing value imputation for the Internet of Things (IoT). The IoT is now used widely across domains such as transportation, logistics, and healthcare. However, missing values are very common in IoT data for a variety of reasons, leaving experimental data incomplete; as a result, work that depends on IoT data cannot always be carried out normally, and the accuracy and reliability of data analysis results are reduced. Based on the characteristics of the data itself and the features of missing data in the IoT, this paper divides missing data into three types and defines three corresponding missing value imputation problems. We then propose three new models to solve these problems: a model of missing value imputation based on context and linear mean (MCL), a model based on binary search (MBS), and a model based on a Gaussian mixture model (MGI). Experimental results show that the three models greatly improve the accuracy, reliability, and stability of missing value imputation.
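In the spirit of the context/linear-mean idea (MCL), a generic sketch for a single sensor stream is shown below: missing readings are estimated from their temporal neighbours by linear interpolation, with gaps at the boundaries held at the nearest observed value. This is an illustration only, not the authors' MCL, MBS, or MGI models.

```python
# Sketch: fill a sensor stream's gaps from its temporal neighbours.
import numpy as np

def linear_mean_impute(series):
    """series: 1-D array of one sensor's readings over time, NaN where missing."""
    s = np.asarray(series, dtype=float).copy()
    idx = np.arange(s.size)
    observed = ~np.isnan(s)
    if not observed.any():
        return s                                   # nothing to interpolate from
    # Interior gaps are interpolated between the nearest observed readings;
    # np.interp holds the boundary values flat at the ends of the series.
    s[~observed] = np.interp(idx[~observed], idx[observed], s[observed])
    return s
```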


Author(s):  
Ahmad R. Alsaber ◽  
Jiazhu Pan ◽  
Adeeba Al-Hurban 

In environmental research, missing data are often a challenge for statistical modeling. This paper addresses advanced techniques for dealing with missing values in an air quality data set using a multiple imputation (MI) approach. The MCAR, MAR, and NMAR missingness mechanisms are applied to the data set, at five missing data levels: 5%, 10%, 20%, 30%, and 40%. The imputation method used in this paper is missForest, an iterative imputation method related to the random forest approach. Air quality data sets were gathered from five monitoring stations in Kuwait and aggregated to a daily basis. A logarithm transformation was applied to all pollutant data in order to normalize their distributions and minimize skewness. We found high levels of missing values for NO2 (18.4%), CO (18.5%), PM10 (57.4%), SO2 (19.0%), and O3 (18.2%). Climatological data (i.e., air temperature, relative humidity, wind direction, and wind speed) were used as control variables for better estimation. The results show that the MAR mechanism had the lowest RMSE and MAE. We conclude that MI using the missForest approach estimates missing values with a high level of accuracy; missForest had the lowest imputation error (RMSE and MAE) among the imputation methods compared and can therefore be considered appropriate for analyzing air quality data.
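A rough Python analogue of the workflow described above is sketched below, using scikit-learn's IterativeImputer with a random forest estimator as a stand-in for the missForest R package the authors used. The function name, the column lists, the log1p transform, and the iteration settings are assumptions for the example.

```python
# Sketch: iterative random-forest imputation of log-transformed pollutant columns,
# with climatological variables retained as predictors.
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

def impute_air_quality(df, pollutant_cols, control_cols, random_state=0):
    data = df[pollutant_cols + control_cols].copy()
    data[pollutant_cols] = np.log1p(data[pollutant_cols])     # reduce skewness
    imputer = IterativeImputer(
        estimator=RandomForestRegressor(n_estimators=100, random_state=random_state),
        max_iter=10,
        random_state=random_state,
    )
    imputed = pd.DataFrame(imputer.fit_transform(data),
                           columns=data.columns, index=data.index)
    imputed[pollutant_cols] = np.expm1(imputed[pollutant_cols])  # back-transform
    return imputed

# Example call with hypothetical column names:
# impute_air_quality(df, ["NO2", "CO", "PM10", "SO2", "O3"],
#                    ["temp", "rel_humidity", "wind_dir", "wind_speed"])
```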


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Nishith Kumar ◽  
Md. Aminul Hoque ◽  
Masahiro Sugimoto

Mass spectrometry is a modern and sophisticated high-throughput analytical technique that enables large-scale metabolomic analyses. It yields a high-dimensional, large-scale matrix (samples × metabolites) of quantified data that often contains missing cells as well as outliers, which originate from several sources, both technical and biological. Although several missing data imputation techniques are described in the literature, conventional techniques address only the missing values and do not mitigate outliers; outliers in the dataset therefore decrease the accuracy of the imputation. We developed a new kernel weight function-based missing data imputation technique that addresses both missing values and outliers. We evaluated the performance of the proposed method and of conventional and recently developed imputation techniques using both artificially generated data and experimentally measured data, in the absence and presence of different rates of outliers. Results on both artificial data and real metabolomics data indicate that the proposed kernel weight-based missing data imputation technique is superior to the existing alternatives. For user convenience, an R package implementing the proposed kernel weight-based missing value imputation technique was developed and is available at https://github.com/NishithPaul/tWLSA.
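As a loose illustration of the general idea of kernel weighting against outliers (not the authors' tWLSA method, which is a more elaborate two-way weighted approach implemented in R), the sketch below assigns each observed value a Gaussian-kernel weight based on its robust z-score, so that outliers contribute little to the estimate used for imputation. The function name and bandwidth are assumptions for the example.

```python
# Sketch: outlier-damped, kernel-weighted column means used as imputed values.
import numpy as np

def kernel_weighted_impute(X, bandwidth=2.0):
    X = X.copy()
    for j in range(X.shape[1]):
        col = X[:, j]
        miss = np.isnan(col)
        obs = col[~miss]
        if obs.size == 0 or not miss.any():
            continue
        med = np.median(obs)
        mad = np.median(np.abs(obs - med)) * 1.4826   # robust scale estimate
        mad = mad if mad > 0 else 1e-6
        z = (obs - med) / mad
        w = np.exp(-0.5 * (z / bandwidth) ** 2)       # Gaussian kernel weights
        X[miss, j] = np.sum(w * obs) / np.sum(w)      # outliers get little weight
    return X
```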


2018 ◽  
Vol 2018 ◽  
pp. 1-9 ◽  
Author(s):  
Min-Wei Huang ◽  
Wei-Chao Lin ◽  
Chih-Fong Tsai

Many real-world medical datasets contain some proportion of missing (attribute) values. In general, missing value imputation can be performed to solve this problem: estimates for the missing values are produced by a reasoning process based on the (complete) observed data. However, if the observed data contain noisy information or outliers, the estimates of the missing values may not be reliable and may even differ considerably from the real values. The aim of this paper is to examine whether a combination of instance selection from the observed data and missing value imputation offers better performance than performing missing value imputation alone. In particular, three instance selection algorithms, DROP3, GA, and IB3, and three imputation algorithms, KNNI, MLP, and SVM, are combined in order to find the best pairing. The experimental results show that performing instance selection can have a positive impact on missing value imputation for numerical medical datasets, and that specific combinations of instance selection and imputation methods can improve the imputation results for mixed-type medical datasets. However, instance selection does not have a clearly positive impact on the imputation results for categorical medical datasets.
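The combination being tested can be sketched as follows: a simple edited-nearest-neighbour style filter on the complete cases stands in for DROP3/GA/IB3, and scikit-learn's KNNImputer stands in for KNNI. Both stand-ins, the function name, and the parameter k are assumptions for the example.

```python
# Sketch: instance selection on the observed (complete) cases, then k-NN imputation
# that uses only the retained instances as neighbours.
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.neighbors import KNeighborsClassifier

def select_then_impute(X, y, k=5):
    complete = ~np.isnan(X).any(axis=1)
    Xc, yc = X[complete], y[complete]
    # Edited-NN style filter: drop complete cases whose label disagrees with the
    # prediction from their k nearest complete-case neighbours (a rough noise filter).
    knn = KNeighborsClassifier(n_neighbors=k).fit(Xc, yc)
    keep = knn.predict(Xc) == yc
    # Assumes at least k complete cases survive the filter.
    imputer = KNNImputer(n_neighbors=k).fit(Xc[keep])
    return imputer.transform(X)   # impute every instance from the retained neighbours
```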

