Deep learning for missing value imputation of continuous data and the effect of data discretization

Missing Value Imputation of Time-Series Air-Quality Data via Deep Neural Networks

International Journal of Environmental Research and Public Health ◽

10.3390/ijerph182212213 ◽

2021 ◽

Vol 18 (22) ◽

pp. 12213

Author(s):

Taesung Kim ◽

Jinhee Kim ◽

Wonho Yang ◽

Hunjoo Lee ◽

Jaegul Choo

Keyword(s):

Time Series ◽

Deep Learning ◽

Air Quality ◽

Time Series Data ◽

Quality Data ◽

Series Data ◽

Missing Value ◽

Missing Value Imputation ◽

Spatio Temporal ◽

Air Quality Data

To prevent severe air pollution, it is important to analyze time-series air quality data, but this is often challenging because the data are usually partially missing, especially when collected from multiple locations simultaneously. To solve this problem, various deep-learning-based missing value imputation models have been proposed. However, they are often barely interpretable, which makes it difficult to analyze the imputed data. We therefore propose a novel deep-learning-based imputation model that achieves high interpretability while also performing well at missing value imputation for spatio-temporal data. We verify the effectiveness of our method through quantitative and qualitative results on a publicly available air-quality dataset.


A robust deep learning model for missing value imputation in big NCDC dataset

Iran Journal of Computer Science ◽

10.1007/s42044-020-00065-z ◽

2020 ◽

Author(s):

Ibrahim Gad ◽

Doreswamy Hosahalli ◽

B. R. Manjunatha ◽

Osama A. Ghoneim

Keyword(s):

Deep Learning ◽

Learning Model ◽

Missing Value ◽

Missing Value Imputation ◽

Deep Learning Model


Missing value imputation on multidimensional time series

Proceedings of the VLDB Endowment ◽

10.14778/3476249.3476300 ◽

2021 ◽

Vol 14 (11) ◽

pp. 2533-2545

Author(s):

Parikshit Bansal ◽

Prathamesh Deshpande ◽

Sunita Sarawagi

Keyword(s):

Time Series ◽

Deep Learning ◽

Missing Data ◽

Matrix Factorization ◽

Missing Values ◽

Learning Methods ◽

Missing Value ◽

Missing Value Imputation ◽

Multidimensional Time Series ◽

Factorization Methods

We present DeepMVI, a deep learning method for missing value imputation in multidimensional time-series datasets. Missing values are commonplace in decision support platforms that aggregate data over long time stretches from disparate sources, whereas reliable data analytics calls for careful handling of missing data. One strategy is imputing the missing values, and a wide variety of algorithms exist spanning simple interpolation, matrix factorization methods like SVD, statistical models like Kalman filters, and recent deep learning methods. We show that often these provide worse results on aggregate analytics compared to just excluding the missing data. DeepMVI expresses the distribution of each missing value conditioned on coarse and fine-grained signals along a time series, and signals from correlated series at the same time. Instead of resorting to linearity assumptions of conventional matrix factorization methods, DeepMVI harnesses a flexible deep network to extract and combine these signals in an end-to-end manner. To prevent over-fitting with high-capacity neural networks, we design a robust parameter training with labeled data created using synthetic missing blocks around available indices. Our neural network uses a modular design with a novel temporal transformer with convolutional features, and kernel regression with learned embeddings. Experiments across ten real datasets, five different missing scenarios, comparing seven conventional and three deep learning methods show that DeepMVI is significantly more accurate, reducing error by more than 50% in more than half the cases, compared to the best existing method. Although slower than simpler matrix factorization methods, we justify the increased time overheads by showing that DeepMVI provides significantly more accurate imputation that finally impacts quality of downstream analytics.
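The idea of creating labeled training data from synthetic missing blocks around available indices can be illustrated with a minimal sketch. This is not DeepMVI's actual procedure; the function name `synthetic_block_examples`, its parameters, and the toy series are assumptions made for illustration only. The sketch hides contiguous blocks whose true values are observed, so each hidden block can serve as a training label.

```python
import random

def synthetic_block_examples(series, block_len, n_examples, seed=0):
    """Create (masked_series, start, targets) triples by hiding fully observed blocks.

    Hypothetical helper: `None` marks a genuinely missing value in `series`.
    """
    rng = random.Random(seed)
    # Candidate start positions whose whole block is observed, so labels are known.
    starts = [
        s for s in range(len(series) - block_len + 1)
        if all(v is not None for v in series[s:s + block_len])
    ]
    if not starts:
        raise ValueError("no fully observed block of the requested length")
    examples = []
    for _ in range(n_examples):
        s = rng.choice(starts)
        masked = list(series)
        masked[s:s + block_len] = [None] * block_len  # synthetic missing block
        examples.append((masked, s, series[s:s + block_len]))
    return examples

# Toy series with two genuinely missing entries (assumed data).
series = [0.1, 0.4, None, 0.9, 1.2, 1.5, 1.1, None, 0.7, 0.3]
examples = synthetic_block_examples(series, block_len=2, n_examples=3)
for masked, start, targets in examples:
    print(start, targets, masked)
```

A model would then be trained to predict `targets` from `masked`; the genuinely missing entries stay hidden in every example, so they never leak into the labels.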


Enriching Integrated Statistical Open City Data by Combining Equational Knowledge and Missing Value Imputation

SSRN Electronic Journal ◽

10.2139/ssrn.3199313 ◽

2018 ◽

Author(s):

Stefan Bischof ◽

Andreas Harth ◽

Benedikt Kämpgen ◽

Axel Polleres ◽

Patrik Schneider

Keyword(s):

Missing Value ◽

Missing Value Imputation


Effective Missing Value Imputation Methods for Building Monitoring Data

2020 IEEE International Conference on Big Data (Big Data) ◽

10.1109/bigdata50022.2020.9378230 ◽

2020 ◽

Author(s):

Brian Cho ◽

Teresa Dayrit ◽

Yuan Gao ◽

Zhe Wang ◽

Tianzhen Hong ◽

...

Keyword(s):

Monitoring Data ◽

Missing Value ◽

Imputation Methods ◽

Missing Value Imputation ◽

Building Monitoring


IFGAN: Missing Value Imputation using Feature-specific Generative Adversarial Networks

2020 IEEE International Conference on Big Data (Big Data) ◽

10.1109/bigdata50022.2020.9378240 ◽

2020 ◽

Author(s):

Wei Qiu ◽

Yangsibo Huang ◽

Quanzheng Li

Keyword(s):

Generative Adversarial Networks ◽

Missing Value ◽

Missing Value Imputation ◽

Adversarial Networks


A data-driven missing value imputation approach for longitudinal datasets

Artificial Intelligence Review ◽

10.1007/s10462-021-09963-5 ◽

2021 ◽

Author(s):

Caio Ribeiro ◽

Alex A. Freitas

Keyword(s):

Missing Data ◽

Longitudinal Data ◽

Missing Values ◽

Error Rates ◽

Imputation Method ◽

Data Driven ◽

Missing Value ◽

Missing Value Imputation ◽

Human Ageing ◽

Imputation Approach

Longitudinal datasets of human ageing studies usually have a high volume of missing data, and one way to handle missing values in a dataset is to replace them with estimations. However, there are many methods to estimate missing values, and no single method is the best for all datasets. In this article, we propose a data-driven missing value imputation approach that performs a feature-wise selection of the best imputation method, using known information in the dataset to rank the five methods we selected, based on their estimation error rates. We evaluated the proposed approach in two sets of experiments: a classifier-independent scenario, where we compared the applicabilities and error rates of each imputation method; and a classifier-dependent scenario, where we compared the predictive accuracy of Random Forest classifiers generated with datasets prepared using each imputation method and a baseline approach of doing no imputation (letting the classification algorithm handle the missing values internally). Based on our results from both sets of experiments, we concluded that the proposed data-driven missing value imputation approach generally resulted in models with more accurate estimations for missing data and better performing classifiers, in longitudinal datasets of human ageing. We also observed that imputation methods devised specifically for longitudinal data had very accurate estimations. This reinforces the idea that using the temporal information intrinsic to longitudinal data is a worthwhile endeavour for machine learning applications, and that can be achieved through the proposed data-driven approach.
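The feature-wise selection described in this abstract can be sketched roughly as follows. This is an illustrative assumption, not the authors' implementation: the three candidate methods (mean, median, previous observation carried forward) stand in for the five methods the paper ranks, and all function names, the hold-out rule, and the toy table are hypothetical. For each feature, some known values are hidden, each method imputes them, and the method with the lowest error on the hidden values is chosen.

```python
import statistics

def impute_mean(col):
    fill = statistics.fmean(v for v in col if v is not None)
    return [fill if v is None else v for v in col]

def impute_median(col):
    fill = statistics.median(v for v in col if v is not None)
    return [fill if v is None else v for v in col]

def impute_previous(col):
    """Carry the previous observed value forward; leading gaps get the first observed value."""
    last = next(v for v in col if v is not None)
    out = []
    for v in col:
        last = last if v is None else v
        out.append(last)
    return out

# Hypothetical candidate set; the paper ranks five methods of its own choosing.
METHODS = {"mean": impute_mean, "median": impute_median, "previous": impute_previous}

def masked_error(col, method, holdout):
    """Mean absolute error on known values that are hidden before imputing."""
    hidden = [None if i in holdout else v for i, v in enumerate(col)]
    filled = method(hidden)
    return sum(abs(filled[i] - col[i]) for i in holdout) / len(holdout)

def select_methods(table, holdout_every=3):
    """Pick, per feature, the method with the lowest error on held-out known values.

    Assumes every feature has at least two known values.
    """
    choice = {}
    for name, col in table.items():
        known = [i for i, v in enumerate(col) if v is not None]
        holdout = set(known[1::holdout_every])  # hide a regular subset of known entries
        errors = {m: masked_error(col, f, holdout) for m, f in METHODS.items()}
        choice[name] = min(errors, key=errors.get)
    return choice

# Toy data (assumed): one trending feature, one flat feature with an outlier.
table = {
    "trend": [1.0, 2.0, 3.0, None, 5.0, 6.0, 7.0, 8.0],
    "flat_with_spike": [5.0, 5.0, None, 5.0, 50.0, 5.0, 5.0, 5.0],
}
selected = select_methods(table)
print(selected)
```

Selecting per feature rather than once for the whole dataset lets a temporally ordered feature use a time-aware method while an outlier-prone feature falls back to a robust statistic.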


A systematic review of machine learning-based missing value imputation techniques

Data Technologies and Applications ◽

10.1108/dta-12-2020-0298 ◽

2021 ◽

Vol ahead-of-print (ahead-of-print) ◽

Author(s):

Tressy Thomas ◽

Enayat Rajabi

Keyword(s):

Machine Learning ◽

Selection Process ◽

Evaluation Metrics ◽

Correct Prediction ◽

Data Sets ◽

Data Set ◽

Missing Value ◽

Content Type ◽

Missing Value Imputation ◽

Literature Reviews

Purpose: The primary aim of this study is to review the studies from different dimensions including type of methods, experimentation setup and evaluation metrics used in the novel approaches proposed for data imputation, particularly in the machine learning (ML) area. This ultimately provides an understanding about how well the proposed framework is evaluated and what type and ratio of missingness are addressed in the proposals. The review questions in this study are (1) What are the ML-based imputation methods studied and proposed during 2010–2020? (2) How are the experimentation setup, characteristics of data sets and missingness employed in these studies? (3) What metrics were used for the evaluation of imputation methods?

Design/methodology/approach: The review process went through the standard identification, screening and selection process. The initial search on electronic databases for missing value imputation (MVI) based on ML algorithms returned a large number of papers, totaling 2,883. Most of the papers at this stage were not exactly an MVI technique relevant to this study. The literature reviews are first scanned in the title for relevancy, and 306 literature reviews were identified as appropriate. Upon reviewing the abstract text, 151 literature reviews that are not eligible for this study are dropped. This resulted in 155 research papers suitable for full-text review. From this, 117 papers are used in assessment of the review questions.

Findings: This study shows that clustering- and instance-based algorithms are the most proposed MVI methods. Percentage of correct prediction (PCP) and root mean square error (RMSE) are the most used evaluation metrics in these studies. For experimentation, the majority of the studies sourced the data sets from publicly available data set repositories. A common approach is that the complete data set is set as baseline to evaluate the effectiveness of imputation on the test data sets with artificially induced missingness. The data set size and missingness ratio varied across the experimentations, while missing datatype and mechanism are pertaining to the capability of imputation. Computational expense is a concern, and experimentation using large data sets appears to be a challenge.

Originality/value: It is understood from the review that there is no single universal solution to the missing data problem. Variants of ML approaches work well with the missingness based on the characteristics of the data set. Most of the methods reviewed lack generalization with regard to applicability. Another concern related to applicability is the complexity of the formulation and implementation of the algorithm. Imputation methods based on k-nearest neighbors (kNN) and clustering algorithms, which are simple and easy to implement, are popular across various domains.
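The evaluation setup this abstract reports as common (a complete data set as baseline, artificially induced missingness, and RMSE as the metric) can be sketched in a few lines. The helper names, the mean-imputation stand-in, the 25% missingness ratio, and the toy values are all assumptions for illustration; none of them come from the reviewed studies.

```python
import math
import random

def induce_missingness(values, ratio, seed=0):
    """Return a copy with roughly `ratio` of entries set to None, plus the masked indices."""
    rng = random.Random(seed)
    # Hide at least one entry, and always leave at least one observed.
    n_missing = min(len(values) - 1, max(1, round(ratio * len(values))))
    masked = sorted(rng.sample(range(len(values)), n_missing))
    corrupted = list(values)
    for i in masked:
        corrupted[i] = None
    return corrupted, masked

def mean_impute(values):
    """Stand-in imputation method: fill gaps with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = sum(observed) / len(observed)
    return [fill if v is None else v for v in values]

def rmse(truth, imputed, indices):
    """Root mean square error, computed only on the artificially masked entries."""
    return math.sqrt(sum((truth[i] - imputed[i]) ** 2 for i in indices) / len(indices))

# The complete data set acts as ground truth (toy values, assumed).
complete = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0]
corrupted, masked = induce_missingness(complete, ratio=0.25)
imputed = mean_impute(corrupted)
score = rmse(complete, imputed, masked)
print(f"masked indices: {masked}, RMSE on masked entries: {score:.3f}")
```

Scoring only the masked positions keeps the metric from being diluted by entries the method never had to estimate.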


Missing value imputation in multi-environment trials: Reconsidering the Krzanowski method

Crop Breeding and Applied Biotechnology ◽

10.1590/1984-70332016v16n2a13 ◽

2016 ◽

Vol 16 (2) ◽

pp. 77-85 ◽

Cited By ~ 2

Author(s):

Sergio Arciniegas-Alarcón ◽

Marisol García-Peña ◽

Wojtek Krzanowski

Keyword(s):

Missing Value ◽

Missing Value Imputation


Missing value imputation for microarray gene expression data using histone acetylation information

BMC Bioinformatics ◽

10.1186/1471-2105-9-252 ◽

2008 ◽

Vol 9 (1) ◽

Cited By ~ 24

Author(s):

Qian Xiang ◽

Xianhua Dai ◽

Yangyang Deng ◽

Caisheng He ◽

Jiang Wang ◽

...

Keyword(s):

Gene Expression ◽

Histone Acetylation ◽

Gene Expression Data ◽

Microarray Gene Expression Data ◽

Expression Data ◽

Microarray Gene Expression ◽

Missing Value ◽

Missing Value Imputation ◽

Microarray Gene
