Missing Data Imputation on IoT Sensor Networks: Implications for on-site Sensor Calibration

10.36227/techrxiv.14986662 ◽

2021 ◽

Author(s):

Nwamaka Okafor ◽

Declan Delaney

Keyword(s):

Missing Data ◽

Missing Values ◽

Sensor Calibration ◽

Data Set ◽

Missing Data Imputation ◽

Substantial Impact ◽

Calibration Methods ◽

Real World Datasets ◽

Operational Issues ◽

Calibration Techniques

IoT sensors are becoming an increasingly important supplement to traditional monitoring systems, particularly for in-situ monitoring. Data collected using IoT sensors are often plagued with missing values resulting from sensor faults, network failures, drift and other operational issues. Missing data can have a substantial impact on in-field sensor calibration methods. The goal of this research is to achieve effective calibration of sensors in the presence of such missing data. To this end, the paper pursues two objectives: 1) identify and examine effective imputation strategies for missing data in IoT sensors; 2) determine sensor calibration performance when calibration techniques are applied to data sets with imputed values. Specifically, this paper examines the performance of Variational Autoencoder (VAE), Neural Network with Random Weights (NNRW), Multiple Imputation by Chained Equations (MICE), random-forest-based imputation (missForest) and K-Nearest Neighbour (KNN) for imputation of missing values on IoT sensors. Furthermore, the performance of sensor calibration via different supervised algorithms trained on the imputed dataset was evaluated. The analysis showed that the VAE technique outperforms the others in imputing missing values at different proportions of missingness on two real-world datasets. Experimental results also showed improved calibration performance with the imputed dataset.

Missing Data Imputation on IoT Sensor Networks: Implications for on-site Sensor Calibration

10.36227/techrxiv.13633529.v2 ◽

2021 ◽

Author(s):

Nwamaka Okafor ◽

Declan Delaney

Keyword(s):

Missing Data ◽

Data Collection ◽

Missing Values ◽

Sensor Calibration ◽

Monitoring Systems ◽

Data Imputation ◽

Sensor Faults ◽

Missing Data Imputation ◽

Operational Issues

IoT sensors are becoming an increasingly important supplement to traditional monitoring systems, particularly for in-situ monitoring. However, data collected by IoT sensors are often plagued with missing values, usually resulting from sensor faults, network failures, drift and other operational issues.

Reviewing Autoencoders for Missing Data Imputation: Technical Trends, Applications and Outcomes

Journal of Artificial Intelligence Research ◽

10.1613/jair.1.12312 ◽

2020 ◽

Vol 69 ◽

pp. 1255-1285

Author(s):

Ricardo Cardoso Pereira ◽

Miriam Seoane Santos ◽

Pedro Pereira Rodrigues ◽

Pedro Henriques Abreu

Keyword(s):

Missing Data ◽

Missing Values ◽

State Of The Art ◽

Data Imputation ◽

Tabular Data ◽

Missing Data Imputation ◽

Learning Techniques ◽

Real World Datasets ◽

And Training ◽

Machine Learning Models

Missing data is a problem often found in real-world datasets and it can degrade the performance of most machine learning models. Several deep learning techniques have been used to address this issue, and one of them is the Autoencoder and its Denoising and Variational variants. These models are able to learn a representation of the data with missing values and generate plausible new ones to replace them. This study surveys the use of Autoencoders for the imputation of tabular data and considers 26 works published between 2014 and 2020. The analysis is mainly focused on discussing patterns and recommendations for the architecture, hyperparameters and training settings of the network, while providing a detailed discussion of the results obtained by Autoencoders when compared to other state-of-the-art methods, and of the data contexts where they have been applied. The conclusions include a set of recommendations for the technical settings of the network, and show that Denoising Autoencoders outperform their competitors, particularly the often used statistical methods.

Comparison of Missing Data Infilling Mechanisms for Recovering a Real-World Single Station Streamflow Observation

International Journal of Environmental Research and Public Health ◽

10.3390/ijerph18168375 ◽

2021 ◽

Vol 18 (16) ◽

pp. 8375

Author(s):

Thelma Dede Baddoo ◽

Zhijia Li ◽

Samuel Nii Odai ◽

Kenneth Rodolphe Chabi Boni ◽

Isaac Kwesi Nooni ◽

...

Keyword(s):

Missing Data ◽

Multiple Imputation ◽

Real World ◽

Missing Values ◽

Total Error ◽

Extensive Study ◽

Error Measurement ◽

Missing Data Imputation ◽

Single Station ◽

Real World Datasets

Reconstructing missing streamflow data can be challenging when additional data are not available, and studies that impute real-world datasets in order to assess the accuracy of imputation algorithms on such data are lacking. This study investigated how complex a missing data reconstruction scheme needs to be to obtain usable results for a real-world single station streamflow observation and so facilitate its further use. The investigation applied different missing data methods, ranging from univariate algorithms to multiple imputation methods designed for multivariate data, taking time as an explicit variable. The accuracy of these schemes was assessed using the total error measurement (TEM) and a localized error measurement (LEM) recommended in this study. The results show that univariate missing value algorithms, which are specially developed to handle univariate time series, provide satisfactory results, but those that provide the best results are usually time and computationally intensive. Also, multiple imputation algorithms which consider the surrounding observed values and/or which can capture the characteristics of the data provide similar results to the univariate missing data algorithms and, in some cases, perform better without the added time and computational downsides when time is taken as an explicit variable. Furthermore, the LEM would be especially useful when the missing data are in specific portions of the dataset or where very large gaps of ‘missingness’ occur. Finally, proper handling of missing values in real-world hydroclimatic datasets depends on extensive study of the particular dataset to be imputed.
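
As a point of reference for the univariate algorithms mentioned above, the sketch below fills gaps in a single series by linear interpolation between the nearest observed neighbours. Linear interpolation is one of the simplest univariate schemes and is chosen here purely for illustration; the abstract does not list which algorithms the study compared, and the function name and sample values are hypothetical.

```python
def interpolate_gaps(series):
    """Fill None entries by linear interpolation between observed neighbours.

    Gaps at the start or end of the series, which have only one observed
    neighbour, are filled by copying that neighbour's value.
    """
    observed = [i for i, value in enumerate(series) if value is not None]
    if not observed:
        raise ValueError("series has no observed values")

    filled = list(series)
    first, last = observed[0], observed[-1]

    # Leading and trailing gaps: carry the nearest observation outward.
    for i in range(first):
        filled[i] = series[first]
    for i in range(last + 1, len(series)):
        filled[i] = series[last]

    # Interior gaps: straight line between the two bracketing observations.
    for left, right in zip(observed, observed[1:]):
        step = (series[right] - series[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = series[left] + step * (i - left)
    return filled


# Hypothetical daily streamflow values with two gaps.
flow = [1.0, None, None, 4.0, None]
print(interpolate_gaps(flow))
```

A scheme like this uses no information beyond the series itself, which is why it is cheap but tends to struggle with very long gaps.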

MISSING DATA IMPUTATION ON IOT SENSOR NETWORKS: IMPLICATIONS FOR ON-SITE SENSOR CALIBRATION

10.36227/techrxiv.13633529 ◽

2021 ◽

Author(s):

Nwamaka Okafor

Keyword(s):

Sensor Networks ◽

Missing Data ◽

Random Forest ◽

Data Collection ◽

Missing Values ◽

Imputation Accuracy ◽

Sensor Calibration ◽

Concentration Data ◽

Missing Data Imputation

IoT sensors are gaining popularity in the environmental monitoring space due to their relatively small size, low acquisition cost and ease of installation and operation. They are becoming an increasingly important supplement to traditional monitoring systems, particularly for in-situ monitoring. However, data collected by IoT sensors are often plagued with missing values, usually resulting from sensor faults, network failures, drift and other operational issues. Several imputation strategies have been proposed for handling missing values in various application domains. This paper examines the performance of different imputation techniques, including Multiple Imputation by Chained Equations (MICE), random-forest-based imputation (missForest) and K-Nearest Neighbour (KNN), for handling missing values on sensor networks deployed for the quantification of Greenhouse Gases (GHGs). Two tasks were conducted. First, Ozone (O3) and NO2/O3 concentration data, collected using Aeroqual and Cairclip sensors respectively over a six-month data collection period, were corrupted by removing data intervals of different lengths (p), where p ∈ {1 day, 1 week, 2 weeks, 1 month}, and also by removing random points from the dataset at varying proportions (r), where r ∈ {5%, 10%, 30%, 50%, 70%}. The missing data were then filled using the different imputation strategies and their imputation accuracy was calculated. Second, the performance of sensor calibration by different regression models, including Multiple Linear Regression (MLR), Decision Tree (DT), Random Forest (RF) and XGBoost (XGB), trained on the different imputed datasets was evaluated. The analysis showed that the MICE technique outperformed the others in imputing the missing values on both the O3 and NO2/O3 datasets when missingness was introduced over periods p. MissForest, however, outperformed the rest when missingness was introduced as randomly occurring point errors. While the analysis demonstrated the effects of missing and imputed data on sensor calibration, experimental results showed that a simple model trained on the imputed dataset can achieve state-of-the-art results on in-situ sensor calibration, improving the data quality of the sensor.
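
The corrupt-then-impute experiment described above can be sketched in a few lines. The following is a minimal illustration, not the authors' code: it hides a proportion r of one column at random, fills the gaps with a plain k-nearest-neighbour mean, and scores the result with RMSE on the hidden entries only. The function names (`mask_at_random`, `knn_impute`, `rmse`), the synthetic two-column data and the choice of k = 3 are all assumptions made for the example.

```python
import math
import random


def mask_at_random(rows, column, proportion, seed=0):
    """Return a copy of rows with `proportion` of `column` set to None,
    together with the sorted indices of the hidden entries."""
    rng = random.Random(seed)
    n_hidden = round(proportion * len(rows))
    hidden = sorted(rng.sample(range(len(rows)), n_hidden))
    corrupted = [list(row) for row in rows]
    for i in hidden:
        corrupted[i][column] = None
    return corrupted, hidden


def knn_impute(rows, column, k=3):
    """Fill None entries in `column` with the mean of the k nearest donors.

    Donors are rows where `column` is observed; distance is Euclidean over
    the remaining columns, which this sketch assumes are fully observed.
    At least one donor row must exist.
    """
    others = [j for j in range(len(rows[0])) if j != column]
    donors = [row for row in rows if row[column] is not None]
    filled = [list(row) for row in rows]
    for row in filled:
        if row[column] is None:
            nearest = sorted(
                donors,
                key=lambda d: sum((d[j] - row[j]) ** 2 for j in others),
            )[:k]
            row[column] = sum(d[column] for d in nearest) / len(nearest)
    return filled


def rmse(truth, estimate, indices, column):
    """Root-mean-square error over the hidden entries only."""
    squared = [(truth[i][column] - estimate[i][column]) ** 2 for i in indices]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical data: column 0 is a fully observed covariate,
# column 1 stands in for the sensor reading to be imputed.
data = [[float(t), 2.0 * t + 1.0] for t in range(50)]
for r in (0.05, 0.10, 0.30, 0.50, 0.70):
    corrupted, hidden = mask_at_random(data, column=1, proportion=r)
    imputed = knn_impute(corrupted, column=1)
    print(f"r={r:.2f}  RMSE={rmse(data, imputed, hidden, column=1):.3f}")
```

Scoring only the hidden entries matters: including the untouched values would dilute the error and make every method look better as r shrinks. Removing contiguous intervals of length p instead of random points would need only a different masking function.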

Classifiers Accuracy Improvement Based on Missing Data Imputation

Journal of Artificial Intelligence and Soft Computing Research ◽

10.1515/jaiscr-2018-0002 ◽

2018 ◽

Vol 8 (1) ◽

pp. 31-48 ◽

Cited By ~ 11

Author(s):

Ivan Jordanov ◽

Nedyalko Petrov ◽

Alessio Petrozziello

Keyword(s):

Classification Accuracy ◽

Missing Values ◽

Statistical Significance ◽

Roc Curves ◽

Radar Signal ◽

Support Vector ◽

Data Set ◽

Missing Data Imputation ◽

Vector Machines ◽

Real World Datasets

In this paper we extend our previous work on radar signal identification and classification, based on a data set comprising continuous, discrete and categorical data that represent radar pulse train characteristics such as signal frequencies, pulse repetition, type of modulation, intervals, scan period, scanning type, etc. Like most real-world datasets, it contains a high percentage of missing values; to deal with this problem we investigate three imputation techniques: Multiple Imputation (MI), K-Nearest Neighbour Imputation (KNNI) and Bagged Tree Imputation (BTI). We apply these methods to data samples with up to 60% missingness, thereby doubling the number of instances with complete values in the resulting dataset. The performance of the imputation models is assessed with Wilcoxon’s test for statistical significance and Cohen’s effect size metrics. To solve the classification task, we employ three intelligent approaches: Neural Networks (NN), Support Vector Machines (SVM) and Random Forests (RF). Subsequently, we critically analyse which imputation method most influences the classifiers’ performance, using a multiclass classification accuracy metric based on the area under the ROC curves. We consider two superclasses (‘military’ and ‘civil’), each containing several ‘subclasses’, and propose two new metrics, inner class accuracy (IA) and outer class accuracy (OA), in addition to the overall classification accuracy (OCA) metric. We conclude that they can be used as complementary to the OCA when choosing the best classifier for the problem at hand.
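
Cohen's effect size, mentioned above as one of the measures for comparing imputation models, has a simple closed form for two samples. The sketch below assumes the common pooled-standard-deviation variant (Cohen's d); the abstract does not say which variant the authors used, and the function name and sample scores are invented for illustration.

```python
import math
import statistics


def cohens_d(sample_a, sample_b):
    """Cohen's d: difference in means divided by the pooled standard deviation.

    Both samples need at least two values so that a sample variance exists.
    """
    n_a, n_b = len(sample_a), len(sample_b)
    var_a = statistics.variance(sample_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(sample_b)
    pooled_sd = math.sqrt(
        ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    )
    return (statistics.mean(sample_a) - statistics.mean(sample_b)) / pooled_sd


# Hypothetical accuracy scores obtained with two imputation methods.
method_a = [2.0, 4.0, 6.0]
method_b = [1.0, 3.0, 5.0]
print(cohens_d(method_a, method_b))
```

A significance test such as Wilcoxon's indicates whether a difference is likely to be real, while an effect size like this one indicates how large it is, which is why the two are usually reported together.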

MISSING DATA IMPUTATION ON IOT SENSOR NETWORKS: IMPLICATIONS FOR ON-SITE SENSOR CALIBRATION

10.36227/techrxiv.13633529.v1 ◽

2021 ◽

Author(s):

Nwamaka Okafor

Keyword(s):

Sensor Networks ◽

Missing Data ◽

Random Forest ◽

Data Collection ◽

Missing Values ◽

Imputation Accuracy ◽

Sensor Calibration ◽

Concentration Data ◽

Missing Data Imputation

IoT sensors are gaining popularity in the environmental monitoring space due to their relatively small size, low acquisition cost and ease of installation and operation. They are becoming an increasingly important supplement to traditional monitoring systems, particularly for in-situ monitoring. However, data collected by IoT sensors are often plagued with missing values, usually resulting from sensor faults, network failures, drift and other operational issues. Several imputation strategies have been proposed for handling missing values in various application domains. This paper examines the performance of different imputation techniques, including Multiple Imputation by Chained Equations (MICE), random-forest-based imputation (missForest) and K-Nearest Neighbour (KNN), for handling missing values on sensor networks deployed for the quantification of Greenhouse Gases (GHGs). Two tasks were conducted. First, Ozone (O3) and NO2/O3 concentration data, collected using Aeroqual and Cairclip sensors respectively over a six-month data collection period, were corrupted by removing data intervals of different lengths (p), where p ∈ {1 day, 1 week, 2 weeks, 1 month}, and also by removing random points from the dataset at varying proportions (r), where r ∈ {5%, 10%, 30%, 50%, 70%}. The missing data were then filled using the different imputation strategies and their imputation accuracy was calculated. Second, the performance of sensor calibration by different regression models, including Multiple Linear Regression (MLR), Decision Tree (DT), Random Forest (RF) and XGBoost (XGB), trained on the different imputed datasets was evaluated. The analysis showed that the MICE technique outperformed the others in imputing the missing values on both the O3 and NO2/O3 datasets when missingness was introduced over periods p. MissForest, however, outperformed the rest when missingness was introduced as randomly occurring point errors. While the analysis demonstrated the effects of missing and imputed data on sensor calibration, experimental results showed that a simple model trained on the imputed dataset can achieve state-of-the-art results on in-situ sensor calibration, improving the data quality of the sensor.

Missing Data Imputation – A Survey

International Journal of Decision Support System Technology ◽

10.4018/ijdsst.292446 ◽

2022 ◽

Vol 14 (1) ◽

pp. 0-0

Keyword(s):

Missing Data ◽

Linear Regression ◽

Missing Values ◽

Computational Cost ◽

Machine Learning Algorithms ◽

Classification And Regression Tree ◽

High Dimensional ◽

Missing Data Imputation ◽

Real World Datasets ◽

Incomplete Datasets

Many real-world datasets contain missing values for various reasons. These incomplete datasets can pose severe issues for the underlying machine learning algorithms and decision support systems, resulting in high computational cost, skewed output and invalid deductions. Various solutions exist to mitigate this issue; the most popular strategy is to estimate the missing values by applying inferential techniques such as linear regression, decision trees or Bayesian inference. In this paper, the missing data problem is discussed in detail with a comprehensive review of the approaches to tackle it. The paper concludes with a discussion of the effectiveness of three imputation methods, namely imputation based on Multiple Linear Regression (MLR), Predictive Mean Matching (PMM) and Classification And Regression Tree (CART), in the context of subspace clustering. The experimental results obtained on real benchmark datasets and high-dimensional synthetic datasets show that the MLR-based imputation method is more efficient on high-dimensional incomplete datasets.
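
Of the three methods named above, Predictive Mean Matching is the least self-explanatory, so a small sketch may help. The version below is deliberately simplified and is not the paper's implementation: it uses a single predictor and a plain least-squares line, and it skips the parameter-draw step that full multiple-imputation implementations add. The function names, the `donors` parameter and the sample values are assumptions for illustration.

```python
import random


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for one predictor.
    Assumes xs contains at least two distinct values."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def pmm_impute(xs, ys, donors=3, seed=0):
    """Simplified predictive mean matching for a single predictor.

    1. Fit a regression on the rows where y is observed.
    2. Predict y for every row, observed or not.
    3. For each missing row, find the `donors` observed rows whose
       predictions are closest, pick one at random, and copy its
       observed y value.
    """
    rng = random.Random(seed)
    observed = [i for i, y in enumerate(ys) if y is not None]
    slope, intercept = fit_line(
        [xs[i] for i in observed], [ys[i] for i in observed]
    )
    predicted = [slope * x + intercept for x in xs]

    filled = list(ys)
    for i, y in enumerate(ys):
        if y is None:
            pool = sorted(
                observed, key=lambda j: abs(predicted[j] - predicted[i])
            )[:donors]
            filled[i] = ys[rng.choice(pool)]
    return filled


# Hypothetical predictor and partially observed response.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, None, 6.2, 7.9, None, 12.1]
print(pmm_impute(xs, ys, donors=2))
```

Because every imputed value is copied from an observed row, PMM never produces values outside the observed range, unlike direct regression imputation, which inserts the prediction itself.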

Handling Complex Missing Data Using Random Forest Approach for an Air Quality Monitoring Dataset: A Case Study of Kuwait Environmental Data (2012 to 2018)

International Journal of Environmental Research and Public Health ◽

10.3390/ijerph18031333 ◽

2021 ◽

Vol 18 (3) ◽

pp. 1333

Author(s):

Ahmad R. Alsaber ◽

Jiazhu Pan ◽

Adeeba Al-Hurban

Keyword(s):

Air Quality ◽

Missing Data ◽

Random Forest ◽

Missing Values ◽

Imputation Method ◽

Environmental Data ◽

Environmental Research ◽

Quality Data ◽

Data Set ◽

Air Quality Data

In environmental research, missing data are often a challenge for statistical modeling. This paper addressed some advanced techniques to deal with missing values in a data set measuring air quality using a multiple imputation (MI) approach. MCAR, MAR, and NMAR missing data techniques are applied to the data set. Five missing data levels are considered: 5%, 10%, 20%, 30%, and 40%. The imputation method used in this paper is an iterative imputation method, missForest, which is related to the random forest approach. Air quality data sets were gathered from five monitoring stations in Kuwait, aggregated to a daily basis. Logarithm transformation was carried out for all pollutant data, in order to normalize their distributions and to minimize skewness. We found high levels of missing values for NO2 (18.4%), CO (18.5%), PM10 (57.4%), SO2 (19.0%), and O3 (18.2%) data. Climatological data (i.e., air temperature, relative humidity, wind direction, and wind speed) were used as control variables for better estimation. The results show that the MAR technique had the lowest RMSE and MAE. We conclude that MI using the missForest approach has a high level of accuracy in estimating missing values. MissForest had the lowest imputation error (RMSE and MAE) among the other imputation methods and, thus, can be considered to be appropriate for analyzing air quality data.
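
The three missingness mechanisms named above differ only in what the probability of a value being missing depends on. The sketch below is one crude way to simulate them and is not the procedure used in the paper: MCAR removes values with the same probability everywhere, MAR makes removal more likely when another, observed column is high, and NMAR makes it more likely when the target value itself is high. The function name, the above-median rule, the 1.5x and 0.5x weights, and the temperature/O3 sample data are all assumptions for illustration.

```python
import random
import statistics


def make_missing(rows, target, rate, mechanism, driver=None, seed=0):
    """Return a copy of rows with some entries of column `target` set to None.

    mechanism "MCAR": every row is removed with probability `rate`.
    mechanism "MAR":  rows whose observed `driver` column is above its median
                      are removed with probability 1.5 * rate, others 0.5 * rate.
    mechanism "NMAR": same rule, but driven by the target column itself.
    """
    if mechanism == "MCAR":
        key = None
    elif mechanism == "MAR":
        if driver is None:
            raise ValueError("MAR needs an observed driver column")
        key = driver
    elif mechanism == "NMAR":
        key = target
    else:
        raise ValueError(f"unknown mechanism: {mechanism}")

    rng = random.Random(seed)
    cutoff = None if key is None else statistics.median(r[key] for r in rows)

    result = []
    for row in rows:
        if key is None:
            probability = rate
        elif row[key] > cutoff:
            probability = min(1.0, 1.5 * rate)
        else:
            probability = 0.5 * rate
        new_row = list(row)
        if rng.random() < probability:
            new_row[target] = None
        result.append(new_row)
    return result


# Hypothetical rows: column 0 is air temperature, column 1 an O3 reading.
rows = [[float(t), 10.0 + t] for t in range(10)]
for mechanism in ("MCAR", "MAR", "NMAR"):
    corrupted = make_missing(rows, target=1, rate=0.4,
                             mechanism=mechanism, driver=0)
    print(mechanism, [r[1] for r in corrupted])
```

The distinction matters for evaluation: under MAR the observed covariates carry information about which values are gone, so a model-based imputer can exploit them, whereas under NMAR the cause of missingness is hidden in the missing values themselves.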

Missing Data Imputation on IoT Sensor Networks: Implications for on-site Sensor Calibration

IEEE Sensors Journal ◽

10.1109/jsen.2021.3105442 ◽

2021 ◽

pp. 1-1

Author(s):

Nwamaka U. Okafor ◽

Declan T. Delaney

Keyword(s):

Sensor Networks ◽

Missing Data ◽

Sensor Calibration ◽

Data Imputation ◽

Missing Data Imputation
