Air quality data pre-processing: A novel algorithm to impute missing values in univariate time series

Author(s):  
Lakmini Wijesekara ◽  
Liwan Liyanage
Author(s):  
Ahmad R. Alsaber ◽  
Jiazhu Pan ◽  
Adeeba Al-Hurban 

In environmental research, missing data are often a challenge for statistical modeling. This paper examines techniques for handling missing values in an air quality data set using a multiple imputation (MI) approach. The MCAR, MAR, and NMAR missing data mechanisms are applied to the data set at five levels of missingness: 5%, 10%, 20%, 30%, and 40%. The imputation method used is missForest, an iterative method based on the random forest approach. Air quality data were gathered from five monitoring stations in Kuwait and aggregated to a daily basis. All pollutant data were log-transformed to normalize their distributions and reduce skewness. We found high levels of missing values for NO2 (18.4%), CO (18.5%), PM10 (57.4%), SO2 (19.0%), and O3 (18.2%). Climatological data (i.e., air temperature, relative humidity, wind direction, and wind speed) were used as control variables for better estimation. The results show that the MAR mechanism yielded the lowest RMSE and MAE. We conclude that MI using the missForest approach estimates missing values with a high level of accuracy. MissForest had the lowest imputation error (RMSE and MAE) of the imputation methods compared and can therefore be considered appropriate for analyzing air quality data.
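The evaluation design described above (mask known values at a fixed rate, impute them, then score the imputations with RMSE and MAE) can be sketched roughly as follows. This is only an illustration under stated assumptions: the series is synthetic, only the MCAR case is shown, and linear interpolation stands in for missForest, which is an R package and is not reproduced here. The function names `mask_mcar`, `impute_linear`, and `rmse_mae` are hypothetical.

```python
import math
import random


def mask_mcar(values, rate, seed=0):
    """Copy `values`, setting a fraction `rate` of entries to None.

    Positions are drawn uniformly at random, i.e. missing completely
    at random (MCAR). MAR and NMAR masking are not shown here.
    """
    rng = random.Random(seed)
    n_missing = round(rate * len(values))
    missing_idx = set(rng.sample(range(len(values)), n_missing))
    return [None if i in missing_idx else v for i, v in enumerate(values)]


def impute_linear(series):
    """Stand-in imputer (NOT missForest): linear interpolation between
    the nearest observed neighbours, nearest observed value at the edges."""
    known = [i for i, v in enumerate(series) if v is not None]
    if not known:
        raise ValueError("series has no observed values")
    out = list(series)
    for i, v in enumerate(series):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            out[i] = series[right]
        elif right is None:
            out[i] = series[left]
        else:
            weight = (i - left) / (right - left)
            out[i] = series[left] + weight * (series[right] - series[left])
    return out


def rmse_mae(truth, imputed, masked):
    """RMSE and MAE computed only over the positions that were masked."""
    errors = [imputed[i] - truth[i] for i, v in enumerate(masked) if v is None]
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    mae = sum(abs(e) for e in errors) / len(errors)
    return rmse, mae


# Synthetic, log-transformed "daily pollutant" series (an assumption,
# not the Kuwait data): a seasonal cycle plus uniform noise.
noise = random.Random(42)
truth = [
    math.log(30 + 10 * math.sin(2 * math.pi * t / 365) + noise.uniform(0, 5))
    for t in range(730)
]

results = {}
for rate in (0.05, 0.10, 0.20, 0.30, 0.40):
    masked = mask_mcar(truth, rate, seed=1)
    imputed = impute_linear(masked)
    results[rate] = rmse_mae(truth, imputed, masked)
    print(f"{rate:.0%} missing: RMSE={results[rate][0]:.4f}, MAE={results[rate][1]:.4f}")
```

Scoring only the masked positions matters: including observed values, which any imputer leaves untouched, would make the errors look smaller as the missingness level drops.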


1975 ◽  
Vol 9 (11) ◽  
pp. 978-989 ◽  
Author(s):  
D.P. Chock ◽  
T.R. Terrell ◽  
S.B. Levitt

2014 ◽  
Vol 06 (02n03) ◽  
pp. 1450007
Author(s):  
RAYMOND K. W. WONG

The estimation and significance testing of the first-order autoregressive (AR1) coefficient in short time series with trends are examined. The purpose is to identify the difficulties to which analysis procedures need to adjust for better results. The delta recursive AR1 estimator rδ and the Sen–Theil trend estimator are viable for short-sequence applications. Significance testing for rδ has low power. However, the presence of a trend has negligible influence on estimation and testing. The common practice of removing the trend before AR1 estimation gives poorer results. Application to air quality data showed that this could greatly change conclusions. Implications for analysis are discussed.
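For readers unfamiliar with the two quantities named above, the following minimal sketch computes a Sen–Theil slope (the median of all pairwise slopes) and a lag-1 autocorrelation coefficient. Note the assumptions: the AR1 function below is the ordinary sample lag-1 autocorrelation, not the delta recursive estimator rδ studied in the paper, the observations are assumed equally spaced, and the function names and sample values are hypothetical.

```python
from itertools import combinations
from statistics import mean, median


def theil_sen_slope(y):
    """Sen-Theil trend estimate: the median of the slopes between every
    pair of points, with time taken as t = 0, 1, ..., n-1."""
    return median(
        (y[j] - y[i]) / (j - i) for i, j in combinations(range(len(y)), 2)
    )


def lag1_autocorr(y):
    """Ordinary sample lag-1 autocorrelation (a stand-in, not the
    delta recursive estimator discussed in the abstract)."""
    m = mean(y)
    numerator = sum((y[t] - m) * (y[t + 1] - m) for t in range(len(y) - 1))
    denominator = sum((v - m) ** 2 for v in y)
    return numerator / denominator


# A short illustrative series (made-up values, not real air quality data).
series = [4.1, 4.4, 4.2, 4.9, 5.0, 5.3, 5.1, 5.8]
print("Sen-Theil slope:", theil_sen_slope(series))
print("lag-1 autocorrelation:", lag1_autocorr(series))
```

Because the slope is a median rather than a least-squares fit, a single extreme observation has little effect on it, which is one reason it is attractive for short sequences.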


2015 ◽  
Vol 44 (3) ◽  
pp. 449-456 ◽  
Author(s):  
Nuryazmin Ahmat Zainuri ◽  
Abdul Aziz Jemain ◽  
Nora Muda

1977 ◽  
Vol 10 (7) ◽  
pp. 295-300
Author(s):  
Paolo Zannetti ◽  
Giovanna Finzi ◽  
Giorgio Fronza ◽  
Sergio Rinaldi

Author(s):  
Taesung Kim ◽  
Jinhee Kim ◽  
Wonho Yang ◽  
Hunjoo Lee ◽  
Jaegul Choo

To prevent severe air pollution, it is important to analyze time-series air quality data, but this is often challenging as the time-series data is usually partially missing, especially when it is collected from multiple locations simultaneously. To solve this problem, various deep-learning-based missing value imputation models have been proposed. However, often they are barely interpretable, which makes it difficult to analyze the imputed data. Thus, we propose a novel deep learning-based imputation model that achieves high interpretability as well as shows great performance in missing value imputation for spatio-temporal data. We verify the effectiveness of our method through quantitative and qualitative results on a publicly available air-quality dataset.


2020 ◽  
Vol 23 (6) ◽  
pp. 1129-1145
Author(s):  
Dezhan Qu ◽  
Xiaoli Lin ◽  
Ke Ren ◽  
Quanle Liu ◽  
Huijie Zhang

2020 ◽  
Vol 9 (2) ◽  
pp. 755-763
Author(s):  
Shamihah Muhammad Ghazali ◽  
Norshahida Shaadan ◽  
Zainura Idrus

Missing values occur in data sets across many research areas. They are recognized as a data quality problem because they can affect the performance of analysis results. To overcome the problem, an incomplete data set needs to be treated, with missing values replaced using an imputation method. The pattern of missing values must therefore be explored beforehand to determine a suitable method. This paper discusses the application of data visualisation as a smart technique for missing data exploration, aiming to increase understanding of missing data behaviour, including the missing data mechanism (MCAR, MAR and MNAR) and the distribution pattern of missingness in terms of percentage as well as gap size. It presents the application of several data visualisation tools from five R packages (visdat, VIM, ggplot2, Amelia and UpSetR) for exploring data missingness. As an illustration, several graphics based on an air quality data set in Malaysia were produced and discussed to show how the visualisation tools provide input and insight on the pattern of data missingness. The results show that missing values in the air quality data set of the chosen sites in Malaysia behave as missing at random (MAR), with a small percentage of missingness, and do contain long gaps of missingness.
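Two of the quantities mentioned above, the percentage of missingness and the gap size, are simple to compute before any plotting. The sketch below is written in Python rather than with the R packages named in the abstract, and the function name `missing_summary` and the sample readings are hypothetical. A gap is taken here to mean a run of consecutive missing values.

```python
from itertools import groupby


def missing_summary(series):
    """Summarise missingness in a univariate series where None marks a
    missing value: overall percentage and the length of each gap."""
    n_missing = sum(v is None for v in series)
    # groupby splits the series into runs of missing / observed values;
    # the length of each missing run is one gap size.
    gaps = [
        sum(1 for _ in run)
        for is_missing, run in groupby(series, key=lambda v: v is None)
        if is_missing
    ]
    return {
        "percent_missing": 100 * n_missing / len(series),
        "gap_lengths": gaps,
        "longest_gap": max(gaps, default=0),
    }


# Made-up hourly readings with three gaps of different lengths.
readings = [1.0, None, None, 3.0, None, 2.0, 4.0, None, None, None]
print(missing_summary(readings))
```

Reporting gap lengths alongside the overall percentage is useful because a small percentage of missingness can still hide a few long gaps, which is the situation the abstract describes.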

