A Novel Method for Air Quality Data Imputation by Nuclear Norm Minimization

Missing data is a frequently encountered problem in the environmental research community. To facilitate the analysis and management of air quality data (in this study, PM2.5 concentration), a commonly adopted strategy for handling missing values in the samples is to generate a complete data set using imputation methods. Many imputation methods based on temporal or spatial correlation have been developed for this purpose in the existing literature. The methods differ in how they characterize the dependence among data samples with different mathematical models, which is crucial for missing data imputation. In this paper, we propose two novel and principled imputation methods based on the nuclear norm of a matrix, since it measures such dependence in a global fashion. The first method, termed global nuclear norm minimization (GNNM), imputes missing values by directly minimizing the nuclear norm of the whole sample matrix, thus at the same time maximizing the linear dependence of samples. The second method, called local nuclear norm minimization (LNNM), concentrates on each sample and its most similar samples, which are estimated from the imputation results of the first method. In this way, nuclear norm minimization can be performed on those highly correlated samples instead of the whole sample matrix as in GNNM, thus reducing the adverse impact of irrelevant samples. The two methods are evaluated on a data set of PM2.5 concentration measured hourly by 22 monitoring stations. Missing values are simulated at different percentages. The imputed values are compared with the ground truth values to evaluate the imputation performance of different methods. The experimental results verify the effectiveness of our methods, especially LNNM, for missing air quality data imputation.
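The idea of filling gaps by keeping the nuclear norm of the sample matrix small can be illustrated with a short sketch. The code below is not the authors' GNNM solver: it uses iterative singular value soft-thresholding (a regularized relaxation often called soft-impute) as a stand-in, and the function name `soft_impute`, the shrinkage value `lam`, and the synthetic "hours x stations" matrix are all assumptions made for illustration.

```python
import numpy as np

def soft_impute(X, mask, lam=0.1, n_iter=500, tol=1e-6):
    """Fill entries where mask is False by iterative singular value soft-thresholding.

    X    : 2-D array; values where mask is False are ignored.
    mask : boolean array, True where X is observed.
    lam  : shrinkage applied to the singular values (illustrative default).
    """
    Z = np.where(mask, X, 0.0)              # start with zeros in the gaps
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(Z, full_matrices=False)
        s = np.maximum(s - lam, 0.0)        # soft-threshold: proximal step for the nuclear norm
        low_rank = (U * s) @ Vt             # current low-rank estimate
        Z_new = np.where(mask, X, low_rank) # keep observed values, update only the gaps
        change = np.linalg.norm(Z_new - Z) / max(np.linalg.norm(Z), 1e-12)
        Z = Z_new
        if change < tol:
            break
    return Z

rng = np.random.default_rng(0)
# Synthetic rank-3 matrix standing in for 60 hourly samples at 22 stations.
truth = rng.normal(size=(60, 3)) @ rng.normal(size=(3, 22))
mask = rng.random(truth.shape) > 0.2        # roughly 20% of entries treated as missing
filled = soft_impute(truth, mask)
rmse = np.sqrt(np.mean((filled[~mask] - truth[~mask]) ** 2))
print(f"RMSE on simulated gaps: {rmse:.4f}")
```

A local variant in the spirit of LNNM could apply the same routine to a sub-matrix made of one sample and its nearest neighbours in `filled`, though how the neighbours are chosen is left open here.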

Handling Complex Missing Data Using Random Forest Approach for an Air Quality Monitoring Dataset: A Case Study of Kuwait Environmental Data (2012 to 2018)

International Journal of Environmental Research and Public Health ◽ 10.3390/ijerph18031333 ◽ 2021 ◽ Vol 18 (3) ◽ pp. 1333

Author(s): Ahmad R. Alsaber ◽ Jiazhu Pan ◽ Adeeba Al-Hurban

Keyword(s): Air Quality ◽ Missing Data ◽ Random Forest ◽ Missing Values ◽ Imputation Method ◽ Environmental Data ◽ Environmental Research ◽ Quality Data ◽ Data Set ◽ Air Quality Data

In environmental research, missing data are often a challenge for statistical modeling. This paper addresses advanced techniques for dealing with missing values in a data set measuring air quality using a multiple imputation (MI) approach. The MCAR, MAR, and NMAR missing data mechanisms are applied to the data set. Five missing data levels are considered: 5%, 10%, 20%, 30%, and 40%. The imputation method used in this paper is an iterative imputation method, missForest, which is related to the random forest approach. Air quality data sets were gathered from five monitoring stations in Kuwait and aggregated to a daily basis. A logarithm transformation was applied to all pollutant data in order to normalize their distributions and to minimize skewness. We found high levels of missing values for NO2 (18.4%), CO (18.5%), PM10 (57.4%), SO2 (19.0%), and O3 (18.2%) data. Climatological data (i.e., air temperature, relative humidity, wind direction, and wind speed) were used as control variables for better estimation. The results show that the MAR mechanism had the lowest RMSE and MAE. We conclude that MI using the missForest approach has a high level of accuracy in estimating missing values. MissForest had the lowest imputation error (RMSE and MAE) among the imputation methods compared and, thus, can be considered appropriate for analyzing air quality data.
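The evaluation protocol described above (delete known values at several rates, impute, then score with RMSE and MAE) can be sketched as follows. This is a minimal sketch under stated assumptions: only the MCAR mechanism is simulated, the data are synthetic log-normal values rather than the Kuwait measurements, and column-mean imputation is used as a simple stand-in for missForest. The helper names `mcar_mask`, `column_mean_impute`, and `rmse_mae` are hypothetical.

```python
import numpy as np

def mcar_mask(shape, rate, rng):
    """Boolean mask, True where a value is deleted completely at random."""
    return rng.random(shape) < rate

def column_mean_impute(X):
    """Replace NaNs with the mean of the observed values in the same column."""
    col_means = np.nanmean(X, axis=0)
    return np.where(np.isnan(X), col_means, X)

def rmse_mae(truth, imputed, deleted):
    """Score the imputed values only at the positions that were deleted."""
    err = imputed[deleted] - truth[deleted]
    return float(np.sqrt(np.mean(err ** 2))), float(np.mean(np.abs(err)))

rng = np.random.default_rng(42)
# Made-up skewed "daily pollutant" table: 365 days x 5 variables.
raw = rng.lognormal(mean=3.0, sigma=0.5, size=(365, 5))
truth = np.log(raw)                       # log transform to reduce skewness

scores = {}
for rate in (0.05, 0.10, 0.20, 0.30, 0.40):
    deleted = mcar_mask(truth.shape, rate, rng)
    incomplete = np.where(deleted, np.nan, truth)
    imputed = column_mean_impute(incomplete)
    scores[rate] = rmse_mae(truth, imputed, deleted)

for rate, (rmse, mae) in scores.items():
    print(f"{rate:.0%} missing: RMSE={rmse:.3f}, MAE={mae:.3f}")
```

Swapping `column_mean_impute` for any other imputer with the same input and output shape leaves the scoring loop unchanged, which is what makes this kind of comparison across methods straightforward.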

A Comparison of Various Imputation Methods for Missing Values in Air Quality Data

Sains Malaysiana ◽ 10.17576/jsm-2015-4403-17 ◽ 2015 ◽ Vol 44 (3) ◽ pp. 449-456 ◽ Cited By ~ 22

Author(s): Nuryazmin Ahmat Zainuri ◽ Abdul Aziz Jemain ◽ Nora Muda

Keyword(s): Air Quality ◽ Missing Values ◽ Quality Data ◽ Imputation Methods ◽ Air Quality Data

Missing data exploration in air quality data set using R-package data visualisation tools

Bulletin of Electrical Engineering and Informatics ◽ 10.11591/eei.v9i2.2088 ◽ 2020 ◽ Vol 9 (2) ◽ pp. 755-763

Author(s): Shamihah Muhammad Ghazali ◽ Norshahida Shaadan ◽ Zainura Idrus

Keyword(s): Air Quality ◽ Missing Data ◽ Missing Values ◽ Missing At Random ◽ Data Exploration ◽ Quality Data ◽ Gap Size ◽ Data Visualisation ◽ Data Set ◽ Air Quality Data

Missing values occur in data sets across many research areas. This is recognized as a data quality problem because missing values can affect the performance of analysis results. To overcome the problem, the incomplete data set needs to be treated or the missing values replaced using an imputation method. The pattern of missing values must therefore be explored beforehand to determine a suitable method. This paper discusses the application of data visualisation as a smart technique for missing data exploration, aiming to increase understanding of missing data behaviour, which includes the missing data mechanism (MCAR, MAR and MNAR) and the distribution pattern of missingness in terms of percentage as well as gap size. It presents the application of several data visualisation tools from five R packages (visdat, VIM, ggplot2, Amelia and UpSetR) for data missingness exploration. As an illustration, based on an air quality data set in Malaysia, several graphics were produced and discussed to show the contribution of the visualisation tools in providing input and insight on the pattern of data missingness. The results show that missing values in the air quality data set of the chosen sites in Malaysia behave as missing at random (MAR), with a small percentage of missingness, and do contain long gaps of missingness.
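Two of the quantities mentioned above, the percentage of missingness and the gap size, are easy to compute before any plotting. The sketch below uses Python with NumPy rather than the R packages named in the abstract, and the helper names `missing_percentage` and `gap_lengths` as well as the sample series are assumptions made for illustration.

```python
import numpy as np

def missing_percentage(x):
    """Share of NaN entries in a 1-D series, in percent."""
    return 100.0 * np.isnan(x).sum() / x.size

def gap_lengths(x):
    """Lengths of each run of consecutive NaNs, in order of appearance."""
    is_nan = np.isnan(x).astype(int)
    padded = np.concatenate(([0], is_nan, [0]))   # pad so runs at the edges are closed
    edges = np.diff(padded)
    starts = np.flatnonzero(edges == 1)           # index where each gap begins
    ends = np.flatnonzero(edges == -1)            # index just past where each gap ends
    return (ends - starts).tolist()

# Made-up hourly readings with three gaps of different sizes.
series = np.array([12.0, np.nan, np.nan, 15.0, 14.0, np.nan, 13.0, np.nan, np.nan, np.nan])
pct = missing_percentage(series)
gaps = gap_lengths(series)
print(pct, gaps)   # prints: 60.0 [2, 1, 3]
```

The list of gap lengths can then be summarised (maximum, histogram) to judge whether a series contains the long gaps that make simple interpolation unsuitable.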

Air quality data pre-processing: A novel algorithm to impute missing values in univariate time series

10.1109/ictai52525.2021.00159 ◽ 2021

Author(s): Lakmini Wijesekara ◽ Liwan Liyanage

Keyword(s): Time Series ◽ Air Quality ◽ Missing Values ◽ Quality Data ◽ Univariate Time Series ◽ Air Quality Data ◽ Novel Algorithm

Analysis of an air quality data set

Air Pollution ◽ 10.4324/9780203476024_chapter_7 ◽ 2010 ◽ pp. 308-323

Keyword(s): Air Quality ◽ Quality Data ◽ Data Set ◽ Air Quality Data

Analysis of an air quality data set

Air Pollution ◽ 10.4324/9780203476024-12 ◽ 2002 ◽ pp. 334-349

Keyword(s): Air Quality ◽ Quality Data ◽ Data Set ◽ Air Quality Data

Spectrum estimation with missing values: A regularized nuclear norm minimization approach

International Journal of Wavelets Multiresolution and Information Processing ◽ 10.1142/s0219691316500545 ◽ 2016 ◽ Vol 14 (06) ◽ pp. 1650054 ◽ Cited By ~ 3

Author(s): Hüseyin Akçay ◽ Semiha Türkay

Keyword(s): Missing Values ◽ Estimation Problem ◽ Nuclear Norm ◽ Identification Algorithm ◽ Spectrum Estimation ◽ Nuclear Norm Minimization ◽ Amplitude Noise ◽ Norm Minimization ◽ Uniform Grids ◽ Low Pass

In this paper, we consider estimation of a power spectrum from noise corrupted spectrum samples on uniform grids of frequencies with missing values. We propose two schemes based on the regularized nuclear norm minimization in combination with a recent subspace identification algorithm. The proposed schemes estimate the model order and the missing spectrum values in one step and are robust to large amplitude noise over short data records. Although this estimation problem can be cast as a spectrum estimation problem from nonuniformly spaced measurements and the algorithms developed for this type of data can be used, the identification example of this paper shows that the incomplete data formulation yields more accurate results. The properties of one of the proposed schemes are illustrated in an application example concerned with low-pass modeling of transformer current.

Comparative Study for Outlier Detection In Air Quality Data Set

International Journal of Emerging Trends in Engineering Research ◽ 10.30534/ijeter/2019/297112019 ◽ 2019 ◽ Vol 7 (11) ◽ pp. 584-592

Author(s): Devi Afriyantari Puspa Putri ◽

Keyword(s): Air Quality ◽ Comparative Study ◽ Outlier Detection ◽ Quality Data ◽ Data Set ◽ Air Quality Data

A Comparative Study of Several EOF Based Imputation Methods for Long Gap Missing Values in a Single-Site Temporal Time Dependent (SSTTD) Air Quality (PM10) Dataset

Pertanika Journal of Science and Technology ◽ 10.47836/pjst.29.4.21 ◽ 2021 ◽ Vol 29 (4) ◽

Author(s): Shamihah Muhammad Ghazali ◽ Norshahida Shaadan ◽ Zainura Idrus

Keyword(s): Air Quality ◽ Missing Values ◽ Single Site ◽ Time Dependent ◽ Environmental Research ◽ Data Set ◽ Imputation Methods ◽ Long Gap ◽ Expectation Maximisation ◽ Gap Sequences

Missing values are often a major problem in many scientific fields of environmental research, leading to prediction inaccuracy and biased analysis results. This study compares the performance of an existing Empirical Orthogonal Functions (EOF) based imputation method, the EOF mean-centred approach (EOF-mean), with several proposed EOF-based methods, which include EOF-median, EOF-trimmean and the newly applied Regularised Expectation-Maximisation Principal Component Analysis based method (R-EMPCA), in estimating missing values for the problem of long gap sequences of missing values in a Single Site Temporal Time-Dependent (SSTTD) multivariate structure air quality (PM10) data set. The study was conducted using a real PM10 data set from the Klang air quality monitoring station. Performance assessment and evaluation of the methods were conducted via a simulation plan carried out according to four percentages (5, 10, 20 and 30) of missing values with respect to several long gap sequences (12, 24, 168 and 720) of missing points (hours). Based on several performance indicators such as RMSE, MAE, R-Square and AI, the results show that R-EMPCA outperformed the other methods. The results also show that the proposed EOF-median and EOF-trimmean perform better than the existing EOF-mean method, with EOF-trimmean the best among the three. The methodology and findings of this study contribute a solution to the problem of missing values with long gap sequences for the SSTTD data set.
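To make the EOF-mean idea concrete, the following sketch fills gaps by centring a "days x hours" matrix on its observed column means and iterating a truncated-SVD reconstruction over the missing entries only. It is a generic illustration and not the procedure evaluated in the study: the function name `eof_impute`, the number of retained modes, the synthetic PM10-like data, and the 12-hour gap layout are all assumptions.

```python
import numpy as np

def eof_impute(X, n_modes=1, n_iter=200, tol=1e-8):
    """Fill NaNs using a truncated-SVD (EOF) reconstruction, iterated to convergence."""
    missing = np.isnan(X)
    col_mean = np.nanmean(X, axis=0)              # centre with the observed column means
    A = np.where(missing, 0.0, X - col_mean)      # gaps start at the mean (zero anomaly)
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(A, full_matrices=False)
        recon = (U[:, :n_modes] * s[:n_modes]) @ Vt[:n_modes]
        delta = np.max(np.abs(recon[missing] - A[missing])) if missing.any() else 0.0
        A[missing] = recon[missing]               # only the gaps are updated
        if delta < tol:
            break
    return A + col_mean

rng = np.random.default_rng(1)
hours = np.arange(24)
diurnal = 1.0 + 0.5 * np.sin(2 * np.pi * hours / 24)     # made-up daily cycle
level = rng.uniform(20, 80, size=(120, 1))                # made-up daily concentration level
truth = level * diurnal + rng.normal(scale=0.5, size=(120, 24))

X = truth.copy()
for day in rng.choice(120, size=20, replace=False):
    start = rng.integers(0, 13)
    X[day, start:start + 12] = np.nan                     # one 12-hour gap on each chosen day

gap = np.isnan(X)
filled = eof_impute(X, n_modes=1)
rmse = np.sqrt(np.mean((filled[gap] - truth[gap]) ** 2))
print(f"RMSE on simulated 12-hour gaps: {rmse:.3f}")
```

Replacing `np.nanmean` with `np.nanmedian` in the centring step would give a median-centred variant of the same loop, which is one plausible reading of the EOF-median idea.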

Temperature Estimation with Time Series Analysis from Air Quality Data Set

2019 7th International Symposium on Digital Forensics and Security (ISDFS) ◽ 10.1109/isdfs.2019.8757524 ◽ 2019

Author(s): Zeynep OZPOLAT ◽ Murat KARABATAK

Keyword(s): Time Series ◽ Air Quality ◽ Time Series Analysis ◽ Quality Data ◽ Temperature Estimation ◽ Data Set ◽ Series Analysis ◽ Air Quality Data
