A Statistical Model for Automated Quality Assessment of the TOAR-II

The Tropospheric Ozone Assessment Report, phase 2 (TOAR-II) database is a global collection of in-situ ground-level ozone measurements. It also holds data on selected ozone precursors and meteorological variables. TOAR-II assembles air quality data from many different sources and thus requires a common data quality assessment (QA) to ensure that the data meet the quality required for globally consistent analyses. The large volume of this database (more than 100,000 data series) necessitates automated, data-driven QA procedures. Accordingly, we have developed a statistical model for automated QA. This model consists of several statistical tests that are classified into sub-groups. In this model, a QA score (an indicator ranging from 0 to 1) is assigned to each individual data point to estimate the value's plausibility. The concept is founded on statistical hypothesis testing and probability theory. The model is implemented in a Python package called AutoQA4Env. One application of AutoQA4Env is the data ingestion workflow of TOAR-II: the tool generates a data quality report, which is then sent back to the data provider for inspection. Because AutoQA4Env is easily configurable, users can set quality thresholds and thus filter data according to their use case. While we primarily develop AutoQA4Env for air quality data, the same concept and model might be applicable to other databases, and the software framework is flexible enough to allow for other use cases.
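The idea of a per-point QA score with a user-defined threshold can be illustrated with a minimal sketch. This is not the AutoQA4Env API: the function names, the two tests chosen (a plausible-range check and a two-sided normal p-value), the range limits, and the sample values are all assumptions made for illustration. Multiplying the test scores additionally assumes that the tests are independent.

```python
import math
from statistics import mean, stdev


def range_score(value, lower, upper):
    """Hypothetical test 1: 1.0 inside the plausible range, 0.0 outside."""
    return 1.0 if lower <= value <= upper else 0.0


def outlier_score(value, series):
    """Hypothetical test 2: two-sided p-value of the value under a
    normal model fitted to the whole series."""
    mu, sigma = mean(series), stdev(series)
    if sigma == 0:
        return 1.0 if value == mu else 0.0
    z = abs(value - mu) / sigma
    # P(|Z| >= z) for a standard normal variable
    return math.erfc(z / math.sqrt(2))


def qa_scores(series, lower=0.0, upper=500.0):
    """Combine both test scores into one score in [0, 1] per data point.
    The product is an assumption; other combination rules are possible."""
    return [range_score(v, lower, upper) * outlier_score(v, series)
            for v in series]


def filter_by_threshold(series, scores, threshold):
    """Keep only the values whose QA score reaches the chosen threshold."""
    return [v for v, s in zip(series, scores) if s >= threshold]


# Made-up ozone readings (ppb) with one spike and one negative value.
ozone_ppb = [31.0, 29.5, 33.2, 30.8, 28.9, 32.1, 30.2, 95.0, -4.0, 31.5]
scores = qa_scores(ozone_ppb)
kept = filter_by_threshold(ozone_ppb, scores, threshold=0.05)
print(kept)
```

With a threshold of 0.05, the spike and the negative reading are filtered out, while a stricter or looser threshold would change which points survive. That is the kind of use-case-dependent filtering the abstract describes.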

Air quality data series estimation based on machine learning approaches for urban environments

Air Quality Atmosphere & Health ◽

10.1007/s11869-020-00925-4 ◽

2020 ◽

Author(s):

Alireza Rahimpour ◽

Jamil Amanollahi ◽

Chris G. Tzanis

Keyword(s):

Machine Learning ◽

Air Quality ◽

Urban Environments ◽

Quality Data ◽

Data Series ◽

Learning Approaches ◽

Series Estimation ◽

Air Quality Data

Tackling Data Quality When Using Low-Cost Air Quality Sensors in Citizen Science Projects

Frontiers in Environmental Science ◽

10.3389/fenvs.2021.733634 ◽

2021 ◽

Vol 9 ◽

Author(s):

Ågot K. Watne ◽

Jenny Linden ◽

Jens Willhelmsson ◽

Håkan Fridén ◽

Malin Gustafsson ◽

...

Keyword(s):

Air Quality ◽

Data Quality ◽

Citizen Science ◽

Low Cost ◽

Quality Data ◽

High Data ◽

Science Projects ◽

Air Quality Measurements ◽

Air Quality Data ◽

Intended Use

Using low-cost air quality sensors (LCS) in citizen science projects opens many possibilities. LCS give citizens an opportunity to collect and contribute their own air quality data. However, low data quality is often an issue when using LCS, and with it comes a risk of unrealistic expectations of a higher degree of empowerment than is possible. If the data quality and the intended use of the data are not harmonized, conclusions may be drawn on the wrong basis and data can be rendered unusable. Ensuring high data quality is demanding in terms of labor and resources. The expertise, sensor performance assessment, post-processing, and general workload required depend strongly on the purpose and intended use of the air quality data. It is therefore a balancing act to ensure that the data quality is high enough for the specific purpose while minimizing the validation effort. The aim of this perspective paper is to increase awareness of data quality issues and to provide strategies for minimizing labor intensity and expenses while maintaining adequate QA/QC for robust applications of LCS in citizen science projects. We believe that air quality measurements performed by citizens can be better utilized with increased awareness of data quality and measurement requirements, in combination with improved metadata collection. Well-documented metadata not only increase the value and usefulness of the data for the actors collecting them, but are also the foundation for assessing whether data collected by citizens can be integrated in a broader perspective.

Assessment & Forecast of Air Quality data based on Neural Network

International Journal of Pharmaceutical Research ◽

10.31838/ijpr/2020.12.01.231 ◽

2020 ◽

Vol 12 (01) ◽

Keyword(s):

Neural Network ◽

Air Quality ◽

Quality Data ◽

Air Quality Data

Making Air Quality Data Meaningful

Proceedings of the 2020 ACM Designing Interactive Systems Conference ◽

10.1145/3357236.3395517 ◽

2020 ◽

Author(s):

Szu-Yu (Cyn) Liu ◽

Justin Cranshaw ◽

Asta Roseway

Keyword(s):

Air Quality ◽

Quality Data ◽

Air Quality Data

Feature Selection and Analysis in Air Quality Data

2021 11th International Conference on Cloud Computing, Data Science & Engineering (Confluence) ◽

10.1109/confluence51648.2021.9376882 ◽

2021 ◽

Author(s):

Manish Mahajan ◽

Santosh Kumar ◽

Bhasker Pant ◽

Umesh Kumar Tiwari ◽

Rijwan Khan

Keyword(s):

Feature Selection ◽

Air Quality ◽

Quality Data ◽

Air Quality Data

Handling Complex Missing Data Using Random Forest Approach for an Air Quality Monitoring Dataset: A Case Study of Kuwait Environmental Data (2012 to 2018)

International Journal of Environmental Research and Public Health ◽

10.3390/ijerph18031333 ◽

2021 ◽

Vol 18 (3) ◽

pp. 1333

Author(s):

Ahmad R. Alsaber ◽

Jiazhu Pan ◽

Adeeba Al-Hurban

Keyword(s):

Air Quality ◽

Missing Data ◽

Random Forest ◽

Missing Values ◽

Imputation Method ◽

Environmental Data ◽

Environmental Research ◽

Quality Data ◽

Data Set ◽

Air Quality Data

In environmental research, missing data are often a challenge for statistical modeling. This paper addresses advanced techniques for dealing with missing values in an air quality data set using a multiple imputation (MI) approach. Three missing-data mechanisms are applied to the data set: missing completely at random (MCAR), missing at random (MAR), and not missing at random (NMAR). Five missing data levels are considered: 5%, 10%, 20%, 30%, and 40%. The imputation method used is missForest, an iterative imputation method based on the random forest approach. Air quality data were gathered from five monitoring stations in Kuwait and aggregated to a daily basis. A logarithmic transformation was applied to all pollutant data to normalize their distributions and minimize skewness. We found high levels of missing values for NO2 (18.4%), CO (18.5%), PM10 (57.4%), SO2 (19.0%), and O3 (18.2%) data. Climatological data (i.e., air temperature, relative humidity, wind direction, and wind speed) were used as control variables for better estimation. The results show that the MAR technique had the lowest RMSE and MAE. We conclude that MI using the missForest approach has a high level of accuracy in estimating missing values. MissForest had the lowest imputation error (RMSE and MAE) among the imputation methods compared and can thus be considered appropriate for analyzing air quality data.
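The evaluation protocol summarized above (hide a known fraction of values, impute them, and score the imputation with RMSE and MAE) can be sketched as follows. Several things here are assumptions and not the authors' setup: the data are synthetic, only the MCAR mechanism is simulated, and simple mean imputation is used as a stand-in for missForest so that the example needs nothing beyond the Python standard library. All function names are hypothetical.

```python
import math
import random


def mask_mcar(values, fraction, rng):
    """Hide the given fraction of values completely at random (MCAR)."""
    n_missing = round(len(values) * fraction)
    hidden = set(rng.sample(range(len(values)), n_missing))
    masked = [None if i in hidden else v for i, v in enumerate(values)]
    return masked, hidden


def impute_mean(masked):
    """Stand-in imputer (NOT missForest): fill gaps with the observed mean."""
    observed = [v for v in masked if v is not None]
    fill = sum(observed) / len(observed)
    return [fill if v is None else v for v in masked]


def rmse_mae(truth, imputed, hidden):
    """Imputation error, computed only on the positions that were hidden."""
    errors = [imputed[i] - truth[i] for i in sorted(hidden)]
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    mae = sum(abs(e) for e in errors) / len(errors)
    return rmse, mae


rng = random.Random(0)
# Synthetic, right-skewed "daily pollutant" values, for illustration only.
daily_values = [rng.lognormvariate(4.0, 0.5) for _ in range(200)]
# Log transform to reduce skewness, as described in the abstract.
log_values = [math.log(v) for v in daily_values]

results = {}
for level in (0.05, 0.10, 0.20, 0.30, 0.40):
    masked, hidden = mask_mcar(log_values, level, rng)
    imputed = impute_mean(masked)
    results[level] = rmse_mae(log_values, imputed, hidden)

for level, (rmse, mae) in results.items():
    print(f"missing {level:.0%}: RMSE={rmse:.3f}  MAE={mae:.3f}")
```

Swapping `impute_mean` for a random-forest-based iterative imputer, and adding MAR and NMAR masking functions, would bring the sketch closer to the comparison the paper reports.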
