A Statistical Model for Automated Quality Assessment of the TOAR-II

Author(s):  
Najmeh Kaffashzadeh ◽  
Kai-Lan Chang ◽  
Sabine Schröder ◽  
Martin G. Schultz

<p>The Tropospheric Ozone Assessment Report, phase 2 (TOAR-II) database is a collection of global in-situ measurements of ground-level ozone from various locations. It also holds data on selected ozone precursors and meteorological variables. TOAR-II assembles air quality data from many different sources and thus requires a common data quality assessment (QA) to ensure that the data meet the quality required for globally consistent analyses. The large volume of this database (more than 100,000 data series) necessitates the use of automated, data-driven QA procedures.</p><p>Accordingly, we have developed a statistical model for automated QA. This model consists of several statistical tests that are classified into several sub-groups. In this model, a QA-score (an indicator ranging from 0 to 1) is assigned to each individual data point to estimate the value's plausibility. The foundation of this concept is statistical hypothesis testing and probability theory. This model was implemented in a Python package called AutoQA4Env.</p><p>One application of AutoQA4Env is the data ingestion workflow of TOAR-II. The tool generates a data quality report, which is then sent back to the data provider for inspection. Since AutoQA4Env is easily configurable, it allows users to set quality thresholds and thus filter data according to their use case. While we primarily develop AutoQA4Env for air quality data, the same concept and model might be applicable to other databases, and the software framework is flexible enough to allow for other use cases.</p>
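The idea of a per-point QA-score in [0, 1] with a user-defined threshold can be illustrated with a minimal sketch. This is not the AutoQA4Env API or its actual test suite: the function names `qa_scores` and `filter_by_threshold`, the choice of a single robust z-score test, and the sample values are all assumptions made for illustration. The score here is the two-sided p-value of each value under a normal null hypothesis centred on the series median.

```python
import math
import statistics


def qa_scores(values, scale_floor=1e-9):
    """Return a plausibility score in [0, 1] for each value (hypothetical test).

    The score is the two-sided p-value of a robust z-score, using the median
    as the centre and the scaled median absolute deviation (MAD) as the spread.
    """
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    # 1.4826 rescales the MAD to match the standard deviation of a normal
    # distribution; the floor avoids division by zero for constant series.
    scale = max(1.4826 * mad, scale_floor)
    scores = []
    for v in values:
        z = abs(v - median) / scale
        # P(|Z| > z) for a standard normal variable Z.
        scores.append(math.erfc(z / math.sqrt(2)))
    return scores


def filter_by_threshold(values, scores, threshold):
    """Keep only the values whose score reaches the chosen quality threshold."""
    return [v for v, s in zip(values, scores) if s >= threshold]


# Made-up hourly ozone values (ppb) with one implausible spike.
ozone = [30, 32, 31, 29, 33, 30, 250]
scores = qa_scores(ozone)
kept = filter_by_threshold(ozone, scores, threshold=0.05)
print(kept)
```

A stricter or looser threshold changes which points survive, which mirrors the use-case-dependent filtering described in the abstract. A real QA model would combine several tests rather than rely on one.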

2021 ◽  
Vol 9 ◽  
Author(s):  
Ågot K. Watne ◽  
Jenny Linden ◽  
Jens Willhelmsson ◽  
Håkan Fridén ◽  
Malin Gustafsson ◽  
...  

Using low-cost air quality sensors (LCS) in citizen science projects opens many possibilities. LCS give citizens an opportunity to collect and contribute their own air quality data. However, low data quality is often an issue when using LCS, and with it comes a risk of unrealistic expectations of a higher degree of empowerment than is possible. If the data quality and the intended use of the data are not harmonized, conclusions may be drawn on the wrong basis and data can be rendered unusable. Ensuring high data quality is demanding in terms of labor and resources. The expertise, sensor performance assessment, post-processing, and general workload required will depend strongly on the purpose and intended use of the air quality data. It is therefore a balancing act to ensure that the data quality is high enough for the specific purpose while minimizing the validation effort. The aim of this perspective paper is to increase awareness of data quality issues and provide strategies for minimizing labor intensity and expenses while maintaining adequate QA/QC for robust applications of LCS in citizen science projects. We believe that air quality measurements performed by citizens can be better utilized with increased awareness of data quality and measurement requirements, in combination with improved metadata collection. Well-documented metadata not only increase the value and usefulness of the data for the actors collecting them, but are also the foundation for assessing the potential integration of citizen-collected data in a broader perspective.


Author(s):  
Ahmad R. Alsaber ◽  
Jiazhu Pan ◽  
Adeeba Al-Hurban 

In environmental research, missing data are often a challenge for statistical modeling. This paper addresses some advanced techniques for dealing with missing values in an air quality data set using a multiple imputation (MI) approach. MCAR, MAR, and NMAR missing data techniques are applied to the data set. Five missing data levels are considered: 5%, 10%, 20%, 30%, and 40%. The imputation method used in this paper is an iterative imputation method, missForest, which is related to the random forest approach. Air quality data sets were gathered from five monitoring stations in Kuwait and aggregated to a daily basis. A logarithm transformation was applied to all pollutant data in order to normalize their distributions and to minimize skewness. We found high levels of missing values for NO2 (18.4%), CO (18.5%), PM10 (57.4%), SO2 (19.0%), and O3 (18.2%) data. Climatological data (i.e., air temperature, relative humidity, wind direction, and wind speed) were used as control variables for better estimation. The results show that the MAR technique had the lowest RMSE and MAE. We conclude that MI using the missForest approach has a high level of accuracy in estimating missing values. MissForest had the lowest imputation error (RMSE and MAE) among the imputation methods compared and, thus, can be considered appropriate for analyzing air quality data.
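The evaluation scheme described above (remove a known fraction of values, impute them, then compare against the truth with RMSE and MAE) can be sketched as follows. This is an illustration under stated assumptions, not the paper's code: it covers only the MCAR case, it uses simple mean imputation as a stand-in for missForest, and the helper names `mask_mcar`, `impute_mean`, and `rmse_mae` as well as the data are hypothetical.

```python
import math
import random


def mask_mcar(values, fraction, seed=0):
    """Hide a fraction of values completely at random (MCAR).

    Returns the masked series (None marks a missing value) and the sorted
    indices of the hidden positions.
    """
    rng = random.Random(seed)
    n_missing = round(fraction * len(values))
    missing_idx = set(rng.sample(range(len(values)), n_missing))
    masked = [None if i in missing_idx else v for i, v in enumerate(values)]
    return masked, sorted(missing_idx)


def impute_mean(masked):
    """Fill missing entries with the mean of the observed ones.

    Stand-in imputer for illustration only; it is not missForest.
    """
    observed = [v for v in masked if v is not None]
    fill = sum(observed) / len(observed)
    return [fill if v is None else v for v in masked]


def rmse_mae(truth, imputed, missing_idx):
    """Compute RMSE and MAE over the positions that were hidden."""
    errors = [imputed[i] - truth[i] for i in missing_idx]
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    mae = sum(abs(e) for e in errors) / len(errors)
    return rmse, mae


# Made-up daily concentrations; 20% of them are hidden and then imputed.
truth = [float(i) for i in range(1, 21)]
masked, missing_idx = mask_mcar(truth, fraction=0.2)
imputed = impute_mean(masked)
rmse, mae = rmse_mae(truth, imputed, missing_idx)
print(f"RMSE={rmse:.3f}, MAE={mae:.3f}")
```

Repeating this loop for each missing data level (5% to 40%) and for each candidate imputer gives the kind of error comparison the abstract reports.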


2021 ◽  
Vol 138 ◽  
pp. 104976
Author(s):  
Juan José Díaz ◽  
Ivan Mura ◽  
Juan Felipe Franco ◽  
Raha Akhavan-Tabatabaei
