AQ-Bench: A Benchmark Dataset for Machine Learning on Global Air Quality Metrics

2021 ◽  
Author(s):  
Clara Betancourt ◽  
Timo Stomberg ◽  
Scarlet Stadtler ◽  
Ribana Roscher ◽  
Martin G. Schultz

Abstract. With the AQ-Bench dataset, we contribute to the recent developments towards shared data usage and machine learning methods in the field of environmental science. The dataset presented here enables researchers to relate global air quality metrics to easy-access metadata and to explore different machine learning methods for obtaining estimates of air quality based on this metadata. AQ-Bench contains a unique collection of aggregated air quality data from the years 2010–2014 and metadata at more than 5500 air quality monitoring stations all over the world, provided by the first Tropospheric Ozone Assessment Report (TOAR). It focuses in particular on metrics of tropospheric ozone, which has a detrimental effect on climate, human morbidity and mortality, as well as crop yields. We validate these data as a machine learning benchmark by providing a well-defined task together with a suitable evaluation metric. Baseline scores obtained from a linear regression method, a fully connected neural network and random forest are provided for reference. AQ-Bench offers a low-threshold entrance for all machine learners with an interest in environmental science and for atmospheric scientists who are interested in applying machine learning techniques. It enables them to start with a real-world problem relevant to humans and nature. The dataset and introductory machine learning code are available at https://doi.org/10.23728/b2share.30d42b5a87344e82855a486bf2123e9f (Betancourt et al., 2020) and https://gitlab.version.fz-juelich.de/toar/ozone-mapping . AQ-Bench thus provides a blueprint for environmental benchmark datasets as well as an example for data re-use according to the FAIR principles.

2021 ◽  
Vol 13 (6) ◽  
pp. 3013-3033
Author(s):  
Clara Betancourt ◽  
Timo Stomberg ◽  
Ribana Roscher ◽  
Martin G. Schultz ◽  
Scarlet Stadtler

Abstract. With the AQ-Bench dataset, we contribute to the recent developments towards shared data usage and machine learning methods in the field of environmental science. The dataset presented here enables researchers to relate global air quality metrics to easy-access metadata and to explore different machine learning methods for obtaining estimates of air quality based on this metadata. AQ-Bench contains a unique collection of aggregated air quality data from the years 2010–2014 and metadata at more than 5500 air quality monitoring stations all over the world, provided by the first Tropospheric Ozone Assessment Report (TOAR). It focuses in particular on metrics of tropospheric ozone, which has a detrimental effect on climate, human morbidity and mortality, as well as crop yields. The purpose of this dataset is to produce estimates of various long-term ozone metrics based on time-independent local site conditions. We combine this task with a suitable evaluation metric. Baseline scores obtained from a linear regression method, a fully connected neural network and random forest are provided for reference and validation. AQ-Bench offers a low-threshold entrance for all machine learners with an interest in environmental science and for atmospheric scientists who are interested in applying machine learning techniques. It enables them to start with a real-world problem relevant to humans and nature. The dataset and introductory machine learning code are available at https://doi.org/10.23728/b2share.30d42b5a87344e82855a486bf2123e9f (Betancourt et al., 2020) and https://gitlab.version.fz-juelich.de/esde/machine-learning/aq-bench (Betancourt et al., 2021). AQ-Bench thus provides a blueprint for environmental benchmark datasets as well as an example for data re-use according to the FAIR principles.
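The benchmark task described above (estimating a long-term ozone metric from time-independent station metadata and scoring against a held-out set) can be illustrated with a minimal sketch. It assumes a single numeric feature, an ordinary least-squares line as a stand-in for the linear regression baseline, and the coefficient of determination (R²) as the evaluation metric; the abstract does not name the metric, and all numbers below are made-up toy values, not AQ-Bench data.

```python
# Minimal sketch of a "fit baseline, score on held-out stations" workflow.
# Assumptions (not from the abstract): one feature, R^2 as the metric,
# and invented toy values in place of real station metadata and ozone metrics.

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical training and test stations: (feature value, ozone metric).
train_x, train_y = [0.1, 0.4, 0.5, 0.9, 1.2], [24.0, 29.5, 30.0, 37.0, 41.5]
test_x, test_y = [0.3, 0.8, 1.0], [27.0, 35.5, 39.0]

slope, intercept = fit_line(train_x, train_y)
predictions = [slope * x + intercept for x in test_x]
print("held-out R^2:", round(r2_score(test_y, predictions), 3))
```

Scoring on stations that were not used for fitting is what makes baseline numbers comparable across methods; the neural network and random forest baselines would slot in where `fit_line` is used here.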


2019 ◽  
Vol 11 (12) ◽  
pp. 1440 ◽  
Author(s):  
Qiangqiang Yuan ◽  
Shuwen Li ◽  
Linwei Yue ◽  
Tongwen Li ◽  
Huanfeng Shen ◽  
...  

Vegetation water content (VWC) is recognized as an important parameter in vegetation growth studies, natural disasters such as forest fires, and drought prediction. Recently, the Global Navigation Satellite System Interferometric Reflectometry (GNSS-IR) has emerged as an important technique for monitoring vegetation information. The normalized microwave reflection index (NMRI) was developed to reflect the change of VWC based on this fact. However, NMRI uses local site-based data, and the sparse distribution of sites hinders its application. In this study, we obtained a 500 m spatially continuous NMRI product by integrating GNSS-IR site data with other VWC-related products using the point–surface fusion technique. The auxiliary data in the fusion process include the normalized difference vegetation index (NDVI), gross primary productivity (GPP), and precipitation. Meanwhile, the fusion performance of three machine learning methods, i.e., the back-propagation neural network (BPNN), generalized regression neural network (GRNN), and random forest (RF), is compared and analyzed. The machine learning methods achieve satisfactory results, with cross-validation R values of 0.71–0.83 and RMSEs of 0.025–0.037. The results show a clear improvement over the traditional multiple linear regression method, which achieves R (RMSE) values of only about 0.4 (0.045). This indicates that the machine learning methods can better learn the complex nonlinear relationship between NMRI and the input VWC-related indices. Among the machine learning methods, the RF model obtained the best results. Long time-series NMRI images with a 500 m spatial resolution in the western part of the continental U.S. were then obtained. The results show that the spatial distribution of the NMRI product is consistent with the drought situation from 2012 to 2014 in the U.S., which verifies the feasibility of analyzing and predicting drought times and distribution ranges by using the 500 m fusion product.
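For readers unfamiliar with the two scores reported above, the following sketch shows how a correlation coefficient R and an RMSE are commonly computed from paired observed and predicted values. It assumes R denotes the Pearson correlation coefficient, which the abstract does not state explicitly, and the sample values are invented for illustration.

```python
import math


def pearson_r(observed, predicted):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_o * var_p)


def rmse(observed, predicted):
    """Root-mean-square error between observations and predictions."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


# Invented values standing in for held-out NMRI observations and model output.
obs = [0.10, 0.14, 0.09, 0.20, 0.17]
pred = [0.12, 0.13, 0.11, 0.18, 0.15]
print("R =", round(pearson_r(obs, pred), 3), "RMSE =", round(rmse(obs, pred), 4))
```

In a cross-validation setting, these two functions would be applied to the predictions collected over all held-out folds.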


2016 ◽  
Vol 10 (2) ◽  
pp. 195-211 ◽  
Author(s):  
Huiping Peng ◽  
Aranildo R. Lima ◽  
Andrew Teakles ◽  
Jian Jin ◽  
Alex J. Cannon ◽  
...  

Author(s):  
Bo Liu ◽  
Chao Shi ◽  
Jianqiang Li ◽  
Yong Li ◽  
Jianlei Lang ◽  
...  

2020 ◽  
Vol 2 (2) ◽  
pp. 021005
Author(s):  
Limin Feng ◽  
Ting Yang ◽  
Dawei Wang ◽  
Zifa Wang ◽  
Yuepeng Pan ◽  
...  

2019 ◽  
Vol 252 ◽  
pp. 03009 ◽  
Author(s):  
Tomasz Cieplak ◽  
Tomasz Rymarczyk ◽  
Robert Tomaszewski

This paper presents a concept for the design of an air quality monitoring system and describes a selection of data quality analysis methods. A high level of industrialisation increases the risk of natural disasters related to environmental pollution, e.g. air pollution by gases and clouds of dust (carbon monoxide, sulphur oxides, nitrogen oxides). That is why research related to the monitoring of this type of phenomena is extremely important. Low-cost air quality sensors are increasingly used to monitor air parameters in urban areas. These types of sensors are used to obtain an image of the spatiotemporal variability in the concentration of air pollutants. Aside from their low price, which is important from the point of view of economic accessibility, low-cost sensors are prone to produce erroneous results compared to professional air quality monitors. The described study focuses on the analysis of outliers as particularly interesting for further analysis, as well as modelling with machine learning methods for air quality assessment in the city of Lublin.
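The abstract does not say which outlier analysis the authors use. As one common possibility, the sketch below flags sensor readings with the interquartile-range (Tukey) rule; the choice of rule, the factor `k = 1.5`, and the readings are assumptions made for illustration only.

```python
import statistics


def iqr_outliers(readings, k=1.5):
    """Return indices of readings outside [Q1 - k*IQR, Q3 + k*IQR].

    The IQR rule is an assumed stand-in; the paper's method is not stated.
    """
    q1, _, q3 = statistics.quantiles(readings, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [i for i, value in enumerate(readings) if value < low or value > high]


# Hypothetical pollutant readings from one low-cost sensor, with one spike.
readings = [10, 11, 12, 11, 10, 12, 11, 95]
print("suspect indices:", iqr_outliers(readings))
```

Flagged readings are candidates for inspection, not automatic deletion: a spike can be a sensor fault or a real pollution event.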


2021 ◽  
Author(s):  
Clara Betancourt ◽  
Scarlet Stadtler ◽  
Timo Stomberg ◽  
Ann-Kathrin Edrich ◽  
Ankit Patnala ◽  
...  

Through the availability of multi-year ground-based ozone observations on a global scale, substantial geospatial metadata, and high-performance computing capacities, it is now possible to use machine learning for a global data-driven ozone assessment. In this presentation, we will show a novel, completely data-driven approach to map tropospheric ozone globally.

Our goal is to interpolate ozone metrics and aggregated statistics from the database of the Tropospheric Ozone Assessment Report (TOAR) onto a global 0.1° x 0.1° resolution grid. It is challenging to interpolate ozone, a toxic greenhouse gas, because its formation depends on many interconnected environmental factors on small scales. We conduct the interpolation with various machine learning methods trained on aggregated hourly ozone data from five years at more than 5500 locations worldwide. We use several geospatial datasets as training inputs to provide proxy input for environmental factors controlling ozone formation, such as precursor emissions and climate. The resulting maps contain different ozone metrics, i.e. statistical aggregations which are widely used to assess air pollution impacts on health, vegetation, and climate.

The key aspects of this contribution are twofold: First, we apply explainable machine learning methods to the data-driven ozone assessment. Second, we discuss dominant uncertainties relevant to the ozone mapping and quantify their impact whenever possible. Our methods include a thorough a-priori uncertainty estimation of the various data and methods, assessment of scientific consistency, finding critical model parameters, using ensemble methods, and performing error modeling.

Our work aims to increase the reliability and integrity of the derived ozone maps through the provision of scientific robustness to a data-centric machine learning task. This study hence represents a blueprint for how to formulate an environmental machine learning task scientifically, gather the necessary data, and develop a data-driven workflow that focuses on optimizing transparency and applicability of its product to maximize its scientific knowledge return.

