Machine learning for air quality prediction using meteorological and traffic related features

2020 ◽  
Vol 12 (5) ◽  
pp. 379-391
Author(s):  
Ihsane Gryech ◽  
Mounir Ghogho ◽  
Hajar Elhammouti ◽  
Nada Sbihi ◽  
Abdellatif Kobbane

The presence of pollutants in the air has a direct impact on our health and causes detrimental changes to our environment. Air quality monitoring is therefore of paramount importance. The high cost of the acquisition and maintenance of accurate air quality stations implies that only a small number of these stations can be deployed in a country. To improve the spatial resolution of the air monitoring process, an interesting idea is to develop data-driven models to predict air quality based on readily available data. In this paper, we investigate the correlations between air pollutants concentrations and meteorological and road traffic data. Using machine learning, regression models are developed to predict pollutants concentration. Both linear and non-linear models are investigated in this paper. It is shown that non-linear models, namely Random Forest (RF) and Support Vector Regression (SVR), better describe the impact of traffic flows and meteorology on the concentrations of pollutants in the atmosphere. It is also shown that more accurate prediction models can be obtained when including some pollutants’ concentration as predictors. This may be used to infer the concentrations of some pollutants using those of other pollutants, thereby reducing the number of air pollution sensors.

2020 ◽  
Vol 10 (24) ◽  
pp. 9151
Author(s):  
Yun-Chia Liang ◽  
Yona Maimury ◽  
Angela Hsiang-Ling Chen ◽  
Josue Rodolfo Cuevas Juarez

Air, an essential natural resource, has been compromised in terms of quality by economic activities. Considerable research has been devoted to predicting instances of poor air quality, but most studies are limited by insufficient longitudinal data, making it difficult to account for seasonal and other factors. Several prediction models have been developed using an 11-year dataset collected by Taiwan’s Environmental Protection Administration (EPA). Machine learning methods, including adaptive boosting (AdaBoost), artificial neural network (ANN), random forest, stacking ensemble, and support vector machine (SVM), produce promising results for air quality index (AQI) level predictions. A series of experiments, using datasets for three different regions to obtain the best prediction performance from the stacking ensemble, AdaBoost, and random forest, found the stacking ensemble delivers consistently superior performance for R2 and RMSE, while AdaBoost provides best results for MAE.


2021 ◽  
Vol 11 (13) ◽  
pp. 6030
Author(s):  
Daljeet Singh ◽  
Antonella B. Francavilla ◽  
Simona Mancini ◽  
Claudio Guarnaccia

A vehicular road traffic noise prediction methodology based on machine learning techniques has been presented. The road traffic parameters that have been considered are traffic volume, percentage of heavy vehicles, honking occurrences and the equivalent continuous sound pressure level. Leq A method to include the honking effect in the traffic noise prediction has been illustrated. The techniques that have been used for the prediction of traffic noise are decision trees, random forests, generalized linear models and artificial neural networks. The results obtained by using these methods have been compared on the basis of mean square error, correlation coefficient, coefficient of determination and accuracy. It has been observed that honking is an important parameter and contributes to the overall traffic noise, especially in congested Indian road traffic conditions. The effects of honking noise on the human health cannot be ignored and it should be included as a parameter in the future traffic noise prediction models.


2021 ◽  
Author(s):  
Sebastian Johannes Fritsch ◽  
Konstantin Sharafutdinov ◽  
Moein Einollahzadeh Samadi ◽  
Gernot Marx ◽  
Andreas Schuppert ◽  
...  

BACKGROUND During the course of the COVID-19 pandemic, a variety of machine learning models were developed to predict different aspects of the disease, such as long-term causes, organ dysfunction or ICU mortality. The number of training datasets used has increased significantly over time. However, these data now come from different waves of the pandemic, not always addressing the same therapeutic approaches over time as well as changing outcomes between two waves. The impact of these changes on model development has not yet been studied. OBJECTIVE The aim of the investigation was to examine the predictive performance of several models trained with data from one wave predicting the second wave´s data and the impact of a pooling of these data sets. Finally, a method for comparison of different datasets for heterogeneity is introduced. METHODS We used two datasets from wave one and two to develop several predictive models for mortality of the patients. Four classification algorithms were used: logistic regression (LR), support vector machine (SVM), random forest classifier (RF) and AdaBoost classifier (ADA). We also performed a mutual prediction on the data of that wave which was not used for training. Then, we compared the performance of models when a pooled dataset from two waves was used. The populations from the different waves were checked for heterogeneity using a convex hull analysis. RESULTS 63 patients from wave one (03-06/2020) and 54 from wave two (08/2020-01/2021) were evaluated. For both waves separately, we found models reaching sufficient accuracies up to 0.79 AUROC (95%-CI 0.76-0.81) for SVM on the first wave and up 0.88 AUROC (95%-CI 0.86-0.89) for RF on the second wave. After the pooling of the data, the AUROC decreased relevantly. In the mutual prediction, models trained on second wave´s data showed, when applied on first wave´s data, a good prediction for non-survivors but an insufficient classification for survivors. The opposite situation (training: first wave, test: second wave) revealed the inverse behaviour with models correctly classifying survivors and incorrectly predicting non-survivors. The convex hull analysis for the first and second wave populations showed a more inhomogeneous distribution of underlying data when compared to randomly selected sets of patients of the same size. CONCLUSIONS Our work demonstrates that a larger dataset is not a universal solution to all machine learning problems in clinical settings. Rather, it shows that inhomogeneous data used to develop models can lead to serious problems. With the convex hull analysis, we offer a solution for this problem. The outcome of such an analysis can raise concerns if the pooling of different datasets would cause inhomogeneous patterns preventing a better predictive performance.


2021 ◽  
Author(s):  
Wei Qiu ◽  
Hugh Chen ◽  
Ayse Berceste Dincer ◽  
Su-In Lee

AbstractExplainable artificial intelligence provides an opportunity to improve prediction accuracy over standard linear models using “black box” machine learning (ML) models while still revealing insights into a complex outcome such as all-cause mortality. We propose the IMPACT (Interpretable Machine learning Prediction of All-Cause morTality) framework that implements and explains complex, non-linear ML models in epidemiological research, by combining a tree ensemble mortality prediction model and an explainability method. We use 133 variables from NHANES 1999–2014 datasets (number of samples: n = 47, 261) to predict all-cause mortality. To explain our model, we extract local (i.e., per-sample) explanations to verify well-studied mortality risk factors, and make new discoveries. We present major factors for predicting x-year mortality (x = 1, 3, 5) across different age groups and their individualized impact on mortality prediction. Moreover, we highlight interactions between risk factors associated with mortality prediction, which leads to findings that linear models do not reveal. We demonstrate that compared with traditional linear models, tree-based models have unique strengths such as: (1) improving prediction power, (2) making no distribution assumptions, (3) capturing non-linear relationships and important thresholds, (4) identifying feature interactions, and (5) detecting different non-linear relationships between models. Given the popularity of complex ML models in prognostic research, combining these models with explainability methods has implications for further applications of ML in medical fields. To our knowledge, this is the first study that combines complex ML models and state-of-the-art feature attributions to explain mortality prediction, which enables us to achieve higher prediction accuracy and gain new insights into the effect of risk factors on mortality.


2021 ◽  
Vol 7 ◽  
pp. e746
Author(s):  
Muhammad Naeem ◽  
Jian Yu ◽  
Muhammad Aamir ◽  
Sajjad Ahmad Khan ◽  
Olayinka Adeleye ◽  
...  

Background Forecasting the time of forthcoming pandemic reduces the impact of diseases by taking precautionary steps such as public health messaging and raising the consciousness of doctors. With the continuous and rapid increase in the cumulative incidence of COVID-19, statistical and outbreak prediction models including various machine learning (ML) models are being used by the research community to track and predict the trend of the epidemic, and also in developing appropriate strategies to combat and manage its spread. Methods In this paper, we present a comparative analysis of various ML approaches including Support Vector Machine, Random Forest, K-Nearest Neighbor and Artificial Neural Network in predicting the COVID-19 outbreak in the epidemiological domain. We first apply the autoregressive distributed lag (ARDL) method to identify and model the short and long-run relationships of the time-series COVID-19 datasets. That is, we determine the lags between a response variable and its respective explanatory time series variables as independent variables. Then, the resulting significant variables concerning their lags are used in the regression model selected by the ARDL for predicting and forecasting the trend of the epidemic. Results Statistical measures—Root Mean Square Error (RMSE), Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE) and Symmetric Mean Absolute Percentage Error (SMAPE)—are used for model accuracy. The values of MAPE for the best-selected models for confirmed, recovered and deaths cases are 0.003, 0.006 and 0.115, respectively, which falls under the category of highly accurate forecasts. In addition, we computed 15 days ahead forecast for the daily deaths, recovered, and confirm patients and the cases fluctuated across time in all aspects. Besides, the results reveal the advantages of ML algorithms for supporting the decision-making of evolving short-term policies.


Molecules ◽  
2020 ◽  
Vol 25 (20) ◽  
pp. 4696
Author(s):  
Ștefan-Mihai Petrea ◽  
Mioara Costache ◽  
Dragoș Cristea ◽  
Ștefan-Adrian Strungaru ◽  
Ira-Adeline Simionov ◽  
...  

Metals are considered to be one of the most hazardous substances due to their potential for accumulation, magnification, persistence, and wide distribution in water, sediments, and aquatic organisms. Demersal fish species, such as turbot (Psetta maxima maeotica), are accepted by the scientific communities as suitable bioindicators of heavy metal pollution in the aquatic environment. The present study uses a machine learning approach, which is based on multiple linear and non-linear models, in order to effectively estimate the concentrations of heavy metals in both turbot muscle and liver tissues. For multiple linear regression (MLR) models, the stepwise method was used, while non-linear models were developed by applying random forest (RF) algorithm. The models were based on data that were provided from scientific literature, attributed to 11 heavy metals (As, Ca, Cd, Cu, Fe, K, Mg, Mn, Na, Ni, Zn) from both muscle and liver tissues of turbot exemplars. Significant MLR models were recorded for Ca, Fe, Mg, and Na in muscle tissue and K, Cu, Zn, and Na in turbot liver tissue. The non-linear tree-based RF prediction models (over 70% prediction accuracy) were identified for As, Cd, Cu, K, Mg, and Zn in muscle tissue and As, Ca, Cd, Mg, and Fe in turbot liver tissue. Both machine learning MLR and non-linear tree-based RF prediction models were identified to be suitable for predicting the heavy metal concentration from both turbot muscle and liver tissues. The models can be used for improving the knowledge and economic efficiency of linked heavy metals food safety and environment pollution studies.


Air is the most essential natural resource for the survival of humans, animals, and plants on the planet. Air is polluted due to the burning of fuels, exhaust gases from factories and industries, and mining operations. Now, air pollution becomes the most dangerous pollution that humanity ever faced. This causes many health effects on humans like respiratory, lung, and skin diseases, which also causes effects on plants, and animals to survive. Hence, air quality prediction and evaluation as becoming an important research area. In this paper, a machine learning-based prediction model is constructed for air quality forecasting. This model will help us to find the major pollutant present in the location along with the causes and sources of that particular pollutant. Air Quality Index value for India is used to predict air quality. The data is collected from various places throughout India so that the collected data is preprocessed to recover from null values, missing values, and duplicate values. The dataset is trained and tested with various machine learning algorithms like Logistic Regression, Naïve Bayes Classification, Random Forest, Support Vector Machine, K Nearest Neighbor, and Decision Tree algorithm in order to find the performance measurement of the above-mentioned algorithms. From this, the prediction model is constructed using the Decision Tree algorithm to predict the air quality, because it provides the best and highest accuracy of 100%. The machine learning-based air quality prediction model helps India meteorological department in predicting the future of air quality, and its status and depends on that they can take action.


2019 ◽  
Vol 29 (1) ◽  
pp. 61-80
Author(s):  
Sen Chai ◽  
Alexander D’Amour ◽  
Lee Fleming

Abstract Following widespread availability of computerized databases, much research has correlated bibliometric measures from papers or patents to subsequent success, typically measured as the number of publications or citations. Building on this large body of work, we ask the following questions: given available bibliometric information in one year, along with the combined theories on sources of creative breakthroughs from the literatures on creativity and innovation, how accurately can we explain the impact of authors in a given research community in the following year? In particular, who is most likely to publish, publish highly cited work, and even publish a highly cited outlier? And, how accurately can these existing theories predict breakthroughs using only contemporaneous data? After reviewing and synthesizing (often competing) theories from the literatures, we simultaneously model the collective hypotheses based on available data in the year before RNA interference was discovered. We operationalize author impact using publication count, forward citations, and the more stringent definition of being in the top decile of the citation distribution. Explanatory power of current theories altogether ranges from less than 9% for being top cited to 24% for productivity. Machine learning (ML) methods yield similar findings as the explanatory linear models, and tangible improvement only for non-linear Support Vector Machine models. We also perform predictions using only existing data until 1997, and find lower predictability than using explanatory models. We conclude with an agenda for future progress in the bibliometric study of creativity and look forward to ML research that can explain its models.


2018 ◽  
Vol 2 (1) ◽  
pp. 5 ◽  
Author(s):  
Dixian Zhu ◽  
Changjie Cai ◽  
Tianbao Yang ◽  
Xun Zhou

In this paper, we tackle air quality forecasting by using machine learning approaches to predict the hourly concentration of air pollutants (e.g., ozone, particle matter ( PM 2.5 ) and sulfur dioxide). Machine learning, as one of the most popular techniques, is able to efficiently train a model on big data by using large-scale optimization algorithms. Although there exist some works applying machine learning to air quality prediction, most of the prior studies are restricted to several-year data and simply train standard regression models (linear or nonlinear) to predict the hourly air pollution concentration. In this work, we propose refined models to predict the hourly air pollution concentration on the basis of meteorological data of previous days by formulating the prediction over 24 h as a multi-task learning (MTL) problem. This enables us to select a good model with different regularization techniques. We propose a useful regularization by enforcing the prediction models of consecutive hours to be close to each other and compare it with several typical regularizations for MTL, including standard Frobenius norm regularization, nuclear norm regularization, and ℓ 2 , 1 -norm regularization. Our experiments have showed that the proposed parameter-reducing formulations and consecutive-hour-related regularizations achieve better performance than existing standard regression models and existing regularizations.


Generally, Air pollution alludes to the issue of toxins into the air that are harmful to human well being and the entire planet. It can be described as one of the most dangerous threats that the humanity ever faced. It causes damage to animals, crops, forests etc. To prevent this problem in transport sectors have to predict air quality from pollutants using machine learning techniques. Subsequently, air quality assessment and prediction has turned into a significant research zone. The aim is to investigate machine learning based techniques for air quality prediction. The air quality dataset is preprocessed with respect to univariate analysis, bi-variate and multi-variate analysis, missing value treatments, data validation, data cleaning/preparing. Then, air quality is predicted using supervised machine learning techniques like Logistic Regression, Random Forest, K-Nearest Neighbors, Decision Tree and Support Vector Machines. The performance of various machine learning algorithms is compared with respect to Precision, Recall and F1 Score. It is found that Decision Tree algorithm works well for predicting air quality. This application can help the meteorological Department in predicting air quality. In future, this work can be optimized by applying Artificial Intelligence techniques.


Sign in / Sign up

Export Citation Format

Share Document