ML-SWAN-v1: a hybrid machine learning framework for the concentration prediction and discovery of transport pathways of surface water nutrients

2020 ◽  
Vol 13 (9) ◽  
pp. 4253-4270
Author(s):  
Benya Wang ◽  
Matthew R. Hipsey ◽  
Carolyn Oldham

Abstract. Nutrient data from catchments discharging to receiving waters are monitored for catchment management. However, nutrient data are often sparse in time and space and have non-linear responses to environmental factors, making it difficult to systematically analyse long- and short-term trends and undertake nutrient budgets. To address these challenges, we developed a hybrid machine learning (ML) framework that first separated baseflow and quickflow from total flow, generated data for missing nutrient species, and then utilised the pre-generated nutrient data as additional variables in a final simulation of tributary water quality. Hybrid random forest (RF) and gradient boosting machine (GBM) models were employed and their performance compared with a linear model, a multivariate weighted regression model, and stand-alone RF and GBM models that did not pre-generate nutrient data. The six models were used to predict six different nutrients discharged from two study sites in Western Australia: Ellen Brook (small and ephemeral) and the Murray River (large and perennial). Our results showed that the hybrid RF and GBM models had significantly higher accuracy and lower prediction uncertainty for almost all nutrient species across the two sites. The pre-generated nutrient and hydrological data were highlighted as the most important components of the hybrid model. The model results also indicated different hydrological transport pathways for total nitrogen (TN) export from two tributary catchments. We demonstrated that the hybrid model provides a flexible method to combine data of varied resolution and quality and is accurate for the prediction of responses of surface water nutrient concentrations to hydrologic variability.
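The abstract does not name the baseflow-separation method used in the framework's first step. As a minimal sketch, a one-parameter Lyne-Hollick digital filter (a standard baseflow-separation technique, shown here as an illustration rather than as the authors' actual choice) splits a total-flow series into quickflow and baseflow:

```python
def lyne_hollick(flow, alpha=0.925):
    """One-pass Lyne-Hollick digital filter.

    Splits a total streamflow series into (baseflow, quickflow).
    alpha is the filter parameter; values around 0.9-0.95 are typical.
    """
    quick = [0.0]  # assume the series starts on baseflow only
    for i in range(1, len(flow)):
        q = alpha * quick[-1] + 0.5 * (1 + alpha) * (flow[i] - flow[i - 1])
        quick.append(min(max(q, 0.0), flow[i]))  # constrain 0 <= quickflow <= total
    base = [f - q for f, q in zip(flow, quick)]
    return base, quick
```

The two separated components (plus the pre-generated nutrient data) then enter the final RF/GBM simulation as additional predictor variables.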

2020 ◽  
Author(s):  
Benya Wang ◽  
Matthew R. Hipsey ◽  
Carolyn Oldham

Abstract. Nutrient data from catchments discharging to receiving waters are necessary to monitor and manage water quality; however, they are often sparse in time and space and have non-linear responses to environmental factors, making it difficult to systematically analyse long- and short-term trends and undertake nutrient budgets. To address these challenges, we developed a hybrid machine learning (ML) framework that first separated baseflow and quickflow from total flow, and then generated data for missing nutrient species, using relationships with hydrological data, rainfall, and temporal data. The generated nutrient data were then included as additional variables in a final simulation of tributary water quality. Hybrid random forest (RF) and gradient boosting machine (GBM) models were employed and their performance compared with a linear model, a multivariate weighted regression model, and stand-alone RF and GBM models that did not pre-generate nutrient data. The six models were used to predict total nitrogen (TN), total phosphorus (TP), ammonia (NH3), dissolved organic carbon (DOC), dissolved organic nitrogen (DON), and filterable reactive phosphorus (FRP) discharged from two study sites in Western Australia: Ellen Brook (small and ephemeral) and the Murray River (large and perennial). Our results showed that the hybrid RF and GBM models had significantly higher accuracy and lower prediction uncertainty for almost all nutrient species across the two sites. We demonstrated that the hybrid model provides a flexible method to combine data of varied resolution and quality and is accurate for the prediction of responses of surface water nutrient concentrations to hydrologic variability.


2021 ◽  
Author(s):  
Zhihao Song ◽  
Bin Chen ◽  
Yue Huang ◽  
Li Dong ◽  
Tingting Yang

Abstract. Satellite remote-sensing aerosol optical depth (AOD) and meteorological elements were employed to invert PM2.5 in order to control air pollution more effectively. This paper proposes a restricted gradient-descent linear hybrid machine learning model (RGD-LHMLM) by integrating a random forest (RF), a gradient boosting regression tree (GBRT), and a deep neural network (DNN) to estimate the concentration of PM2.5 in China in 2019. The research data included Himawari-8 AOD with high spatiotemporal resolution, ERA5 meteorological data, and geographic information. The results showed that, in the hybrid model developed by linear fitting, the DNN accounted for the largest proportion, with a weight coefficient of 0.62. The R2 values of RF, GBRT, and DNN were reported as 0.79, 0.81, and 0.80, respectively. The generalization ability of the mixed model was better than that of each sub-model: R2 reached 0.84, while RMSE and MAE were reported as 12.92 µg/m3 and 8.01 µg/m3, respectively. For the RGD-LHMLM, R2 was above 0.7, and RMSE and MAE were below 20 µg/m3 and 15 µg/m3, respectively, in more than 70 % of the sites. Owing to the seasonal difference in the correlation between meteorological factors and PM2.5, the hybrid model performed best in winter (mean R2 of 0.84) and worst in summer (mean R2 of 0.71). The spatiotemporal distribution characteristics of PM2.5 in China were then estimated and analyzed. According to the results, there was severe pollution in winter, with an average PM2.5 concentration reported as 62.10 µg/m3, but only slight pollution in summer, with an average concentration of 47.39 µg/m3. The findings also indicate that North China and East China are more polluted than other areas, with an average annual PM2.5 concentration reported as 82.68 µg/m3. Moreover, pollution was relatively low in Inner Mongolia, Qinghai, and Tibet, where average PM2.5 concentrations were below 40 µg/m3.


2021 ◽  
Vol 14 (8) ◽  
pp. 5333-5347
Author(s):  
Zhihao Song ◽  
Bin Chen ◽  
Yue Huang ◽  
Li Dong ◽  
Tingting Yang

Abstract. Satellite remote sensing aerosol optical depth (AOD) and meteorological elements were employed to invert PM2.5 (fine particulate matter with a diameter below 2.5 µm) in order to control air pollution more effectively. This paper proposes a restricted gradient-descent linear hybrid machine learning model (RGD-LHMLM) by integrating a random forest (RF), a gradient boosting regression tree (GBRT), and a deep neural network (DNN) to estimate the concentration of PM2.5 in China in 2019. The research data included Himawari-8 AOD with high spatiotemporal resolution, ERA5 meteorological data, and geographic information. The results showed that, in the hybrid model developed by linear fitting, the DNN accounted for the largest proportion, with a weight coefficient of 0.62. The R2 values of RF, GBRT, and DNN were reported as 0.79, 0.81, and 0.80, respectively. The generalization ability of the mixed model was better than that of each sub-model: R2 (determination coefficient) reached 0.84, and RMSE (root mean square error) and MAE (mean absolute error) were reported as 12.92 and 8.01 µg m−3, respectively. For the RGD-LHMLM, R2 was above 0.7, and RMSE and MAE were below 20 and 15 µg m−3, respectively, in more than 70 % of the sites. Owing to the seasonal difference in the correlation between meteorological factors and PM2.5, the hybrid model performed best in winter (mean R2 of 0.84) and worst in summer (mean R2 of 0.71). The spatiotemporal distribution characteristics of PM2.5 in China were then estimated and analyzed. According to the results, there was severe pollution in winter, with an average PM2.5 concentration reported as 62.10 µg m−3, but only slight pollution in summer, with an average concentration of 47.39 µg m−3. The period from 10:00 to 15:00 LT (Beijing time, UTC+8) every day is the best time for model inversion; at this time the pollution is also high. The findings also indicate that North China and East China are more polluted than other areas, with an average annual PM2.5 concentration reported as 82.68 µg m−3. Moreover, pollution was relatively low in Inner Mongolia, Qinghai, and Tibet, where average PM2.5 concentrations were below 40 µg m−3.
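The abstract does not spell out what the "restriction" in the gradient-descent linear fit is. One plausible reading, sketched below as an assumption, is that the sub-model weights are constrained to be non-negative and sum to one (so the DNN's reported 0.62 would be one such weight), fitted by projected gradient descent on the mean squared error:

```python
def fit_restricted_weights(preds, target, lr=0.05, steps=2000):
    """Fit a restricted (non-negative, sum-to-one) linear blend of
    sub-model predictions by projected gradient descent on the MSE.

    preds  : list of prediction lists, one per sub-model (e.g. RF, GBRT, DNN)
    target : observed values
    """
    m, n = len(preds), len(target)
    w = [1.0 / m] * m  # start from a uniform blend
    for _ in range(steps):
        blend = [sum(w[k] * preds[k][i] for k in range(m)) for i in range(n)]
        grad = [2.0 / n * sum((blend[i] - target[i]) * preds[k][i] for i in range(n))
                for k in range(m)]
        w = [max(w[k] - lr * grad[k], 0.0) for k in range(m)]  # step, then clip
        s = sum(w) or 1.0
        w = [x / s for x in w]  # renormalise onto the simplex (approximate projection)
    return w
```

The clip-and-renormalise step is a common approximation to an exact simplex projection; the authors' actual optimisation details may differ.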


2021 ◽  
Vol 3 (1) ◽  
Author(s):  
B. A Omodunbi

Diabetes mellitus is a health disorder that occurs when the blood sugar level becomes extremely high because the body cannot produce or properly use the required amount of insulin. The ailment is among the major causes of death in Nigeria and the world at large. This study was carried out to detect diabetes mellitus by developing a hybrid model that comprises two machine learning models, namely the Light Gradient Boosting Machine (LGBM) and K-Nearest Neighbors (KNN). This research is aimed at developing a machine learning model for detecting the occurrence of diabetes in patients. The performance metrics employed in evaluating the findings of this study are the Receiver Operating Characteristic (ROC) curve, five-fold cross-validation, precision, and accuracy score. The proposed system had an accuracy of 91%, and the area under the ROC curve was 93%. The experimental results show that the prediction accuracy of the hybrid model is better than that of traditional machine learning models.
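The evaluation above relies on five-fold cross-validation. Independently of the LGBM/KNN models themselves (which would come from their respective libraries), the fold-splitting step can be sketched as:

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, test_indices) for each of k contiguous folds.

    Every sample appears in exactly one test fold, so metrics averaged
    over the folds use each observation for validation exactly once.
    """
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, test
        start = stop
```

In practice the data would be shuffled (or stratified by class) before splitting; this sketch keeps the indices contiguous for clarity.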


2021 ◽  
Vol 13 (19) ◽  
pp. 3838
Author(s):  
Yan Liu ◽  
Sha Zhang ◽  
Jiahua Zhang ◽  
Lili Tang ◽  
Yun Bai

Accurate estimates of evapotranspiration (ET) over croplands on a regional scale can provide useful information for agricultural management. Hybrid ET models that combine a physical framework, namely the Penman-Monteith equation, with machine learning (ML) algorithms have proven to be effective for ET estimation. However, few studies have compared the performance of multiple hybrid model versions using different ML algorithms. In this study, we constructed six different hybrid ET models based on six classical ML algorithms, namely the K-nearest neighbor algorithm, random forest, support vector machine, extreme gradient boosting algorithm, artificial neural network (ANN), and long short-term memory (LSTM), using observed data from 17 eddy covariance flux sites over cropland across the globe. Each hybrid model was assessed for estimating ET with ten different input data combinations. In each hybrid model, the ML algorithm was used to model the stomatal conductance (Gs), and ET was then estimated using the Penman-Monteith equation along with the ML-based Gs. The results showed that all hybrid models can reasonably reproduce cropland ET when using two or more remote sensing (RS) factors. The results also showed that although including RS factors can remarkably improve ET estimates, hybrid models (except for LSTM) using three or more RS factors were only marginally better than those using two. We also showed that the ANN-based model exhibits the optimal performance among all ML-based models in modeling daily ET, as indicated by the lower root-mean-square error (RMSE, 18.67–21.23 W m−2) and higher correlation coefficient (r, 0.90–0.94). The ANN is thus more suitable for modeling Gs than the other ML algorithms under investigation, and it can provide methodological support for accurate estimation of cropland ET on a regional scale.
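The physical half of the hybrid models is the Penman-Monteith equation, which turns an ML-predicted surface conductance Gs into a latent heat flux. A minimal sketch (the exact variable set and unit conventions the authors use may differ):

```python
def penman_monteith_le(rn, g, delta, vpd, ga, gs,
                       rho_a=1.2, cp=1013.0, gamma=0.066):
    """Latent heat flux LE (W m-2) from the Penman-Monteith equation.

    rn, g  : net radiation and ground heat flux (W m-2)
    delta  : slope of the saturation vapour pressure curve (kPa K-1)
    vpd    : vapour pressure deficit (kPa)
    ga, gs : aerodynamic and surface (stomatal) conductance (m s-1)
    rho_a  : air density (kg m-3); cp : specific heat of air (J kg-1 K-1)
    gamma  : psychrometric constant (kPa K-1)
    """
    return (delta * (rn - g) + rho_a * cp * vpd * ga) / \
           (delta + gamma * (1.0 + ga / gs))
```

In the hybrid setup described above, gs is the output of the trained ML model while the remaining inputs come from meteorological forcing and remote sensing data.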


Telecom ◽  
2022 ◽  
Vol 3 (1) ◽  
pp. 52-69
Author(s):  
Jabed Al Faysal ◽  
Sk Tahmid Mostafa ◽  
Jannatul Sultana Tamanna ◽  
Khondoker Mirazul Mumenin ◽  
Md. Mashrur Arifin ◽  
...  

In the past few years, Internet of Things (IoT) devices have evolved rapidly, and their use is increasing dramatically to make our daily activities easier than ever. However, numerous security flaws persist on IoT devices because the majority of them lack the memory and computing resources necessary for adequate security operations. As a result, IoT devices are affected by a variety of attacks. A single attack on network systems or devices can lead to significant damage to data security and privacy. However, machine learning techniques can be applied to detect IoT attacks. In this paper, a hybrid machine learning scheme called XGB-RF is proposed for detecting intrusion attacks. The proposed hybrid method was applied to the N-BaIoT dataset containing hazardous botnet attacks. Random forest (RF) was used for feature selection, and an eXtreme Gradient Boosting (XGB) classifier was used to detect different types of attacks on IoT environments. The performance of the proposed XGB-RF scheme was evaluated on several metrics, and the model successfully detects 99.94% of the attacks. After comparison with state-of-the-art algorithms, our proposed model achieved better performance on every metric. As the proposed scheme is capable of detecting botnet attacks effectively, it can significantly contribute to reducing the security concerns associated with IoT systems.
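The first stage of a scheme like XGB-RF reduces to ranking features by RF importance and keeping the top k before fitting the boosted classifier. Given an importance vector (the values below are made up for illustration; in the real pipeline they would come from a trained random forest), the selection step is:

```python
def select_top_k(rows, importances, k):
    """Keep only the k feature columns with the highest importance scores.

    rows        : list of feature rows (list of lists)
    importances : one score per column, e.g. from a trained random forest
    Returns the reduced rows and the kept column indices (original order).
    """
    keep = sorted(range(len(importances)),
                  key=lambda j: importances[j], reverse=True)[:k]
    keep.sort()  # preserve the original column order
    return [[row[j] for j in keep] for row in rows], keep
```

The XGB classifier would then be trained on the reduced feature matrix.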


F1000Research ◽  
2021 ◽  
Vol 10 ◽  
pp. 1143
Author(s):  
Kalaiarasi Sonai Muthu Anbananthen ◽  
Sridevi Subbiah ◽  
Deisy Chelliah ◽  
Prithika Sivakumar ◽  
Varsha Somasundaram ◽  
...  

Background: In recent times, digitization is gaining importance in different domains of knowledge such as agriculture, medicine, recommendation platforms, the Internet of Things (IoT), and weather forecasting. In agriculture, crop yield estimation is essential for improving productivity and for decision-making processes such as financial market forecasting and addressing food security issues. The main objective of this article is to predict and improve the accuracy of crop yield forecasting using hybrid machine learning (ML) algorithms. Methods: This article proposes hybrid ML algorithms that use specialized ensembling methods such as stacked generalization, gradient boosting, random forest, and least absolute shrinkage and selection operator (LASSO) regression. Stacked generalization is a model that learns how best to combine the predictions from two or more models trained on the dataset. To demonstrate the applications of the proposed algorithm, aerial-intel datasets from the GitHub data science repository are used. Results: Based on the experiments on the agricultural data, the following observations were made. The performance of the individual algorithms and the hybrid ML algorithms was compared using cross-validation to identify the most promising performers for the agricultural dataset. The accuracies of the random forest regressor, gradient boosted tree regression, and the stacked generalization ensemble method are 87.71%, 86.98%, and 88.89%, respectively. Conclusions: The proposed stacked generalization ML algorithm statistically outperforms the others with an accuracy of 88.89%, demonstrating that the proposed approach is effective for predicting crop yield. The system also gives fast and accurate responses to farmers.
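Stacked generalization, as described above, trains base models, collects their out-of-fold predictions, and fits a meta-learner on those predictions. A toy sketch with two deliberately simple stand-in base learners (the paper's actual bases are random forest, gradient boosting, and LASSO) and a convex-blend meta-learner fitted by grid search:

```python
def fit_mean(xs, ys):
    """Stand-in base learner: predict the training mean everywhere."""
    m = sum(ys) / len(ys)
    return lambda x: m

def fit_1nn(xs, ys):
    """Stand-in base learner: predict the target of the nearest training x."""
    def predict(x):
        j = min(range(len(xs)), key=lambda i: abs(xs[i] - x))
        return ys[j]
    return predict

def stacked_blend_weight(xs, ys, fit_a, fit_b, k=3):
    """Stacked generalization: build out-of-fold base predictions, then
    fit a convex blend weight w (the meta-learner) on those predictions."""
    n = len(xs)
    oof_a, oof_b = [0.0] * n, [0.0] * n
    for f in range(k):
        train = [i for i in range(n) if i % k != f]
        ma = fit_a([xs[i] for i in train], [ys[i] for i in train])
        mb = fit_b([xs[i] for i in train], [ys[i] for i in train])
        for i in range(n):
            if i % k == f:  # predict only on the held-out fold
                oof_a[i], oof_b[i] = ma(xs[i]), mb(xs[i])
    def mse(w):
        return sum((w * oof_a[i] + (1 - w) * oof_b[i] - ys[i]) ** 2
                   for i in range(n)) / n
    return min((i / 100 for i in range(101)), key=mse)
```

Using out-of-fold rather than in-sample predictions is the point of the technique: it stops the meta-learner from rewarding base models that merely memorise the training data.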


2020 ◽  
Vol 39 (5) ◽  
pp. 6579-6590
Author(s):  
Sandy Çağlıyor ◽  
Başar Öztayşi ◽  
Selime Sezgin

The motion picture industry is one of the largest industries worldwide and has significant importance in the global economy. Considering the high stakes and high risks in the industry, forecast models and decision support systems are gaining importance. Several attempts have been made to estimate the theatrical performance of a movie before, or at the early stages of, its release. Nevertheless, these models are mostly used for predicting domestic performance, and the industry still struggles to predict box office performance in overseas markets. In this study, the aim is to design a forecast model using different machine learning algorithms to estimate the theatrical success of US movies in Turkey. A dataset of 1559 movies is constructed from various sources. Firstly, independent variables are grouped as pre-release, distributor type, and international distribution based on their characteristics. The number of attendances is discretized into three classes. Four popular machine learning algorithms (artificial neural networks, decision tree regression, gradient boosting trees, and random forest) are employed, and the impact of each variable group is observed by comparing the performance of the resulting models. The number of target classes is then increased to five and eight, and the results are compared with previously developed models in the literature.
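The abstract does not say how attendance is discretized into three classes; one common choice, sketched here as an assumption, is equal-frequency (tercile) binning:

```python
def discretize_terciles(values):
    """Map each value to class 0, 1, or 2 by equal-frequency binning.

    The lower and upper tercile boundaries are taken from the sorted
    values, so each class receives roughly a third of the samples.
    """
    ranked = sorted(values)
    n = len(ranked)
    lo, hi = ranked[n // 3], ranked[(2 * n) // 3]
    return [0 if v < lo else (1 if v < hi else 2) for v in values]
```

Extending the same idea to quintiles or octiles would give the five- and eight-class targets the study also explores.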


2020 ◽  
Author(s):  
Saeed Nosratabadi ◽  
Amir Mosavi ◽  
Puhong Duan ◽  
Pedram Ghamisi ◽  
Ferdinand Filip ◽  
...  

This paper provides a state-of-the-art investigation of advances in data science in emerging economic applications. The analysis covers novel data science methods in four classes: deep learning models, hybrid deep learning models, hybrid machine learning models, and ensemble models. Application domains include a wide and diverse range of economics research, from the stock market, marketing, and e-commerce to corporate banking and cryptocurrency. The PRISMA method, a systematic literature review methodology, was used to ensure the quality of the survey. The findings reveal that the trends follow the advancement of hybrid models, which, based on the accuracy metric, outperform other learning algorithms. It is further expected that the trends will converge toward the advancement of sophisticated hybrid deep learning models.

