Machine Learning Models Based on Random Forest Feature Selection and Bayesian Optimization for Predicting Daily Global Solar Radiation

2021 ◽  
Vol 11 (1) ◽  
pp. 309-323
Author(s):  
Mohamed Chaibi ◽  
El Mahjoub Benghoulam ◽  
Lhoussaine Tarik ◽  
Mohamed Berrada ◽  
Abdellah El Hmaidi

Prediction of daily global solar radiation (H) with simple and highly accurate models would be beneficial for solar energy conversion systems. In this paper, we propose a hybrid machine learning methodology integrating two feature selection methods and a Bayesian optimization algorithm to predict H in the city of Fez, Morocco. First, we identified the most significant predictors using two Random Forest methods of feature importance: Mean Decrease in Impurity (MDI) and Mean Decrease in Accuracy (MDA). Then, based on the feature selection results, ten models were developed and compared: (1) five standalone machine learning (ML) models including Classification and Regression Trees (CART), Random Forests (RF), Bagged Trees Regression (BTR), Support Vector Regression (SVR), and Multi-Layer Perceptron (MLP); and (2) the same models tuned by the Bayesian optimization (BO) algorithm: CART-BO, RF-BO, BTR-BO, SVR-BO, and MLP-BO. Both MDI and MDA techniques revealed that extraterrestrial solar radiation and sunshine duration fraction were the most influential features. The BO approach improved the predictive accuracy of the MLP, CART, SVR, and BTR models and prevented the CART model from overfitting. The best improvements were obtained using the MLP model, where RMSE and MAE were reduced by 17.6% and 17.2%, respectively. Among the studied models, the SVR-BO algorithm provided the best trade-off between prediction accuracy (RMSE = 0.4473 kWh/m²/day, MAE = 0.3381 kWh/m²/day, and R² = 0.9465), stability (with a 0.0033 kWh/m²/day increase in RMSE), and computational cost.
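
The accuracy figures quoted above (RMSE, MAE, R², and percentage reductions after tuning) follow standard definitions. A minimal sketch of those definitions is given here for reference; the function names and the sample values are illustrative assumptions and are not taken from the paper.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error, in the same units as the target."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def percent_reduction(before, after):
    """Relative reduction of an error metric, in percent."""
    return 100.0 * (before - after) / before


# Made-up daily radiation values (kWh/m²/day), for illustration only.
observed = [4.0, 5.0, 6.0, 7.0]
predicted = [4.5, 5.0, 5.5, 7.0]

print(rmse(observed, predicted), mae(observed, predicted), r2(observed, predicted))
```

A statement such as "RMSE was reduced by 17.6%" corresponds to `percent_reduction(rmse_before, rmse_after)` computed on the same test set before and after tuning.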

Energies ◽  
2021 ◽  
Vol 14 (21) ◽  
pp. 7367
Author(s):  
Mohamed Chaibi ◽  
EL Mahjoub Benghoulam ◽  
Lhoussaine Tarik ◽  
Mohamed Berrada ◽  
Abdellah El Hmaidi

Machine learning (ML) models are commonly used in solar modeling due to their high predictive accuracy. However, the predictions of these models are difficult to explain and trust. This paper aims to demonstrate the utility of two interpretation techniques to explain and improve the predictions of ML models. We first compared the predictive performance of Light Gradient Boosting (LightGBM) with three benchmark models, namely multilayer perceptron (MLP), multiple linear regression (MLR), and support-vector regression (SVR), for estimating the global solar radiation (H) in the city of Fez, Morocco. Then, the predictions of the most accurate model were explained by two model-agnostic explanation techniques: permutation feature importance (PFI) and Shapley additive explanations (SHAP). The results indicated that LightGBM (R² = 0.9377, RMSE = 0.4827 kWh/m², MAE = 0.3614 kWh/m²) provided predictive accuracy similar to that of SVR and outperformed MLP and MLR in the testing stage. Both PFI and SHAP methods showed that extraterrestrial solar radiation (H0) and sunshine duration fraction (SF) are the two most important parameters that affect H estimation. Moreover, the SHAP method established how each feature influences the LightGBM estimations. The predictive accuracy of the LightGBM model improved slightly after re-examination of the features: the model combining H0, SF, and RH performed better than the model with all features.
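
Permutation feature importance, as named above, can be illustrated without any ML library: shuffle one input column at a time and record how much the test error grows. The sketch below is an assumption-laden toy, not the authors' pipeline. `toy_model` is a hypothetical stand-in for a fitted model, shaped like the Angström-Prescott relation with illustrative coefficients, and the data rows are invented.

```python
import math
import random


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def toy_model(row):
    """Hypothetical fitted model: uses H0 and SF, ignores the third feature."""
    h0, sf, _unused = row
    return h0 * (0.25 + 0.5 * sf)


def permutation_importance(model, rows, targets, n_repeats=5, seed=0):
    """Mean increase in RMSE when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = rmse(targets, [model(r) for r in rows])
    importances = []
    for j in range(len(rows[0])):
        increases = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)  # break the link between feature j and the target
            permuted = [r[:j] + (v,) + r[j + 1:] for r, v in zip(rows, column)]
            increases.append(rmse(targets, [model(r) for r in permuted]) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


# Invented rows: (H0, SF, third feature the toy model does not use).
rows = [
    (6.0, 0.3, 40.0), (7.5, 0.5, 55.0), (9.0, 0.6, 35.0), (10.5, 0.8, 60.0),
    (11.0, 0.7, 45.0), (8.0, 0.4, 70.0), (5.5, 0.2, 50.0), (10.0, 0.9, 65.0),
]
targets = [toy_model(r) for r in rows]  # noise-free targets, so baseline RMSE is 0

importances = permutation_importance(toy_model, rows, targets)
print(importances)
```

Because the toy model never reads the third column, shuffling it leaves every prediction unchanged and its importance is exactly zero, while shuffling H0 or SF increases the error.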


Author(s):  
Harsha A K

Abstract: Since the advent of encryption, there has been a steady increase in malware being transmitted over encrypted networks. Traditional malware detection approaches, such as packet content analysis, are inefficient in dealing with encrypted data. In the absence of actual packet contents, we can make use of other features such as packet size, arrival time, source and destination addresses, and similar metadata to detect malware. Such information can be used to train machine learning classifiers to distinguish malicious from benign packets. In this paper, we offer an efficient malware detection approach using machine learning classification algorithms such as support vector machine, random forest, and extreme gradient boosting. We employ an extensive feature selection process to reduce the dimensionality of the chosen dataset. The dataset is then split into training and testing sets. Machine learning algorithms are trained using the training set. These models are then evaluated against the testing set in order to assess their respective performances. We further attempt to tune the hyperparameters of the algorithms in order to achieve better results. Random forest and extreme gradient boosting algorithms performed exceptionally well in our experiments, resulting in area under the curve values of 0.9928 and 0.9998, respectively. Our work demonstrates that malware traffic can be effectively classified using conventional machine learning algorithms and also shows the importance of dimensionality reduction in such classification problems. Keywords: Malware Detection, Extreme Gradient Boosting, Random Forest, Feature Selection.
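
The area under the ROC curve reported above can be read as the probability that a randomly chosen malicious sample receives a higher score than a randomly chosen benign one. The following sketch computes it by direct pairwise comparison; the function name, the 0/1 label encoding, and the sample scores are assumptions made for illustration, not details from the paper.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: 1 for the positive class (e.g. malicious), 0 for the negative class.
    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Invented classifier scores for four test samples.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 3 of 4 pairs ordered correctly
```

The pairwise loop is quadratic in the number of samples, which is fine for a small illustration; library implementations typically sort the scores instead.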


2021 ◽  
Author(s):  
Mohamed Chaibi ◽  
EL Mahjoub Benghoulam ◽  
Lhoussaine Tarik ◽  
Mohamed Berrada ◽  
Abdellah El Hmaidi

2021 ◽  
Author(s):  
Yue Jia ◽  
Yongjun Su ◽  
Fengchun Wang ◽  
Pengcheng Li ◽  
Shuyi Huo

Abstract: Reliable global solar radiation (Rs) information is crucial for the design and management of solar energy systems for agricultural and industrial production. However, Rs measurements are unavailable in many regions of the world, which impedes the development and application of solar energy. To accurately estimate Rs, this study developed a novel machine learning model, called a Gaussian exponential model (GEM), for daily global Rs estimation. The GEM was compared with four other machine learning models and two empirical models to assess its applicability using daily meteorological data from 1997–2016 from four stations in Northeast China. The results showed that the GEM with complete inputs had the best performance. Machine learning models provided better estimates than empirical models when trained by the same input data. Sunshine duration was the most effective factor determining the accuracy of the machine learning models. Overall, the GEM with complete inputs had the highest accuracy and is recommended for modeling daily Rs in Northeast China.


2020 ◽  
Vol 10 (3) ◽  
pp. 869 ◽  
Author(s):  
Hong Zhang ◽  
Jian Zhou ◽  
Danial Jahed Armaghani ◽  
M. M. Tahir ◽  
Binh Thai Pham ◽  
...  

In mining and civil engineering applications, a reliable and proper analysis of ground vibration due to quarry blasting is an extremely important task. While advances in machine learning have led to numerous powerful regression models, the usefulness of these models for modeling the peak particle velocity (PPV) remains largely unexplored. Using an extensive database comprising quarry site datasets enriched with vibration variables, this article compares the predictive performance of five selected machine learning models, namely classification and regression trees (CART), chi-squared automatic interaction detection (CHAID), random forest (RF), artificial neural network (ANN), and support vector machine (SVM), for PPV analysis. Before these models were developed, feature selection was applied in order to select the most important input parameters for PPV. The results of this study show that RF performed substantially better than any of the other investigated regression models, including the frequently used SVM and ANN models. The results and process analysis of this study can be utilized by other researchers/designers in similar fields.


2020 ◽  
Vol 2020 ◽  
pp. 1-11
Author(s):  
Sandeep Dhakal ◽  
Yogesh Gautam ◽  
Aayush Bhattarai

Global solar radiation (GSR) is a critical variable for designing photovoltaic cells, solar furnaces, solar collectors, and other passive solar applications. In Nepal, the high initial cost and subsequent maintenance cost of the instruments required to measure GSR have restricted their deployment across the country. The current study compares six different temperature-based empirical models, artificial neural network (ANN) models, and five other machine learning (ML) models for estimating daily GSR utilizing readily available meteorological data at Biratnagar Airport. Amongst the temperature-based models, the model developed by Fan et al. performs better than the rest, with an R² of 0.7498 and RMSE of 2.0162 MJ m−2 d−1. A feed-forward multilayer perceptron (MLP) is utilized to model daily GSR using extraterrestrial solar radiation, sunshine duration, maximum and minimum ambient temperature, precipitation, and relative humidity as inputs. ANN3 performs better than the other ANN models, with an R² of 0.8446 and RMSE of 1.4595 MJ m−2 d−1. Likewise, stepwise linear regression performs better than the other ML models, with an R² of 0.8870 and RMSE of 1.5143 MJ m−2 d−1. Thus, the model developed by Fan et al. is recommended to estimate daily GSR in regions where only ambient temperature data are available. Similarly, the more robust ANN3 and stepwise linear regression models are recommended to estimate daily GSR in regions where data on sunshine duration, maximum and minimum ambient temperature, precipitation, and relative humidity are available.
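
The errors above are stated in MJ m−2 d−1, whereas some other abstracts on this page use kWh/m²/day. Since 1 kWh equals 3.6 MJ, a small helper can put the figures on a common scale. This is only a unit conversion sketch with a hypothetical function name; the studies use different sites and datasets, so converted values remain indicative rather than directly comparable.

```python
MJ_PER_KWH = 3.6  # 1 kWh = 3.6 MJ


def mj_to_kwh(value_mj):
    """Convert an energy density from MJ m−2 d−1 to kWh/m²/day."""
    return value_mj / MJ_PER_KWH


# RMSE values quoted in the abstract above, converted to kWh/m²/day.
for label, rmse_mj in [("Fan et al.", 2.0162), ("ANN3", 1.4595), ("stepwise LR", 1.5143)]:
    print(label, round(mj_to_kwh(rmse_mj), 4))
```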


Complexity ◽  
2021 ◽  
Vol 2021 ◽  
pp. 1-12
Author(s):  
Nahla F. Omran ◽  
Sara F. Abd-el Ghany ◽  
Hager Saleh ◽  
Ayman Nabil

Twitter integrates with streaming data technologies and machine learning to add new value to healthcare. This paper presents a real-time system to predict breast cancer based on patients' health data streamed from Twitter. The proposed system consists of two major components: an offline model-building stage and an online prediction pipeline. For the first component, we computed the correlation between features to reduce the number of features in the Breast Cancer Wisconsin Diagnostic dataset. Two feature selection algorithms, recursive feature elimination and univariate feature selection, were then applied to the remaining features to select the essential ones. Four classifiers (decision tree, logistic regression, support vector machine, and random forest) were trained on the features retained after correlation analysis and feature selection. Hyperparameter tuning and cross-validation were also applied to optimize the models and enhance accuracy. Apache Spark, Apache Kafka, and the Twitter Streaming API are used to develop the second component. The best model from the first component, the one with the highest accuracy, predicts breast cancer in real time from streamed tweets. The results showed that the random forest classifier achieved the best accuracy.
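
The correlation-based reduction step mentioned above can be sketched as follows: compute pairwise Pearson correlations and drop the later feature of any pair whose absolute correlation exceeds a threshold. The threshold of 0.9, the greedy keep-first rule, and the toy columns are assumptions for illustration; the abstract does not state the rule the authors used.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def drop_correlated(columns, threshold=0.9):
    """Keep a feature only if it is not highly correlated with one already kept.

    columns: dict mapping feature name to its list of values (insertion order matters).
    """
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Toy columns: "b" duplicates "a" up to scale, "c" is only weakly related.
columns = {
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
}
print(drop_correlated(columns))
```

Which member of a correlated pair is kept depends on column order here; a real pipeline might instead keep the one more strongly related to the target.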


2020 ◽  
Vol 2020 ◽  
pp. 1-26
Author(s):  
Abdurrahman Burak Guher ◽  
Sakir Tasdemir ◽  
Bulent Yaniktepe

The precise estimation of solar radiation is of great importance in solar energy applications with respect to installation and capacity. In estimation modelling for selected target locations, various computer-based and experimental methods and techniques are employed. In the present study, the Multilayer Feed-Forward Neural Network (MFFNN), K-Nearest Neighbors (K-NN), a Library for Support Vector Machines (LibSVM), and M5 rules algorithms, which are among the Machine Learning (ML) algorithms, were used to estimate the hourly average solar radiation of two geographic locations on the same latitude. The input variables that had the most impact on solar radiation were identified and grouped as a result of 29 different applications that were developed by using 6 different feature selection methods with Waikato Environment for Knowledge Analysis (WEKA) software. Estimation models were developed by using the selected data groups and all input variables for each target location. The results show that the estimations developed with the feature selection method were more successful for the target locations, and the radiation potentials were similar. The performance of the estimation models was evaluated by comparing each model with different statistical indicators and with previous studies. According to the RMSE, MAE, R², and SMAPE statistical scales, the results of the most successful estimation models that were developed with MFFNN were 0.0508-0.0536, 0.0341-0.0352, 0.9488-0.9656, and 7.77%-7.79%, respectively.
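
SMAPE, listed among the indicators above, has several variants in the literature. The sketch below uses one common definition, the mean of |prediction − actual| divided by the average of their absolute values, expressed in percent. Whether the study used this exact variant is an assumption, and the sample values are invented.

```python
def smape(actual, predicted):
    """Symmetric mean absolute percentage error (one common variant), in percent.

    Terms where both actual and predicted are zero contribute zero error.
    """
    total = 0.0
    for a, p in zip(actual, predicted):
        denom = (abs(a) + abs(p)) / 2.0
        if denom != 0.0:
            total += abs(p - a) / denom
    return 100.0 * total / len(actual)


# Invented values for illustration.
print(smape([100.0, 200.0], [110.0, 180.0]))
```

Unlike RMSE and MAE, this metric is scale-free, which is why it is reported as a percentage alongside the unit-bearing errors.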

