An Interpretable Machine Learning Model for Daily Global Solar Radiation Prediction

Machine learning (ML) models are commonly used in solar modeling because of their high predictive accuracy. However, their predictions are difficult to explain and trust. This paper demonstrates the utility of two interpretation techniques for explaining and improving the predictions of ML models. We first compared the predictive performance of Light Gradient Boosting (LightGBM) with that of three benchmark models, namely multilayer perceptron (MLP), multiple linear regression (MLR), and support vector regression (SVR), for estimating the global solar radiation (H) in the city of Fez, Morocco. Then, the predictions of the most accurate model were explained by two model-agnostic explanation techniques: permutation feature importance (PFI) and Shapley additive explanations (SHAP). The results indicated that LightGBM (R² = 0.9377, RMSE = 0.4827 kWh/m², MAE = 0.3614 kWh/m²) provided predictive accuracy similar to that of SVR and outperformed MLP and MLR in the testing stage. Both PFI and SHAP showed that extraterrestrial solar radiation (H0) and sunshine duration fraction (SF) are the two most important parameters affecting the estimation of H. Moreover, the SHAP method established how each feature influences the LightGBM estimations. Re-examining the features slightly improved the predictive accuracy of LightGBM further: the model combining H0, SF, and RH performed better than the model with all features.
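As a rough illustration of permutation feature importance, the sketch below shuffles one feature column at a time and records the mean increase in RMSE. It is a minimal standard-library example under stated assumptions, not the paper's implementation: `toy_model`, its coefficients, and the synthetic data are invented stand-ins for a fitted model and for H0, SF, and an irrelevant third feature.

```python
import random


def rmse(y_true, y_pred):
    """Root mean square error between two equal-length sequences."""
    n = len(y_true)
    return (sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n) ** 0.5


def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Mean increase in RMSE when a single feature column is shuffled."""
    rng = random.Random(seed)
    baseline = rmse(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the rows with only feature j permuted.
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            increases.append(rmse(y, [predict(row) for row in X_perm]) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


def toy_model(row):
    # Hypothetical stand-in for a fitted model; it ignores the third feature.
    h0, sf, _unused = row
    return h0 * (0.25 + 0.5 * sf)


# Synthetic data (assumption): columns play the roles of H0, SF, and noise.
data_rng = random.Random(42)
X = [[data_rng.uniform(4, 11), data_rng.uniform(0, 1), data_rng.uniform(0, 1)]
     for _ in range(200)]
y = [toy_model(row) for row in X]

pfi = permutation_importance(toy_model, X, y)
print(pfi)
```

Because the toy model never reads the third column, shuffling it leaves the error unchanged, while shuffling either of the first two columns raises it.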

Machine Learning Models Based on Random Forest Feature Selection and Bayesian Optimization for Predicting Daily Global Solar Radiation

International Journal of Renewable Energy Development ◽

10.14710/ijred.2022.41451 ◽

2021 ◽

Vol 11 (1) ◽

pp. 309-323

Author(s):

Mohamed Chaibi ◽

El Mahjoub Benghoulam ◽

Lhoussaine Tarik ◽

Mohamed Berrada ◽

Abdellah El Hmaidi

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Random Forest ◽

Solar Radiation ◽

Predictive Accuracy ◽

Sunshine Duration ◽

Computational Cost ◽

Global Solar Radiation ◽

Bayesian Optimization ◽

Support Vector

Predicting daily global solar radiation (H) with simple and highly accurate models would benefit solar energy conversion systems. In this paper, we proposed a hybrid machine learning methodology integrating two feature selection methods and a Bayesian optimization algorithm to predict H in the city of Fez, Morocco. First, we identified the most significant predictors using two Random Forest methods of feature importance: Mean Decrease in Impurity (MDI) and Mean Decrease in Accuracy (MDA). Then, based on the feature selection results, ten models were developed and compared: (1) five standalone machine learning (ML) models, namely Classification and Regression Trees (CART), Random Forests (RF), Bagged Trees Regression (BTR), Support Vector Regression (SVR), and Multi-Layer Perceptron (MLP); and (2) the same models tuned by the Bayesian optimization (BO) algorithm: CART-BO, RF-BO, BTR-BO, SVR-BO, and MLP-BO. Both MDI and MDA revealed that extraterrestrial solar radiation and sunshine duration fraction were the most influential features. The BO approach improved the predictive accuracy of the MLP, CART, SVR, and BTR models and prevented the CART model from overfitting. The largest improvements were obtained with the MLP model, for which RMSE and MAE were reduced by 17.6% and 17.2%, respectively. Among the studied models, the SVR-BO algorithm provided the best trade-off between prediction accuracy (RMSE = 0.4473 kWh/m²/day, MAE = 0.3381 kWh/m²/day, and R² = 0.9465), stability (with a 0.0033 kWh/m²/day increase in RMSE), and computational cost.

A New Hybrid Model for Hourly Solar Radiation Forecasting Using Daily Classification Technique and Machine Learning Algorithms

Mathematical Problems in Engineering ◽

10.1155/2021/6692626 ◽

2021 ◽

Vol 2021 ◽

pp. 1-12

Author(s):

Hamza Ali-Ou-Salah ◽

Benyounes Oukarfi ◽

Khalid Bahani ◽

Mohammed Moujabbir

Keyword(s):

Machine Learning ◽

Solar Radiation ◽

Global Solar Radiation ◽

Machine Learning Algorithms ◽

Gradient Boosting ◽

Support Vector ◽

Forecasting Accuracy ◽

Photovoltaic Power ◽

Classification Technique ◽

Artificial Neural Network (ANN)

Photovoltaic power generation depends significantly on solar radiation, which is variable and unpredictable in nature. As a result, the production of electricity from photovoltaic power cannot be guaranteed permanently during the operational phase. Forecasting global solar radiation can play a key role in overcoming this drawback of intermittency. This paper proposes a new hybrid method based on machine learning (ML) algorithms and a daily classification technique to forecast global solar radiation 1 h ahead in the city of Évora. First, several comparative studies were carried out between random forest (RF), gradient boosting (GB), support vector machines (SVM), and artificial neural network (ANN). These comparisons were made using annual, seasonal, and daily testing sets in order to determine the best ML algorithm under different meteorological conditions. Subsequently, the daily classification technique was applied to split the original training set into sunny and cloudy training subsets in order to enhance the forecasting accuracy. The proposed ML algorithms were evaluated using the normalized root mean square error (nRMSE) and the normalized mean absolute error (nMAE). The results of the seasonal comparison show that the RF model performs well for the spring and autumn seasons, with nRMSE equaling 22.53% and 23.42%, respectively, while the SVR model gives good results for the winter and summer seasons, with nRMSE equaling 24.31% and 8.41%, respectively. In addition, the daily comparison demonstrates that the RF model performs well for cloudy days with nRMSE = 41.40%, while the SVR model yields good results for sunny days with nRMSE = 8.88%. The results show that the daily classification technique enhances the forecasting accuracy of ML models. Furthermore, this study demonstrates that the forecasting accuracy of ML algorithms depends significantly on sky conditions.
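For readers unfamiliar with the two error metrics, the following minimal sketch computes nRMSE and nMAE as percentages. It assumes normalization by the mean of the observed values, which is one common convention; the abstract does not state which normalization the authors used, and the sample values are invented.

```python
import math


def nrmse(observed, forecast):
    """RMSE divided by the mean observation, in percent (assumed convention)."""
    n = len(observed)
    rmse = math.sqrt(sum((f - o) ** 2 for o, f in zip(observed, forecast)) / n)
    return 100.0 * rmse / (sum(observed) / n)


def nmae(observed, forecast):
    """MAE divided by the mean observation, in percent (assumed convention)."""
    n = len(observed)
    mae = sum(abs(f - o) for o, f in zip(observed, forecast)) / n
    return 100.0 * mae / (sum(observed) / n)


# Invented hourly irradiance values, for illustration only.
observed = [100.0, 200.0, 300.0, 400.0]
forecast = [110.0, 190.0, 310.0, 390.0]

print(nrmse(observed, forecast), nmae(observed, forecast))
```

Both metrics are scale-free, which is what makes seasonal or sunny-versus-cloudy comparisons meaningful when the average irradiance differs between subsets.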

The transferability of random forest and support vector machine for estimating daily global solar radiation using sunshine duration over different climate zones

Theoretical and Applied Climatology ◽

10.1007/s00704-021-03726-6 ◽

2021 ◽

Author(s):

Wei Wu ◽

Mao-Fen Li ◽

Xia Xu ◽

Xiao-Ping Tang ◽

Chao Yang ◽

...

Keyword(s):

Support Vector Machine ◽

Random Forest ◽

Solar Radiation ◽

Sunshine Duration ◽

Global Solar Radiation ◽

Support Vector ◽

Climate Zones

Establishing a Credit Risk Evaluation System for SMEs Using the Soft Voting Fusion Model

Risks ◽

10.3390/risks9110202 ◽

2021 ◽

Vol 9 (11) ◽

pp. 202

Author(s):

Ge Gao ◽

Hongxin Wang ◽

Pengbin Gao

Keyword(s):

Credit Risk ◽

Evaluation System ◽

Predictive Accuracy ◽

Assessment System ◽

Gradient Boosting ◽

Support Vector ◽

Fusion Model ◽

Light Gradient ◽

Extreme Gradient Boosting ◽

The Government

In China, SMEs are facing financing difficulties, and commercial banks and financial institutions are the main financing channels for SMEs. Thus, a reasonable and efficient credit risk assessment system is important for credit markets. Based on traditional statistical methods and AI technology, a soft voting fusion model, which incorporates logistic regression, support vector machine (SVM), random forest (RF), eXtreme Gradient Boosting (XGBoost), and Light Gradient Boosting Machine (LightGBM), is constructed to improve the predictive accuracy of SMEs’ credit risk. To verify the feasibility and effectiveness of the proposed model, we use data from 123 SMEs nationwide that worked with a Chinese bank from 2016 to 2020, including financial information and default records. The results show that the accuracy of the soft voting fusion model is higher than that of a single machine learning (ML) algorithm, which provides a theoretical basis for the government to control credit risk in the future and offers important references for banks to make credit decisions.
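The idea of soft voting can be sketched in a few lines: each base model outputs class probabilities, the probabilities are averaged (optionally with weights), and the class with the highest average wins. The example below is a hedged illustration, not the authors' code; the function name `soft_vote` and the five probability pairs are hypothetical.

```python
def soft_vote(probabilities, weights=None):
    """Average per-model class probabilities; return (label, averaged)."""
    n_models = len(probabilities)
    if weights is None:
        weights = [1.0] * n_models
    total = sum(weights)
    n_classes = len(probabilities[0])
    averaged = [
        sum(w * p[c] for w, p in zip(weights, probabilities)) / total
        for c in range(n_classes)
    ]
    label = max(range(n_classes), key=lambda c: averaged[c])
    return label, averaged


# Hypothetical [P(no default), P(default)] from five base models
# for a single applicant.
model_probs = [
    [0.60, 0.40],
    [0.55, 0.45],
    [0.30, 0.70],
    [0.20, 0.80],
    [0.55, 0.45],
]

label, averaged = soft_vote(model_probs)
print(label, averaged)
```

In this invented case three of the five models lean toward "no default", so a majority (hard) vote would pick class 0, whereas the averaged probabilities favor class 1 because the two dissenting models are far more confident.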

Interpretable Machine Learning for Early Neurological Deterioration Prediction in Atrial Fibrillation-Related Stroke

10.21203/rs.3.rs-446890/v1 ◽

2021 ◽

Author(s):

Seong Hwan Kim ◽

Eun-Tae Jeon ◽

Sungwook Yu ◽

Kyungmi O ◽

Chi Kyung Kim ◽

...

Keyword(s):

Machine Learning ◽

Atrial Fibrillation ◽

Neurological Deterioration ◽

Gradient Boosting ◽

Support Vector ◽

Light Gradient ◽

Interpretable Machine Learning ◽

Extreme Gradient Boosting ◽

Early Neurological Deterioration ◽

Feature Importance

We aimed to develop a novel prediction model for early neurological deterioration (END) based on an interpretable machine learning (ML) algorithm for atrial fibrillation (AF)-related stroke and to evaluate the prediction accuracy and feature importance of ML models. Data from multi-center prospective stroke registries in South Korea were collected. After stepwise data preprocessing, we utilized logistic regression, support vector machine, extreme gradient boosting, light gradient boosting machine (LightGBM), and multilayer perceptron models. We used the Shapley additive explanations (SHAP) method to evaluate feature importance. Of the 3,623 stroke patients, the 2,363 who had arrived at the hospital within 24 hours of symptom onset and had available information regarding END were included. Of these, 318 (13.5%) had END. The LightGBM model showed the highest area under the receiver operating characteristic curve (0.778; 95% CI, 0.726–0.830). The feature importance analysis revealed that fasting glucose level and the National Institutes of Health Stroke Scale score were the most influential factors. Among ML algorithms, the LightGBM model was particularly useful for predicting END, as it revealed new and diverse predictors. Additionally, the SHAP method can be adjusted to individualize the features' effects on the predictive power of the model.
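To make the notion of per-feature Shapley attributions concrete, the sketch below computes exact Shapley values for a single instance by enumerating all feature coalitions. It is an illustration under assumptions: features outside a coalition are replaced by a single baseline vector, whereas SHAP implementations typically average over a background dataset and use faster approximations; `toy_model` and its coefficients are invented and carry no clinical meaning.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, background):
    """Exact Shapley values for instance x against one baseline vector."""
    n = len(x)

    def value(coalition):
        # Features in the coalition take the instance's values,
        # all others take the baseline values.
        mixed = [x[i] if i in coalition else background[i] for i in range(n)]
        return predict(mixed)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


def toy_model(features):
    # Invented model with an interaction term; the third feature is unused.
    a, b, _unused = features
    return 0.01 * a + 0.05 * b + 0.002 * a * b


x = [150.0, 10.0, 70.0]           # instance to explain (made-up values)
background = [100.0, 4.0, 65.0]   # baseline vector (made-up values)

phi = shapley_values(toy_model, x, background)
print(phi)
```

The attributions sum to the difference between the prediction for `x` and the prediction for the baseline, and the unused feature receives zero, which are two of the properties that make Shapley-based explanations attractive for individualized interpretation.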

Mapping of the Canopy Openings in Mixed Beech–Fir Forest at Sentinel-2 Subpixel Level Using UAV and Machine Learning Approach

Remote Sensing ◽

10.3390/rs12233925 ◽

2020 ◽

Vol 12 (23) ◽

pp. 3925

Author(s):

Ivan Pilaš ◽

Mateo Gašparović ◽

Alan Novkinić ◽

Damir Klobučar

Keyword(s):

Machine Learning ◽

Forest Canopy ◽

Vegetation Index ◽

Predictive Performance ◽

Spatial Extent ◽

Gradient Boosting ◽

Support Vector ◽

Stochastic Gradient Boosting ◽

Extreme Gradient Boosting ◽

Sentinel 2

The presented study demonstrates a bi-sensor approach suitable for rapid, precise, and up-to-date mapping of forest canopy gaps over a larger spatial extent. The approach uses Unmanned Aerial Vehicle (UAV) red, green, and blue (RGB) images of smaller areas to create a highly precise forest canopy mask. Sentinel-2 (S-2) was used as a scaling platform for transferring information from the UAV to a wider spatial extent. Various approaches to improving the predictive performance were examined: (I) the highest R² of a single satellite index was 0.57, (II) the highest R² using multiple features obtained from a single-date S-2 image was 0.624, and (III) the highest R² on the multitemporal set of S-2 images was 0.697. Satellite indices such as the Atmospherically Resistant Vegetation Index (ARVI), Infrared Percentage Vegetation Index (IPVI), Normalized Difference Index (NDI45), Pigment-Specific Simple Ratio Index (PSSRa), Modified Chlorophyll Absorption Ratio Index (MCARI), Color Index (CI), Redness Index (RI), and Normalized Difference Turbidity Index (NDTI) were the dominant predictors in most of the Machine Learning (ML) algorithms. The more complex ML algorithms, such as Support Vector Machines (SVM), Random Forest (RF), Stochastic Gradient Boosting (GBM), Extreme Gradient Boosting (XGBoost), and CatBoost, provided the best performance on the training set but exhibited weaker generalization capabilities. Therefore, the simpler and more robust Elastic Net (ENET) algorithm was chosen for the final map creation.

Empirical and machine learning models for predicting daily global solar radiation from sunshine duration: A review and case study in China

Renewable and Sustainable Energy Reviews ◽

10.1016/j.rser.2018.10.018 ◽

2019 ◽

Vol 100 ◽

pp. 186-212 ◽

Cited By ~ 68

Author(s):

Junliang Fan ◽

Lifeng Wu ◽

Fucang Zhang ◽

Huanjie Cai ◽

Wenzhi Zeng ◽

...

Keyword(s):

Machine Learning ◽

Solar Radiation ◽

Sunshine Duration ◽

Global Solar Radiation ◽

Learning Models ◽

Machine Learning Models

Multistep-Ahead Solar Radiation Forecasting Scheme Based on the Light Gradient Boosting Machine: A Case Study of Jeju Island

Remote Sensing ◽

10.3390/rs12142271 ◽

2020 ◽

Vol 12 (14) ◽

pp. 2271 ◽

Cited By ~ 2

Author(s):

Jinwoong Park ◽

Jihoon Moon ◽

Seungmin Jung ◽

Eenjun Hwang

Keyword(s):

Solar Radiation ◽

Global Solar Radiation ◽

Jeju Island ◽

Gradient Boosting ◽

Probabilistic Forecasting ◽

Training Time ◽

Light Gradient ◽

Proposed Model ◽

Gradient Boosting Machine ◽

Time Problem

Smart islands have focused on renewable energy sources, such as solar and wind, to achieve energy self-sufficiency. Because solar photovoltaic (PV) power has the advantage of less noise and easier installation than wind power, it is more flexible in selecting a location for installation. A PV power system can be operated more efficiently by predicting the amount of global solar radiation for solar power generation. Thus far, most studies have addressed day-ahead probabilistic forecasting to predict global solar radiation. However, day-ahead probabilistic forecasting has limitations in responding quickly to sudden changes in the external environment. Although multistep-ahead (MSA) forecasting can be used for this purpose, traditional machine learning models are unsuitable because of the substantial training time. In this paper, we propose an accurate MSA global solar radiation forecasting model based on the light gradient boosting machine (LightGBM), which can handle the training-time problem and provide higher prediction performance compared to other boosting methods. To demonstrate the validity of the proposed model, we conducted a global solar radiation prediction for two regions on Jeju Island, the largest island in South Korea. The experiment results demonstrated that the proposed model can achieve better predictive performance than the tree-based ensemble and deep learning methods.

Computational analysis and prediction of lysine malonylation sites by exploiting informative features in an integrative machine-learning framework

Briefings in Bioinformatics ◽

10.1093/bib/bby079 ◽

2018 ◽

Vol 20 (6) ◽

pp. 2185-2199 ◽

Cited By ~ 32

Author(s):

Yanju Zhang ◽

Ruopeng Xie ◽

Jiawei Wang ◽

André Leier ◽

Tatiana T Marquez-Lago ◽

...

Keyword(s):

Machine Learning ◽

Computational Methods ◽

Prediction Models ◽

Gradient Boosting ◽

Support Vector ◽

Post Translational Modification ◽

K Nearest Neighbor ◽

Ensemble Models ◽

Light Gradient ◽

Optimal Ensemble

As a newly discovered post-translational modification (PTM), lysine malonylation (Kmal) regulates a myriad of cellular processes from prokaryotes to eukaryotes and has important implications in human diseases. Despite its functional significance, computational methods to accurately identify malonylation sites are still lacking and urgently needed. In particular, there is currently no comprehensive analysis and assessment of different features and machine learning (ML) methods that are required for constructing the necessary prediction models. Here, we review, analyze and compare 11 different feature encoding methods, with the goal of extracting key patterns and characteristics from residue sequences of Kmal sites. We identify optimized feature sets, with which four commonly used ML methods (random forest, support vector machines, K-nearest neighbor and logistic regression) and one recently proposed [Light Gradient Boosting Machine (LightGBM)] are trained on data from three species, namely, Escherichia coli, Mus musculus and Homo sapiens, and compared using randomized 10-fold cross-validation tests. We show that integration of the single method-based models through ensemble learning further improves the prediction performance and model robustness on the independent test. When compared to the existing state-of-the-art predictor, MaloPred, the optimal ensemble models were more accurate for all three species (AUC: 0.930, 0.923 and 0.944 for E. coli, M. musculus and H. sapiens, respectively). Using the ensemble models, we developed an accessible online predictor, kmal-sp, available at http://kmalsp.erc.monash.edu/. We hope that this comprehensive survey and the proposed strategy for building more accurate models can serve as a useful guide for inspiring future developments of computational methods for PTM site prediction, expedite the discovery of new malonylation and other PTM types and facilitate hypothesis-driven experimental validation of novel malonylated substrates and malonylation sites.
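The randomized 10-fold cross-validation mentioned above can be sketched as follows. This is a generic standard-library illustration, not the authors' pipeline: the helper name `k_fold_indices`, the seed, and the sample count of 25 are assumptions chosen for the example.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists for randomized k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(25, k=10))
for train, test in splits:
    print(len(train), len(test))
```

Each sample appears in exactly one test fold, so every model is evaluated once on data it was not trained on, and the per-fold scores can then be averaged.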

Global solar radiation modeling using different machine learning and empirical models in Northeast China

10.21203/rs.3.rs-422151/v1 ◽

2021 ◽

Author(s):

Yue Jia ◽

Yongjun Su ◽

Fengchun Wang ◽

Pengcheng Li ◽

Shuyi Huo

Keyword(s):

Machine Learning ◽

Solar Radiation ◽

Solar Energy ◽

Northeast China ◽

Sunshine Duration ◽

Meteorological Data ◽

Global Solar Radiation ◽

Empirical Models ◽

Learning Models ◽

Machine Learning Models

Reliable global solar radiation (Rs) information is crucial for the design and management of solar energy systems for agricultural and industrial production. However, Rs measurements are unavailable in many regions of the world, which impedes the development and application of solar energy. To accurately estimate Rs, this study developed a novel machine learning model, called a Gaussian exponential model (GEM), for daily Rs estimation. To assess its applicability, the GEM was compared with four other machine learning models and two empirical models using daily meteorological data from 1997–2016 from four stations in Northeast China. The results showed that the GEM with complete inputs had the best performance. Machine learning models provided better estimates than empirical models when trained on the same input data. Sunshine duration was the most influential factor in determining the accuracy of the machine learning models. Overall, the GEM with complete inputs had the highest accuracy and is recommended for modeling daily Rs in Northeast China.
