Mineral grade estimation using gradient boosting regression trees

Author(s):  
Umit Emrah Kaplan ◽  
Yasin Dagasan ◽  
Erkan Topal
Author(s):  
Yu Shi ◽  
Jian Li ◽  
Zhize Li

Gradient Boosted Decision Trees (GBDT) is a very successful ensemble learning algorithm widely used across a variety of applications. Recently, several variants of GBDT training algorithms and implementations have been designed and heavily optimized in popular open-source toolkits, including XGBoost, LightGBM and CatBoost. In this paper, we show that both the accuracy and efficiency of GBDT can be further enhanced by using more complex base learners. Specifically, we extend gradient boosting to use piecewise linear regression trees (PL Trees), instead of piecewise constant regression trees, as base learners. We show that PL Trees can accelerate the convergence of GBDT and improve its accuracy. We also propose some optimization tricks that substantially reduce the training time of PL Trees with little sacrifice of accuracy. Moreover, we propose several implementation techniques to speed up our algorithm on modern computer architectures with powerful Single Instruction Multiple Data (SIMD) parallelism. The experimental results show that GBDT with PL Trees can provide very competitive testing accuracy with comparable or less training time.
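
To make the contrast between piecewise constant and piecewise linear base learners concrete, here is a minimal one-dimensional sketch. It is not the authors' algorithm: it assumes squared-error loss, depth-1 trees (stumps), a single feature, and at least two distinct feature values. All function names (`fit_line`, `fit_const`, `fit_stump`, `boost`) and the toy target are hypothetical.

```python
import math

def fit_const(xs, rs):
    """Piecewise-constant leaf: predict the mean residual (slope 0)."""
    return 0.0, sum(rs) / len(rs)

def fit_line(xs, rs):
    """Piecewise-linear leaf: least-squares line r ~ a*x + b.
    Falls back to a constant when x has no spread inside the leaf."""
    n = len(xs)
    mx, mr = sum(xs) / n, sum(rs) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0.0:
        return 0.0, mr
    a = sum((x - mx) * (r - mr) for x, r in zip(xs, rs)) / sxx
    return a, mr - a * mx

def fit_stump(xs, rs, leaf_fit):
    """Depth-1 tree: pick the threshold whose two leaf models
    give the smallest total squared error on the residuals."""
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        sse, models = 0.0, []
        for keep_left in (True, False):
            part = [(x, r) for x, r in zip(xs, rs) if (x <= t) == keep_left]
            a, b = leaf_fit([x for x, _ in part], [r for _, r in part])
            sse += sum((r - (a * x + b)) ** 2 for x, r in part)
            models.append((a, b))
        if best is None or sse < best[0]:
            best = (sse, t, models[0], models[1])
    return best[1], best[2], best[3]

def predict_stump(stump, x):
    t, (al, bl), (ar, br) = stump
    return al * x + bl if x <= t else ar * x + br

def boost(xs, ys, leaf_fit, n_rounds=20, lr=0.5):
    """Gradient boosting for squared error: each round fits a stump
    to the current residuals and adds a shrunken copy of it."""
    preds = [sum(ys) / len(ys)] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals, leaf_fit)
        stumps.append(stump)
        preds = [p + lr * predict_stump(stump, x) for p, x in zip(preds, xs)]
    mse = sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)
    return stumps, mse

# Toy target (illustrative only, not data from the paper).
xs = [i / 20 for i in range(41)]
ys = [math.sin(3 * x) + 0.5 * x for x in xs]
_, mse_const = boost(xs, ys, fit_const)
_, mse_linear = boost(xs, ys, fit_line)
print(f"training MSE, constant leaves: {mse_const:.5f}")
print(f"training MSE, linear leaves:   {mse_linear:.5f}")
```

Comparing the two printed training errors after the same number of rounds gives a rough feel for how richer leaf models can change convergence; it says nothing about generalization or about the training-time optimizations described in the abstract.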


2020 ◽  
Vol 24 (5) ◽  
pp. 2343-2363
Author(s):  
Shengli Liao ◽  
Zhanwei Liu ◽  
Benxi Liu ◽  
Chuntian Cheng ◽  
Xinfeng Jin ◽  
...  

Abstract. Inflow forecasting plays an essential role in reservoir management and operation. The impacts of climate change and human activities have made accurate inflow prediction increasingly difficult, especially for longer lead times. In this study, a new hybrid inflow forecast framework – using the ERA-Interim reanalysis data set as input and adopting gradient-boosting regression trees (GBRT) and the maximal information coefficient (MIC) – is developed for multistep-ahead daily inflow forecasting. Firstly, the ERA-Interim reanalysis data set provides more information for the framework, allowing it to discover inflow for longer lead times. Secondly, MIC can identify an effective feature subset from massive features that significantly affects inflow; therefore, the framework can reduce computational burden, distinguish key attributes from unimportant ones and provide a concise understanding of inflow. Lastly, GBRT is a prediction model in the form of an ensemble of decision trees, and it has a strong ability to more fully capture nonlinear relationships between input and output at longer lead times. The Xiaowan hydropower station, located in Yunnan Province, China, was selected as the study area. Six evaluation criteria, namely the mean absolute error (MAE), the root-mean-squared error (RMSE), the Pearson correlation coefficient (CORR), Kling–Gupta efficiency (KGE) scores, the percent bias in the flow duration curve high-segment volume (BHV) and the index of agreement (IA) are used to evaluate the established models utilizing historical daily inflow data (1 January 2017–31 December 2018). The performance of the presented framework is compared to that of artificial neural network (ANN), support vector regression (SVR) and multiple linear regression (MLR) models. 
The results indicate that reanalysis data enhance the accuracy of inflow forecasting for all of the lead times studied (1–10 d), and the method developed generally performs better than other models, especially for extreme values and longer lead times (4–10 d).
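
Several of the evaluation criteria named above can be written down in a few lines. The sketch below assumes the commonly used definitions: the 2009 form of the Kling–Gupta efficiency and Willmott's index of agreement. The paper may use variants, and BHV is omitted because it depends on how the high-flow segment of the flow duration curve is defined. The sample values are illustrative only, not data from the study.

```python
import math

def mean(v):
    return sum(v) / len(v)

def std(v):
    """Population standard deviation."""
    m = mean(v)
    return math.sqrt(mean([(x - m) ** 2 for x in v]))

def mae(obs, sim):
    return mean([abs(o - s) for o, s in zip(obs, sim)])

def rmse(obs, sim):
    return math.sqrt(mean([(o - s) ** 2 for o, s in zip(obs, sim)]))

def corr(obs, sim):
    """Pearson correlation coefficient."""
    mo, ms = mean(obs), mean(sim)
    cov = sum((o - mo) * (s - ms) for o, s in zip(obs, sim))
    so = sum((o - mo) ** 2 for o in obs)
    ss = sum((s - ms) ** 2 for s in sim)
    return cov / math.sqrt(so * ss)

def kge(obs, sim):
    """Kling-Gupta efficiency (assumed 2009 form): combines correlation,
    variability ratio and bias ratio; 1 is a perfect score."""
    r = corr(obs, sim)
    alpha = std(sim) / std(obs)
    beta = mean(sim) / mean(obs)
    return 1 - math.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)

def index_of_agreement(obs, sim):
    """Willmott's index of agreement (assumed definition); 1 is perfect."""
    mo = mean(obs)
    num = sum((o - s) ** 2 for o, s in zip(obs, sim))
    den = sum((abs(s - mo) + abs(o - mo)) ** 2 for o, s in zip(obs, sim))
    return 1 - num / den

# Illustrative inflow values only.
observed = [100.0, 120.0, 150.0, 130.0, 110.0]
simulated = [98.0, 125.0, 140.0, 135.0, 112.0]
for name, fn in [("MAE", mae), ("RMSE", rmse), ("CORR", corr),
                 ("KGE", kge), ("IA", index_of_agreement)]:
    print(name, round(fn(observed, simulated), 4))
```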


2019 ◽  
Author(s):  
Shengli Liao ◽  
Zhanwei Liu ◽  
Benxi Liu ◽  
Chuntian Cheng ◽  
Xinfeng Jin ◽  
...  

Abstract. Inflow forecasting plays an essential role in reservoir management and operation. The impacts of climate change and human activities make accurate inflow prediction increasingly difficult, especially for longer lead times. In this study, a new hybrid inflow forecast framework, with ERA-Interim reanalysis data as input and adopting gradient boosting regression trees (GBRT) and the maximal information coefficient (MIC), was developed for multi-step-ahead daily inflow forecasting. Firstly, the ERA-Interim reanalysis dataset provides enough information for the framework to discover inflow for longer lead times. Secondly, MIC can identify, from massive features, an effective feature subset that significantly affects inflow, so that the framework can avoid over-fitting, distinguish key attributes from unimportant ones and provide a concise understanding of inflow. Lastly, GBRT is a prediction model in the form of an ensemble of decision trees and has a strong ability to capture nonlinear relationships between input and output more fully at long lead times. The Xiaowan hydropower station, located in Yunnan Province, China, was selected as the study area. Four evaluation criteria, the mean absolute error (MAE), the root mean square error (RMSE), the Nash-Sutcliffe efficiency coefficient (NSE) and the Pearson correlation coefficient (CORR), were used to evaluate the established models using historical daily inflow data (1/1/2017–31/12/2018). The performance of the presented framework was compared to that of artificial neural network (ANN), support vector regression (SVR) and multiple linear regression (MLR) models. The experimental results indicate that the developed method generally performs better than the other models and significantly improves the accuracy of inflow forecasting at lead times of 5–10 days. The reanalysis data also enhance the accuracy of inflow forecasting, except for one-day-ahead forecasts.
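
The Nash-Sutcliffe efficiency mentioned above compares a model's squared error with that of a baseline that always predicts the observed mean. A minimal sketch under the standard definition follows; the function name and sample values are illustrative, not taken from the study.

```python
def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 is a perfect fit, 0 means the model
    is no better than predicting the observed mean, negative is worse."""
    mean_obs = sum(obs) / len(obs)
    model_error = sum((o - s) ** 2 for o, s in zip(obs, sim))
    baseline_error = sum((o - mean_obs) ** 2 for o in obs)
    return 1 - model_error / baseline_error

# Illustrative inflow values only.
observed = [100.0, 120.0, 150.0, 130.0, 110.0]
simulated = [98.0, 125.0, 140.0, 135.0, 112.0]
print(round(nse(observed, simulated), 4))
```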


2020 ◽  
Author(s):  
Miae Kim ◽  
Jan Cermak ◽  
Hendrik Andersen ◽  
Julia Fuchs ◽  
Roland Stirnberg

This contribution presents a technique for the machine-learning-based retrieval of cloud liquid water path. Cloud effects are among the major uncertainties in climate models for estimating and predicting the Earth’s energy budget. The study of cloud processes requires information on cloud physical properties, such as the liquid water path (LWP), which is commonly retrieved from satellite sensors using look-up table approaches. However, the accuracy of LWP varies temporally and spatially, also due to assumptions inherent in any physical retrieval. The aim of this study is to improve the accuracy of LWP and analyze quantitatively the accuracy and its errors. To this end, a statistical LWP retrieval was developed using spectral information from geostationary satellite channels (Meteosat Spinning-Enhanced Visible and Infrared Imager, SEVIRI), and satellite viewing geometry. The machine-learning method chosen is gradient-boosted regression trees (GBRTs), which is an ensemble of decision trees but more effective than traditional tree-based models. This study reports on first results, as well as a comparison between the GBRT-derived LWP estimates and those from the SEVIRI-based products of the Climate Monitoring Satellite Application Facility (CM-SAF, CLAAS-A2), as well as MODIS products. We use case studies for individual in-situ measurement sites in Europe under varying meteorological conditions to determine the factors influencing LWP retrieval quality.


2019 ◽  
Vol 40 (Supplement_1) ◽  
Author(s):  
A Agibetov ◽  
B Seirer ◽  
S Aschauer ◽  
D Dalos ◽  
R Rettl ◽  
...  

Abstract. Background/Introduction: Cardiac amyloidosis (CA) is a rare and complex condition with poor prognosis. Novel therapies have been shown to improve outcome; however, most of the affected individuals remain undiagnosed, mainly due to a lack of awareness among clinicians. One approach to overcome this issue is to use automated diagnostic algorithms that act based on routinely available laboratory results. Purpose: We tested the performance of flexible machine learning and traditional statistical prediction models for non-invasive CA diagnosis based on routinely collected laboratory parameters. Since laboratory routines vary between hospitals or other health care providers, special attention was paid to adaptive and dynamic parameter selection, and to dealing with the frequent occurrence of missing values. Methods: Our cohort consisted of 376 clinically accepted patients with various types of heart failure. Of these, 69 were diagnosed with CA via endomyocardial biopsy (positives), and 307 had unrelated cardiac disorders (negatives). A total of 63 routine laboratory parameters were collected from these patients, with a high incidence of missing values (on average 60% of patients for each parameter). We tested the performance of two prediction models: logistic regression, and extreme gradient boosting with regression trees. To deal with missing values we adopted two strategies: a) finding an optimal overlap of parameters and deleting all patients with missing values (reduction of parameters and samples), and b) retaining all features and imputing missing values with parameter-wise means. To fairly assess the performance of the prediction models we employed 10-fold cross validation (stratified to preserve the sample class ratio). Finally, the area under the receiver-operator characteristic curve (ROC AUC) was used as our final performance measure.
Results: A complex machine learning model based on forests of regression trees proved to be the most performant (ROC AUC 0.94±4%) and robust to missing values. The best regression model was obtained with the 25 most frequent variables and patient deletion in case of missing values (ROC AUC 0.82±0.8%). While progressive inclusion of predictor variables worsened the performance of the logistic regression, it increased that of the machine learning approach. Conclusions: Extreme gradient boosting of regression trees on routine laboratory parameters achieved staggering accuracy results for the automated diagnosis of CA. Our data suggest that implementations of such algorithms as independent interpreters of routine laboratory results may help to establish or suggest the diagnosis of CA in patients with heart failure symptoms, even in the absence of specialized experts.
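
The evaluation protocol described in the Methods (parameter-wise mean imputation, stratified 10-fold cross validation, ROC AUC) can be sketched with the standard library alone. This is an illustration under assumptions, not the authors' pipeline: all function names are hypothetical, missing values are represented as `None`, and the choice to compute column means on the training fold only is an assumption made here to avoid leakage, which the abstract does not specify.

```python
import random

def column_means(train_rows):
    """Per-parameter mean over observed (non-None) training values."""
    means = []
    for j in range(len(train_rows[0])):
        seen = [r[j] for r in train_rows if r[j] is not None]
        means.append(sum(seen) / len(seen) if seen else 0.0)
    return means

def impute(rows, means):
    """Replace each missing value with the corresponding column mean."""
    return [[means[j] if v is None else v for j, v in enumerate(r)]
            for r in rows]

def stratified_folds(labels, k, seed=0):
    """Split indices into k folds while preserving the class ratio:
    each class is shuffled and dealt round-robin across the folds."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds

def roc_auc(labels, scores):
    """Probability that a random positive is scored above a random
    negative, counting ties as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Class counts from the abstract: 69 positives, 307 negatives.
labels = [1] * 69 + [0] * 307
folds = stratified_folds(labels, k=10)
for f, fold in enumerate(folds):
    n_pos = sum(labels[i] for i in fold)
    print(f"fold {f}: {len(fold)} patients, {n_pos} positives")
```

In a full loop, each fold would serve once as the test set, with `column_means` computed on the remaining nine folds and passed to `impute` for both the training and test rows before fitting a classifier.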


Materials ◽  
2020 ◽  
Vol 13 (19) ◽  
pp. 4331
Author(s):  
Itzel Nunez ◽  
Afshin Marani ◽  
Moncef L. Nehdi

Recycled aggregate concrete (RAC) contributes to mitigating the depletion of natural aggregates, alleviating the carbon footprint of concrete construction, and averting the landfilling of colossal amounts of construction and demolition waste. However, complexities in the mixture optimization of RAC due to the variability of recycled aggregates and lack of accuracy in estimating its compressive strength require novel and sophisticated techniques. This paper aims at developing state-of-the-art machine learning models to predict the RAC compressive strength and optimize its mixture design. Results show that the developed models, including Gaussian processes, deep learning, and gradient boosting regression, achieved robust predictive performance, with the gradient boosting regression trees yielding the highest prediction accuracy. Furthermore, a particle swarm optimization coupled with gradient boosting regression trees model was developed to optimize the mixture design of RAC for various compressive strength classes. The hybrid model achieved cost-saving RAC mixture designs with lower environmental footprint for different target compressive strength classes. The model could be further harnessed to achieve sustainable concrete with optimal recycled aggregate content, least cost, and least environmental footprint.
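
The coupling of a particle swarm optimizer with a strength predictor can be illustrated with a small sketch. Everything here is an assumption made for illustration: `predicted_strength` is a hypothetical stand-in for a trained gradient boosting model, the cost coefficients and bounds are invented toy numbers rather than values from the paper, and the swarm update is the textbook inertia-weight form, not necessarily the variant the authors used.

```python
import random

def pso_minimize(objective, bounds, n_particles=20, n_iters=50, seed=0,
                 inertia=0.7, c_personal=1.5, c_global=1.5):
    """Textbook particle swarm: each particle is pulled toward its own
    best position and the swarm's best; positions are clamped to bounds."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds]
           for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    best_pos = [p[:] for p in pos]
    best_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: best_val[i])
    g_pos, g_val = best_pos[g][:], best_val[g]
    history = [g_val]
    for _ in range(n_iters):
        for i in range(n_particles):
            for d, (lo, hi) in enumerate(bounds):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (inertia * vel[i][d]
                             + c_personal * r1 * (best_pos[i][d] - pos[i][d])
                             + c_global * r2 * (g_pos[d] - pos[i][d]))
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = objective(pos[i])
            if val < best_val[i]:
                best_pos[i], best_val[i] = pos[i][:], val
                if val < g_val:
                    g_pos, g_val = pos[i][:], val
        history.append(g_val)
    return g_pos, g_val, history

def predicted_strength(mix):
    """Hypothetical surrogate standing in for a trained GBRT model."""
    cement, recycled_fraction = mix
    return 0.1 * cement * (1.0 - 0.3 * recycled_fraction)

def mix_cost(mix, target=30.0):
    """Toy cost: cement is expensive, recycled aggregate is cheaper,
    and missing the target strength is heavily penalized."""
    cement, recycled_fraction = mix
    cost = 0.1 * cement - 5.0 * recycled_fraction
    shortfall = max(0.0, target - predicted_strength(mix))
    return cost + 100.0 * shortfall

# Toy search space: cement content and recycled aggregate fraction.
bounds = [(200.0, 500.0), (0.0, 1.0)]
best_mix, best_cost, history = pso_minimize(mix_cost, bounds)
print("best mix:", [round(v, 3) for v in best_mix])
print("best cost:", round(best_cost, 3))
```

Handling the strength target as a penalty term keeps the optimizer unconstrained; a real application would replace the surrogate with the fitted model's prediction function and use measured cost and footprint data.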


Author(s):  
Yu. I. Zhuravlev ◽  
O. V. Senko ◽  
A. A. Dokukin ◽  
N. N. Kiselyova ◽  
I. A. Saenko

Abstract The article discusses a new two-level regression analysis method in which a corrective procedure is applied to optimal ensembles of regression trees. Optimization is carried out based on the simultaneous achievement of the divergence of the algorithms in the forecast space and a good approximation of the data by individual algorithms of the ensemble. Simple averaging, random regression forest, and gradient boosting are used as corrective procedures. Experiments are presented comparing the proposed method with the standard decision forest and the standard gradient boosting method for decision trees.

