A Modified Bayesian Optimization based Hyper-Parameter Tuning Approach for Extreme Gradient Boosting

Machine learning techniques lend themselves as promising decision-making and analytic tools in a wide range of applications. Different ML algorithms have various hyper-parameters. In order to tailor an ML model towards a specific application, a large number of hyper-parameters should be tuned. Tuning the hyper-parameters directly affects the performance (accuracy and run-time). However, for large-scale search spaces, efficiently exploring the ample number of combinations of hyper-parameters is computationally challenging. Existing automated hyper-parameter tuning techniques suffer from high time complexity. In this paper, we propose HyP-ABC, an automatic innovative hybrid hyper-parameter optimization algorithm using the modified artificial bee colony approach, to measure the classification accuracy of three ML algorithms, namely random forest, extreme gradient boosting, and support vector machine. Compared to the state-of-the-art techniques, HyP-ABC is more efficient and has a limited number of parameters to be tuned, making it worthwhile for real-world hyper-parameter optimization problems. We further compare our proposed HyP-ABC algorithm with state-of-the-art techniques. In order to ensure the robustness of the proposed method, the algorithm takes a wide range of feasible hyper-parameter values, and is tested using a real-world educational dataset.

HyP-ABC: A Novel Automated Hyper-Parameter Tuning Algorithm Using Evolutionary Optimization

10.36227/techrxiv.14714508.v3 ◽

2021 ◽

Author(s):

Leila Zahedi ◽

Farid Ghareh Mohammadi ◽

M. Hadi Amini

Keyword(s):

Real World ◽

Large Scale ◽

Convergence Rates ◽

Parameter Tuning ◽

Population Based ◽

Machine Learning Techniques ◽

Gradient Boosting ◽

Support Vector ◽

Wide Range ◽

Extreme Gradient Boosting

Machine learning techniques lend themselves as promising decision-making and analytic tools in a wide range of applications. Different ML algorithms have various hyper-parameters. In order to tailor an ML model towards a specific application working at its best, its hyper-parameters should be tuned. Tuning the hyper-parameters directly affects the performance. However, for large-scale search spaces, efficiently exploring the ample number of combinations of hyper-parameters is computationally expensive. Many of the automated hyper-parameter tuning techniques suffer from low convergence rates and high experimental time complexities. In this paper, we propose HyP-ABC, an automatic innovative hybrid hyper-parameter optimization algorithm using the modified artificial bee colony approach, to measure the classification accuracy of three ML algorithms: random forest, extreme gradient boosting, and support vector machine. In order to ensure the robustness of the proposed method, the algorithm takes a wide range of feasible hyper-parameter values and is tested using a real-world educational dataset. Experimental results show that HyP-ABC is competitive with state-of-the-art techniques. Also, it has fewer hyper-parameters to be tuned than other population-based algorithms, making it worthwhile for real-world HPO problems.
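
The abstract above names an artificial bee colony (ABC) search as the basis of HyP-ABC but gives no algorithmic detail. As a rough illustration only, the following sketch shows a textbook-style ABC loop (employed bees, onlooker bees, scouts) minimizing a toy objective. It is not the authors' modified algorithm. The function names, the bounds, and the toy "validation error" (which treats a learning rate and a tree depth as continuous values) are all hypothetical, and the fitness weighting assumes a non-negative cost.

```python
import random


def abc_minimize(objective, bounds, n_sources=10, limit=5, n_iter=50, seed=0):
    """Textbook-style artificial bee colony search over a box (illustrative only)."""
    rng = random.Random(seed)
    dim = len(bounds)

    def random_source():
        return [rng.uniform(lo, hi) for lo, hi in bounds]

    def try_improve(i):
        # Perturb one coordinate of source i relative to another random source.
        k = rng.choice([j for j in range(n_sources) if j != i])
        d = rng.randrange(dim)
        cand = sources[i][:]
        cand[d] += rng.uniform(-1.0, 1.0) * (sources[i][d] - sources[k][d])
        lo, hi = bounds[d]
        cand[d] = min(max(cand[d], lo), hi)  # keep the candidate inside the box
        cost = objective(cand)
        if cost < costs[i]:  # greedy selection
            sources[i], costs[i], trials[i] = cand, cost, 0
        else:
            trials[i] += 1

    def memorize_best():
        nonlocal best, best_cost
        i = min(range(n_sources), key=costs.__getitem__)
        if costs[i] < best_cost:
            best, best_cost = sources[i][:], costs[i]

    sources = [random_source() for _ in range(n_sources)]
    costs = [objective(s) for s in sources]
    trials = [0] * n_sources
    best, best_cost = None, float("inf")
    memorize_best()

    for _ in range(n_iter):
        for i in range(n_sources):  # employed bees: one trial per source
            try_improve(i)
        weights = [1.0 / (1.0 + c) for c in costs]  # assumes cost >= 0
        for _ in range(n_sources):  # onlookers: favor better sources
            try_improve(rng.choices(range(n_sources), weights=weights)[0])
        memorize_best()  # record the best before scouts can overwrite it
        for i in range(n_sources):  # scouts: abandon stagnant sources
            if trials[i] > limit:
                sources[i] = random_source()
                costs[i] = objective(sources[i])
                trials[i] = 0
    memorize_best()
    return best, best_cost


# Hypothetical stand-in for a cross-validated error surface.
def toy_validation_error(params):
    learning_rate, max_depth = params
    return (learning_rate - 0.1) ** 2 + ((max_depth - 6.0) / 10.0) ** 2


BOUNDS = [(0.01, 0.5), (2.0, 12.0)]
best_params, best_error = abc_minimize(toy_validation_error, BOUNDS, seed=1)
print(best_params, best_error)
```

In a real tuning run the toy objective would be replaced by a cross-validated score of the model being tuned, which is where most of the run time goes.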

Comparison of the performance of machine learning algorithms in breast cancer screening and detection: A protocol

Journal of Public Health Research ◽

10.4081/jphr.2019.1677 ◽

2019 ◽

Vol 8 (3) ◽

Cited By ~ 2

Author(s):

Zakia Salod ◽

Yashik Singh

Keyword(s):

Breast Cancer ◽

Screening Method ◽

Parameter Tuning ◽

Developed Countries ◽

Machine Learning Algorithms ◽

Gradient Boosting ◽

K Nearest Neighbors ◽

Adaptive Boosting ◽

Triple Assessment ◽

Extreme Gradient Boosting

Background: Breast Cancer (BC) is a known global crisis. The World Health Organization reports a global 2.09 million incidences and 627,000 deaths in 2018 relating to BC. The traditional BC screening method in developed countries is mammography, whilst developing countries employ breast self-examination and clinical breast examination. The prominent gold standard for BC detection is triple assessment: i) clinical examination, ii) mammography and/or ultrasonography; and iii) Fine Needle Aspirate Cytology. However, the introduction of cheaper, efficient and non-invasive methods of BC screening and detection would be beneficial. Design and methods: We propose the use of eight machine learning algorithms: i) Logistic Regression; ii) Support Vector Machine; iii) K-Nearest Neighbors; iv) Decision Tree; v) Random Forest; vi) Adaptive Boosting; vii) Gradient Boosting; viii) eXtreme Gradient Boosting, and blood test results using BC Coimbra Dataset (BCCD) from University of California Irvine online database to create models for BC prediction. To ensure the models’ robustness, we will employ: i) Stratified k-fold Cross-Validation; ii) Correlation-based Feature Selection (CFS); and iii) parameter tuning. The models will be validated on validation and test sets of BCCD for full features and reduced features. Feature reduction has an impact on algorithm performance. Seven metrics will be used for model evaluation, including accuracy. Expected impact of the study for public health: The CFS together with highest performing model(s) can serve to identify important specific blood tests that point towards BC, which may serve as an important BC biomarker. Highest performing model(s) may eventually be used to create an Artificial Intelligence tool to assist clinicians in BC screening and detection.
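
The protocol above lists stratified k-fold cross-validation among its robustness measures. As a minimal sketch of the idea, assuming nothing about the authors' actual tooling, the hypothetical helper below deals the indices of each class round-robin into k folds so that every test fold keeps roughly the overall class ratio. The label list is toy data, not the BCCD.

```python
import random
from collections import defaultdict


def stratified_kfold(labels, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs whose class mix roughly matches `labels`."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class round-robin so every fold gets a near-equal share of it.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)

    for i in range(k):
        test_idx = sorted(folds[i])
        train_idx = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        yield train_idx, test_idx


# Toy labels (hypothetical): 12 positive and 8 negative samples.
toy_labels = [1] * 12 + [0] * 8
for train_idx, test_idx in stratified_kfold(toy_labels, k=4):
    positives = sum(toy_labels[i] for i in test_idx)
    print(len(train_idx), len(test_idx), positives)
```

Stratification matters most on small or imbalanced datasets, where a plain random split can leave a fold with very few samples of one class.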

Feature Selection Using Extreme Gradient Boosting Bayesian Optimization to upgrade the Classification Performance of Motor Imagery signals for BCI

Journal of Neuroscience Methods ◽

10.1016/j.jneumeth.2021.109425 ◽

2021 ◽

pp. 109425

Author(s):

T Thenmozhi ◽

R Helen

Keyword(s):

Feature Selection ◽

Motor Imagery ◽

Classification Performance ◽

Bayesian Optimization ◽

Gradient Boosting ◽

Extreme Gradient Boosting

Total Organic Carbon Content Prediction in Lacustrine Shale Using Extreme Gradient Boosting Machine Learning Based on Bayesian Optimization

Geofluids ◽

10.1155/2021/6155663 ◽

2021 ◽

Vol 2021 ◽

pp. 1-18

Author(s):

Xingzhou Liu ◽

Zhi Tian ◽

Chang Chen

Keyword(s):

Machine Learning ◽

Organic Carbon ◽

Total Organic Carbon ◽

Bayesian Optimization ◽

Gradient Boosting ◽

Support Vector ◽

Shale Oil ◽

The Core ◽

Extreme Gradient Boosting ◽

Lacustrine Shale

The total organic carbon (TOC) content is a critical parameter for estimating shale oil resources. However, common TOC prediction methods rely on empirical formulas, and their applicability varies widely from region to region. In this study, a novel data-driven Bayesian optimization extreme gradient boosting (XGBoost) model was proposed to predict the TOC content using wireline log data. The lacustrine shale in the Damintun Sag, Bohai Bay Basin, China, was used as a case study. Firstly, correlation analysis was used to analyze the relationship between the well logs and the core-measured TOC data. Based on the degree of correlation, six logging curves reflecting TOC content were selected to construct the training dataset for machine learning. Then, the performance of the XGBoost model was tested using K-fold cross-validation, and the hyperparameters of the model were determined using a Bayesian optimization method to improve the search efficiency and reduce the uncertainty caused by the rule of thumb. Next, through the analysis of prediction errors, the coefficient of determination (R²) between the TOC content predicted by the XGBoost model and the core-measured TOC content reached 0.9135. The root mean square error (RMSE), mean absolute error (MAE), and mean absolute percentage error (MAPE) were 0.63, 0.77, and 12.55%, respectively. In addition, five commonly used methods, namely, the ΔlogR method, random forest, support vector machine, K-nearest neighbors, and multiple linear regression, were used to predict the TOC content to confirm that the XGBoost model has higher prediction accuracy and better robustness. Finally, the proposed approach was applied to predict the TOC curves of 20 exploration wells in the Damintun Sag. We obtained quantitative contour maps of the TOC content of this block for the first time. The results of this study facilitate the rapid detection of the sweet spots of the lacustrine shale oil.
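
The abstract above reports four error measures: R², RMSE, MAE and MAPE. The sketch below shows how these are conventionally computed from paired measured and predicted values. The function name and the four toy data points are hypothetical and are not taken from the study.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R^2, RMSE, MAE and MAPE (in percent) for paired observations."""
    n = len(y_true)
    errors = [pred - true for true, pred in zip(y_true, y_pred)]
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        # MAPE is undefined when a true value is zero; the toy data avoid that.
        "mape": 100.0 * sum(abs(e / t) for e, t in zip(errors, y_true)) / n,
    }


# Toy values (hypothetical), standing in for core-measured and predicted TOC.
measured = [2.0, 4.0, 5.0, 8.0]
predicted = [2.5, 3.5, 5.0, 7.0]
metrics = regression_metrics(measured, predicted)
print(metrics)
```

On this toy data the residuals are 0.5, -0.5, 0 and -1, which gives R² = 0.92, MAE = 0.5 and MAPE = 12.5%.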

Machine learning imputation of metastatic status from open claims in melanoma patients.

Journal of Clinical Oncology ◽

10.1200/jco.2021.39.15_suppl.e21540 ◽

2021 ◽

Vol 39 (15_suppl) ◽

pp. e21540-e21540

Author(s):

Vivek Prabhakar Vaidya ◽

Rambaksh Prajapati ◽

Sai Vinod Manirevu ◽

Rohini George ◽

Smita Agrawal ◽

...

Keyword(s):

Machine Learning ◽

Claims Data ◽

Model Performance ◽

Ground Truth ◽

Parameter Tuning ◽

Gradient Boosting ◽

Linear Feature ◽

Extreme Gradient Boosting ◽

Secondary Neoplasm ◽

Melanoma Patients

Background: Metastatic status is a crucial variable in most oncology studies but is not available in claims data. The objective of this study is to develop a machine learning model for imputation of metastatic status from claims data, with ground truth derived from highly curated electronic medical record (EMR) data. Methods: We used a set of 11389 melanoma patients from the ConcertAI real world database of intersecting claims and EMR data that includes data from CancerLinQ Discovery. Using features from claims and our gold standard labels from EMR, we built an ML model using extreme gradient boosting (XGBoost), an algorithm that iteratively combines a set of decision trees into a single model. We used 60% of the data for training, 20% for hyper-parameter tuning, and 20% for holdout testing. The model was built using 55 features. Results: The table below summarizes results. Metrics are on the final holdout set, which was unseen by the model and entirely composed of highly curated EMR data. Conclusions: We are able to build a high precision model for the imputation of metastatic melanoma status using claims data. This could enable significantly better use of claims data, stemming from the ability to find a metastatic cohort with very few false positives, and provide more precise cohort identification for comparative effectiveness studies. We found features such as secondary neoplasm diagnosis, anti-neoplastic meds, and radiation ranking highly in our analysis of model feature importances. Using techniques to analyze non-linear feature interactions in our AI model, we found an interaction relationship between long term anti-neoplastic therapy, reported pain and metastatic status, which we plan to study further. This work is preliminary and we are working to further improve model performance. [Table: see text]
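
The abstract above describes a 60% / 20% / 20% division into training, hyper-parameter tuning and holdout sets. A minimal sketch of such a three-way split over sample indices follows. The abstract does not say how patients were assigned, so the seeded random shuffle and the helper name are assumptions made for illustration.

```python
import random


def three_way_split(n_samples, fractions=(0.6, 0.2, 0.2), seed=0):
    """Shuffle indices and cut them into train / tune / holdout partitions."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # assumed: a simple random assignment
    n_train = int(fractions[0] * n_samples)
    n_tune = int(fractions[1] * n_samples)
    train = indices[:n_train]
    tune = indices[n_train:n_train + n_tune]
    holdout = indices[n_train + n_tune:]  # remainder absorbs rounding leftovers
    return train, tune, holdout


# Cohort size taken from the abstract; the indices are placeholders for patients.
train, tune, holdout = three_way_split(11389)
print(len(train), len(tune), len(holdout))
```

The holdout partition is set aside until tuning is finished, so the reported metrics are not influenced by the hyper-parameter search.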

Regional flood frequency analysis using extreme gradient boosting based on Bayesian optimization.

10.1002/essoar.10509264.1 ◽

2021 ◽

Author(s):

Deva Jarajapu ◽

Rathinasamy Maheswaran

Keyword(s):

Frequency Analysis ◽

Flood Frequency ◽

Bayesian Optimization ◽

Flood Frequency Analysis ◽

Gradient Boosting ◽

Extreme Gradient Boosting ◽

Regional Flood Frequency Analysis ◽

Regional Flood Frequency

Assess the heart disease risk of the Chinese elderly using a predictive model

Advances in Social Sciences Research Journal ◽

10.14738/assrj.72.7911 ◽

2020 ◽

Vol 7 (2) ◽

pp. 251-262

Author(s):

Yu Fu

Keyword(s):

Heart Disease ◽

Predictive Model ◽

Chronic Diseases ◽

Economic Condition ◽

Disease Risk ◽

Parameter Tuning ◽

Gradient Boosting ◽

Time Range ◽

Data Set ◽

Extreme Gradient Boosting

The accelerating aging process worldwide makes chronic diseases the predominant risk for public health, and heart disease is among the top causes of mortality in the elderly. Studies have verified that interventions can prevent, reduce or delay the onset of chronic diseases. This paper aims to find the domain predictors of heart disease by applying a machine learning technique, Extreme Gradient Boosting, to 89 predictors extracted from genetic, lifestyle, economic condition, isolation, stressful life events, nutrition and availability of medical service indexes. The individual-level data come from the Chinese Longitudinal Healthy Longevity Survey, covering the periods 2000 to 2002 and 2011 to 2014. We apply imputation and oversampling techniques to improve the prediction performance and use a step-by-step parameter tuning process to get the best hyper-parameters needed in the modeling. The fitted predictive model reaches a prediction accuracy of above 90% on the independent test data set. Comparing the first investigated period of 2000 to 2002 with the second period of 2011 to 2014, we find that the predictors associated with economic condition play an important role in the prediction. The nutrition factor, surprisingly, does not contribute significantly to the prediction capability.

A Data-Driven Approach for Lithology Identification Based on Parameter-Optimized Ensemble Learning

Energies ◽

10.3390/en13153903 ◽

2020 ◽

Vol 13 (15) ◽

pp. 3903

Author(s):

Zhixue Sun ◽

Baosheng Jiang ◽

Xiangling Li ◽

Jikang Li ◽

Kang Xiao

Keyword(s):

Characteristic Curve ◽

Confusion Matrix ◽

Petroleum Exploration ◽

Bayesian Optimization ◽

Gradient Boosting ◽

Reservoir Properties ◽

Gas Field ◽

Lithology Identification ◽

Extreme Gradient Boosting ◽

Data Driven Approach

The identification of underground formation lithology can serve as a basis for petroleum exploration and development. This study integrates Extreme Gradient Boosting (XGBoost) with Bayesian Optimization (BO) for formation lithology identification and comprehensively evaluates the performance of the proposed classifier based on the metrics of the confusion matrix, precision, recall, F1-score and the area under the receiver operating characteristic curve (AUC). The data of this study are derived from the Daniudui gas field and the Hangjinqi gas field and include 2153 samples with known lithology facies classes, each sample having seven measured properties (well log curves) and a corresponding depth. The results show that BO significantly improves parameter optimization efficiency. The AUC values of the test sets of the two gas fields are 0.968 and 0.987, respectively, indicating that the proposed method has very high generalization performance. Additionally, we compare the proposed algorithm with Gradient Tree Boosting-Differential Evolution (GTB-DE) using the same dataset. The results demonstrate that the average precision, recall and F1 score of the proposed method are 4.85%, 5.7% and 3.25% greater, respectively, than those of GTB-DE. The proposed XGBoost-BO ensemble model can automate the procedure of lithology identification, and it may also be used in the prediction of other reservoir properties.
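
The evaluation above relies on a confusion matrix and on precision, recall and F1 score averaged over classes. The sketch below shows how per-class scores and their unweighted (macro) average can be derived from a confusion matrix. The function names and the 3-class matrix are hypothetical, and the abstract does not state which averaging scheme the authors used.

```python
def per_class_scores(confusion):
    """confusion[i][j] counts samples of true class i predicted as class j."""
    n_classes = len(confusion)
    scores = []
    for c in range(n_classes):
        tp = confusion[c][c]
        fp = sum(confusion[r][c] for r in range(n_classes)) - tp  # column minus diagonal
        fn = sum(confusion[c]) - tp                               # row minus diagonal
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2.0 * precision * recall / denom if denom else 0.0
        scores.append((precision, recall, f1))
    return scores


def macro_average(scores):
    """Unweighted mean of (precision, recall, F1) across classes."""
    return tuple(sum(s[m] for s in scores) / len(scores) for m in range(3))


# Toy 3-class confusion matrix (hypothetical lithology classes).
toy_confusion = [
    [8, 2, 0],
    [1, 7, 2],
    [0, 1, 9],
]
class_scores = per_class_scores(toy_confusion)
print(class_scores)
print(macro_average(class_scores))
```

For the first toy class, 8 of the 9 samples predicted as that class are correct (precision 8/9) and 8 of its 10 true samples are recovered (recall 0.8).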

Traffic Incident Clearance Time Prediction and Influencing Factor Analysis Using Extreme Gradient Boosting Model

Journal of Advanced Transportation ◽

10.1155/2020/6401082 ◽

2020 ◽

Vol 2020 ◽

pp. 1-12

Author(s):

Jinjun Tang ◽

Lanlan Zheng ◽

Chunyang Han ◽

Fang Liu ◽

Jianming Cai

Keyword(s):

Factor Analysis ◽

Dimensional Space ◽

Tracking System ◽

Bayesian Optimization ◽

Gradient Boosting ◽

Incident Management ◽

Clearance Time ◽

Explanatory Variables ◽

Traffic Incident ◽

Extreme Gradient Boosting

Accurate prediction and reliable significant factor analysis of incident clearance time are two main objectives of a traffic incident management (TIM) system, as they could help to relieve traffic congestion caused by traffic incidents. This study applies the extreme gradient boosting machine algorithm (XGBoost) to predict incident clearance time on freeways and analyze the significant factors of clearance time. XGBoost integrates the strengths of statistical and machine learning methods: it can flexibly deal with nonlinear data in high-dimensional space and quantify the relative importance of the explanatory variables. The data collected from the Washington Incident Tracking System in 2011 are used in this research. To investigate the underlying patterns hidden in the data, K-means is chosen to cluster the data into two clusters. An XGBoost model is built for each cluster. Bayesian optimization is used to optimize the parameters of XGBoost, and the MAPE is considered as the predictive indicator to evaluate the prediction performance. A comparative study confirms that XGBoost outperforms other models. In addition, response time, AADT (annual average daily traffic), incident type, and lane closure type are identified as the significant explanatory variables for clearance time.
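
The workflow above clusters the incidents into two groups with K-means, fits one model per cluster, and scores predictions with MAPE. The sketch below illustrates only that structure, under strong simplifying assumptions: clustering uses a single hypothetical feature (response time), the per-cluster "model" is just the cluster mean rather than XGBoost, and the records are invented toy values.

```python
def kmeans_two_clusters(values, n_iter=100):
    """Lloyd's algorithm on 1-D data with k=2, seeded at the min and max values."""
    centers = [min(values), max(values)]
    assignment = None
    for _ in range(n_iter):
        new_assignment = [
            0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1 for v in values
        ]
        if new_assignment == assignment:  # converged: no point changed cluster
            break
        assignment = new_assignment
        for c in (0, 1):
            members = [v for v, a in zip(values, assignment) if a == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = sum(members) / len(members)
    return assignment, centers


def mape(actual, predicted):
    """Mean absolute percentage error, in percent (actual values must be non-zero)."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


# Toy records (hypothetical): (response time, clearance time), both in minutes.
records = [(5, 20), (6, 25), (4, 22), (30, 90), (35, 110), (7, 24), (32, 100), (28, 95)]
assignment, centers = kmeans_two_clusters([r[0] for r in records])

cluster_mape = {}
for c in (0, 1):
    clearance = [rec[1] for rec, a in zip(records, assignment) if a == c]
    baseline = sum(clearance) / len(clearance)  # stand-in for a per-cluster model
    cluster_mape[c] = mape(clearance, [baseline] * len(clearance))
print(assignment, centers, cluster_mape)
```

Fitting separate models per cluster lets each one specialize on a more homogeneous group of incidents, at the cost of having less training data per model.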
