A stacked generalization ensemble model for optimization and prediction of the gas well rate of penetration: a case study in Xinjiang

Abstract In gas drilling operations, the rate of penetration (ROP) has an important influence on drilling costs. Predicting ROP makes it possible to optimize the drilling operational parameters and reduce overall cost. To predict ROP with satisfactory precision, a stacked generalization ensemble model is developed in this paper. Drilling data were collected from a shale gas survey well in Xinjiang, northwestern China. First, Pearson correlation analysis is used for feature selection. Then, a Savitzky-Golay smoothing filter is used to reduce noise in the dataset. In the next stage, we propose a stacked generalization ensemble model that combines six machine learning models: support vector regression (SVR), extremely randomized trees (ET), random forest (RF), gradient boosting machine (GB), light gradient boosting machine (LightGBM) and extreme gradient boosting (XGB). The stacked model generates meta-data from the five base models (SVR, ET, RF, GB, LightGBM), which an XGB meta-model then uses to compute ROP predictions. Then, the leave-one-out method is used to verify modeling performance. The stacked model performs better than each single model, achieving R² = 0.9568 and a root mean square error of 0.4853 m/h on the testing dataset. Hence, the proposed approach will be useful in optimizing gas drilling. Finally, the particle swarm optimization (PSO) algorithm is used to optimize the relevant ROP parameters.
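
The stacking step described in the abstract can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions, not the authors' pipeline: two toy base learners (a least-squares line and a 1-nearest-neighbour regressor) stand in for SVR, ET, RF, GB and LightGBM, and a single least-squares blending weight stands in for the XGB meta-model. All function names and the toy data are hypothetical.

```python
def fit_linear(xs, ys):
    """Base learner 1: least-squares line y = a*x + b. Returns a predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    b = my - a * mx
    return lambda x: a * x + b


def fit_1nn(xs, ys):
    """Base learner 2: 1-nearest-neighbour regressor. Returns a predictor."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def out_of_fold(fit, xs, ys, k):
    """Meta-data: predict each sample with a model that never saw it."""
    preds = [0.0] * len(xs)
    for fold in range(k):
        train = [i for i in range(len(xs)) if i % k != fold]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        for i in range(fold, len(xs), k):
            preds[i] = model(xs[i])
    return preds


def fit_stack(xs, ys, k=5):
    """Fit the meta-learner on out-of-fold predictions, then refit the bases."""
    p1 = out_of_fold(fit_linear, xs, ys, k)
    p2 = out_of_fold(fit_1nn, xs, ys, k)
    # Closed-form weight w minimising sum (y - (w*p1 + (1-w)*p2))^2.
    den = sum((a - b) ** 2 for a, b in zip(p1, p2))
    num = sum((y - b) * (a - b) for y, a, b in zip(ys, p1, p2))
    w = num / den if den else 0.5
    m1, m2 = fit_linear(xs, ys), fit_1nn(xs, ys)
    return lambda x: w * m1(x) + (1 - w) * m2(x)


def rmse(y_true, y_pred):
    """Root mean square error."""
    return (sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)) ** 0.5


# Toy data (hypothetical): a smooth nonlinear target.
xs = [i / 10 for i in range(40)]
ys = [x * x for x in xs]
stacked = fit_stack(xs, ys)
print(rmse(ys, [stacked(x) for x in xs]))
```

The key design point is that the meta-learner is trained only on out-of-fold predictions, so it does not reward a base model for memorising its own training data.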


Establishing a Credit Risk Evaluation System for SMEs Using the Soft Voting Fusion Model

Risks ◽

10.3390/risks9110202 ◽

2021 ◽

Vol 9 (11) ◽

pp. 202

Author(s):

Ge Gao ◽

Hongxin Wang ◽

Pengbin Gao

Keyword(s):

Credit Risk ◽

Evaluation System ◽

Predictive Accuracy ◽

Assessment System ◽

Gradient Boosting ◽

Support Vector ◽

Fusion Model ◽

Light Gradient ◽

Extreme Gradient Boosting ◽

The Government

In China, SMEs are facing financing difficulties, and commercial banks and financial institutions are the main financing channels for SMEs. Thus, a reasonable and efficient credit risk assessment system is important for credit markets. Based on traditional statistical methods and AI technology, a soft voting fusion model, which incorporates logistic regression, support vector machine (SVM), random forest (RF), eXtreme Gradient Boosting (XGBoost), and Light Gradient Boosting Machine (LightGBM), is constructed to improve the predictive accuracy of SMEs’ credit risk. To verify the feasibility and effectiveness of the proposed model, we use data from 123 SMEs nationwide that worked with a Chinese bank from 2016 to 2020, including financial information and default records. The results show that the accuracy of the soft voting fusion model is higher than that of a single machine learning (ML) algorithm, which provides a theoretical basis for the government to control credit risk in the future and offers important references for banks to make credit decisions.
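
Soft voting, as named in the abstract, averages the class probabilities predicted by several models and picks the class with the highest average. The sketch below is illustrative only: the function name, the optional weights, and the probability values are hypothetical and do not come from the paper.

```python
def soft_vote(prob_lists, weights=None):
    """Average per-class probabilities across models; return (label, averages)."""
    weights = weights or [1.0] * len(prob_lists)
    total = sum(weights)
    n_classes = len(prob_lists[0])
    avg = [
        sum(w * p[c] for w, p in zip(weights, prob_lists)) / total
        for c in range(n_classes)
    ]
    return avg.index(max(avg)), avg


# Hypothetical outputs [P(no default), P(default)] from three models for one SME.
probs = [[0.90, 0.10], [0.45, 0.55], [0.45, 0.55]]
label, avg = soft_vote(probs)
print(label, avg)
```

In this made-up case two of the three models lean towards class 1, so a hard majority vote would pick class 1, while soft voting picks class 0 because the first model is far more confident.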


Classification of Hot Spots using XGBoost and LightGBM Algorithms

International Journal of Engineering and Advanced Technology - Regular Issue ◽

10.35940/ijeat.e9459.069520 ◽

2020 ◽

Vol 9 (5) ◽

pp. 722-724

Keyword(s):

Computational Methods ◽

Protein Interactions ◽

Hot Spots ◽

Cell Metabolism ◽

Pearson Correlation ◽

Classification Performance ◽

Gradient Boosting ◽

Support Vector ◽

Extreme Gradient Boosting ◽

Hub Proteins

Protein-protein interactions (PPIs) play a significant role in biological functions such as cell metabolism, immune response, and signal transduction. Hot spots are small fractions of interface residues that provide substantial binding energy in PPIs. Identifying hot spots is therefore important for discovering and analyzing molecular medicines and diseases. The current strategy, alanine scanning, is not suited to large-scale applications because the technique is costly and time-consuming. Existing computational methods show poor classification performance and prediction accuracy. They are concerned with the topological structure and gene expression of hub proteins. The proposed system focuses on hot spots of hub proteins, eliminating redundant and highly correlated features using the Pearson correlation coefficient and support vector machine-based feature elimination. Extreme gradient boosting and LightGBM algorithms are used to combine a set of weak classifiers into a strong classifier. The proposed system shows better accuracy than the existing computational methods. The model can also be used to predict accurate molecular inhibitors for specific PPIs.
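
Removing highly correlated features with the Pearson correlation coefficient can be sketched as below. This is an assumption-laden illustration: the 0.9 threshold, the greedy keep-first rule, and the feature names and values are hypothetical, and the abstract does not state which rule the authors used.

```python
def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0  # a constant column has no defined correlation
    return cov / (va * vb) ** 0.5


def drop_correlated(features, threshold=0.9):
    """Keep a feature only if |r| with every already-kept feature is below threshold."""
    kept = []
    for name, column in features.items():
        if all(abs(pearson(column, features[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical feature table: f2 is almost a rescaled copy of f1.
features = {
    "f1": [1, 2, 3, 4, 5],
    "f2": [2, 4, 6, 8, 10.5],
    "f3": [5, 1, 4, 2, 3],
}
print(drop_correlated(features))
```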


Interpretable Machine Learning for Early Neurological Deterioration Prediction in Atrial Fibrillation-Related Stroke

10.21203/rs.3.rs-446890/v1 ◽

2021 ◽

Author(s):

Seong Hwan Kim ◽

Eun-Tae Jeon ◽

Sungwook Yu ◽

Kyungmi O ◽

Chi Kyung Kim ◽

...

Keyword(s):

Machine Learning ◽

Atrial Fibrillation ◽

Neurological Deterioration ◽

Gradient Boosting ◽

Support Vector ◽

Light Gradient ◽

Interpretable Machine Learning ◽

Extreme Gradient Boosting ◽

Early Neurological Deterioration ◽

Feature Importance

Abstract We aimed to develop a novel prediction model for early neurological deterioration (END) based on an interpretable machine learning (ML) algorithm for atrial fibrillation (AF)-related stroke and to evaluate the prediction accuracy and feature importance of ML models. Data from multi-center prospective stroke registries in South Korea were collected. After stepwise data preprocessing, we utilized logistic regression, support vector machine, extreme gradient boosting, light gradient boosting machine (LightGBM), and multilayer perceptron models. We used the Shapley additive explanations (SHAP) method to evaluate feature importance. Of the 3,623 stroke patients, the 2,363 who had arrived at the hospital within 24 hours of symptom onset and had available information regarding END were included. Of these, 318 (13.5%) had END. The LightGBM model showed the highest area under the receiver operating characteristic curve (0.778; 95% CI, 0.726-0.830). The feature importance analysis revealed that fasting glucose level and the National Institutes of Health Stroke Scale score were the most influential factors. Among ML algorithms, the LightGBM model was particularly useful for predicting END, as it revealed new and diverse predictors. Additionally, the SHAP method can be adjusted to individualize the features’ effects on the predictive power of the model.
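
The area under the ROC curve reported above can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. A minimal sketch of that pairwise definition follows; the function name and the four example scores are hypothetical, and the quadratic loop is meant for clarity rather than speed.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = END) and model scores.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))  # 3 of 4 pairs are ordered correctly -> 0.75
```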


Convolutional Neural Network Classifies Pathological Voice Change in Laryngeal Cancer with High Accuracy

Journal of Clinical Medicine ◽

10.3390/jcm9113415 ◽

2020 ◽

Vol 9 (11) ◽

pp. 3415

Author(s):

HyunBum Kim ◽

Juhyeong Jeon ◽

Yeon Jae Han ◽

YoungHoon Joo ◽

Jonghwan Lee ◽

...

Keyword(s):

Neural Network ◽

Convolutional Neural Network ◽

Sensitivity And Specificity ◽

Laryngeal Cancer ◽

Healthy Subjects ◽

Gradient Boosting ◽

Support Vector ◽

Vowel Sound ◽

Light Gradient ◽

Extreme Gradient Boosting

Voice changes may be the earliest signs in laryngeal cancer. We investigated whether automated voice signal analysis can be used to distinguish patients with laryngeal cancer from healthy subjects. We extracted features using the software package for speech analysis in phonetics (PRAAT) and calculated the Mel-frequency cepstral coefficients (MFCCs) from voice samples of a vowel sound of /a:/. The proposed method was tested with six algorithms: support vector machine (SVM), extreme gradient boosting (XGBoost), light gradient boosted machine (LGBM), artificial neural network (ANN), one-dimensional convolutional neural network (1D-CNN) and two-dimensional convolutional neural network (2D-CNN). Their performances were evaluated in terms of accuracy, sensitivity, and specificity. The result was compared with human performance. A total of four volunteers, two of whom were trained laryngologists, rated the same files. The 1D-CNN showed the highest accuracy of 85%, with sensitivity and specificity levels of 78% and 93%. The two laryngologists achieved an accuracy of 69.9% but a sensitivity of 44%. Automated analysis of voice signals could differentiate subjects with laryngeal cancer from healthy subjects with higher diagnostic performance than the four volunteers.
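
For reference, the three evaluation measures named in the abstract follow directly from the confusion-matrix counts. The sketch below uses made-up counts chosen only to make the arithmetic easy to follow; they are not the study's data, and the function name is hypothetical.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity and specificity from confusion-matrix counts.

    tp/fn: patients correctly / wrongly classified.
    tn/fp: healthy subjects correctly / wrongly classified.
    """
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "sensitivity": tp / (tp + fn),  # share of patients detected
        "specificity": tn / (tn + fp),  # share of healthy subjects cleared
    }


# Hypothetical counts: 10 patients and 10 healthy subjects.
metrics = diagnostic_metrics(tp=8, fn=2, tn=9, fp=1)
print(metrics)
```

A classifier can reach a respectable accuracy while missing many patients, which is why sensitivity is reported separately.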


Predicting Hard Rock Pillar Stability Using GBDT, XGBoost, and LightGBM Algorithms

Mathematics ◽

10.3390/math8050765 ◽

2020 ◽

Vol 8 (5) ◽

pp. 765 ◽

Cited By ~ 6

Author(s):

Weizhang Liang ◽

Suizhi Luo ◽

Guoyan Zhao ◽

Hao Wu

Keyword(s):

Large Scale ◽

Prediction Models ◽

Hard Rock ◽

Gradient Boosting ◽

Pillar Stability ◽

Rock Pillar ◽

Light Gradient ◽

Gradient Boosting Machine ◽

Extreme Gradient Boosting ◽

Hard Rock Mines

Predicting pillar stability is a vital task in hard rock mines as pillar instability can cause large-scale collapse hazards. However, it is challenging because pillar stability is affected by many factors. With the accumulation of pillar stability cases, machine learning (ML) has shown great potential to predict pillar stability. This study aims to predict hard rock pillar stability using gradient boosting decision tree (GBDT), extreme gradient boosting (XGBoost), and light gradient boosting machine (LightGBM) algorithms. First, 236 cases with five indicators were collected from seven hard rock mines. Afterwards, the hyperparameters of each model were tuned using a five-fold cross validation (CV) approach. Based on the optimal hyperparameter configuration, prediction models were constructed using the training set (70% of the data). Finally, the test set (30% of the data) was adopted to evaluate the performance of each model. The precision, recall, and F1 indexes were utilized to analyze the prediction results of each level, and the accuracy and their macro average values were used to assess the overall prediction performance. Based on the sensitivity analysis of indicators, the relative importance of each indicator was obtained. In addition, the safety factor approach and other ML algorithms were adopted as comparisons. The results showed that GBDT, XGBoost, and LightGBM algorithms achieved a better comprehensive performance, and their prediction accuracies were 0.8310, 0.8310, and 0.8169, respectively. The average pillar stress and ratio of pillar width to pillar height had the most important influences on prediction results. The proposed methodology can provide a reliable reference for pillar design and stability risk management.
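
The per-level precision, recall and F1 values and their macro averages can be computed as sketched below. The class labels "A", "B", "C" and the six predictions are hypothetical placeholders, not the stability levels or results from the paper.

```python
def macro_prf(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall and F1 over all classes."""
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        n_pred = sum(1 for p in y_pred if p == c)
        n_true = sum(1 for t in y_true if t == c)
        prec = tp / n_pred if n_pred else 0.0
        rec = tp / n_true if n_true else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(classes)
    return {
        "accuracy": sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Hypothetical three-level example.
y_true = ["A", "A", "B", "B", "C", "C"]
y_pred = ["A", "B", "B", "B", "C", "A"]
report = macro_prf(y_true, y_pred)
print(report)
```

Macro averaging gives each level equal weight regardless of how many cases it has, so a rare level that is predicted poorly pulls the average down as much as a common one would.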


Protein pKa prediction by tree-based machine learning

10.26434/chemrxiv-2021-4d420 ◽

2021 ◽

Author(s):

Ada Y. Chen ◽

Juyong Lee ◽

Ana Damjanovic ◽

Bernard R. Brooks

Keyword(s):

Machine Learning ◽

Machine Learning Algorithms ◽

Gradient Boosting ◽

Pka Prediction ◽

Light Gradient ◽

Structure Database ◽

Gradient Boosting Machine ◽

Extreme Gradient Boosting ◽

Better Than ◽

Protein Pka

We present four tree-based machine learning models for protein pKa prediction. The four models, Random Forest, Extra Trees, eXtreme Gradient Boosting (XGBoost) and Light Gradient Boosting Machine (LightGBM), were trained on three experimental PDB and pKa datasets, two of which included a notable portion of internal residues. We observed similar performance among the four machine learning algorithms. The best model trained on the largest dataset performs 37% better than the widely used empirical pKa prediction tool PROPKA. The overall RMSE for this model is 0.69, with surface and buried RMSE values being 0.56 and 0.78, respectively, considering six residue types (Asp, Glu, His, Lys, Cys and Tyr), and 0.63 when considering Asp, Glu, His and Lys only. We provide pKa predictions for proteins in the human proteome from the AlphaFold Protein Structure Database and observe that 1% of Asp/Glu/Lys residues have highly shifted pKa values close to the physiological pH.


Landslide Susceptibility Analysis using Gradient Boosting Models: A Case Study in Penang Island, Malaysia

Disaster Advances ◽

10.25303/148da2221 ◽

2021 ◽

pp. 22-37

Author(s):

Han Gao ◽

Pei Shan Fam ◽

Lea Tien Tay ◽

Heng Chin Low

Keyword(s):

Feature Selection ◽

Landslide Susceptibility ◽

Roc Curves ◽

Spatial Prediction ◽

Prediction Performance ◽

Gradient Boosting ◽

Support Vector ◽

Prediction Ability ◽

Light Gradient ◽

Extreme Gradient Boosting

Tree-based gradient boosting (TGB) models have gained popularity in various areas due to their powerful prediction ability and fast processing speed. This study aims to compare the landslide spatial prediction performance of TGB models and non-tree-based machine learning (NML) models in Penang Island, Malaysia. Two specific instances of TGB models, eXtreme Gradient Boosting (XGBoost) and Light Gradient Boosting Machine (LightGBM), and two specific instances of NML models, artificial neural network (ANN) and support vector machine (SVM), are applied to make predictions of landslide susceptibility. Feature selection and oversampling techniques are considered to improve the prediction performance as well. The results are analyzed and discussed mainly based on receiver operating characteristic (ROC) curves as well as the area under the curves (AUC). The results show that TGB models give better prediction performance than NML models, regardless of sample size. The TGB models’ performances improve when they are trained on datasets prepared with either feature selection or oversampling techniques. The highest AUC value of 0.9525 is obtained from the combination of XGBoost and SMOTE. The landslide susceptibility maps (LSMs) produced by XGBoost and LightGBM can provide valuable information in landslide management and mitigation in Penang Island, Malaysia.
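
SMOTE, the oversampling technique mentioned in the abstract, creates synthetic minority-class samples by interpolating between a minority sample and one of its nearest minority neighbours. The sketch below shows that core idea only. It is a simplified illustration with hypothetical names, points and parameter values, not a substitute for a library implementation.

```python
import random


def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def smote(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points from a list of minority-class points.

    Each new point lies on the segment between a random minority point and
    one of its k nearest minority neighbours. Needs at least two points.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: dist2(base, p))[:k]
        target = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([b + gap * (t - b) for b, t in zip(base, target)])
    return synthetic


# Hypothetical minority class: four points on the unit square.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(minority, n_new=6)
print(new_points)
```

Oversampling of this kind should be applied to the training split only, otherwise synthetic points derived from test cases leak into training.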


Child’s Target Height Prediction Evolution

Applied Sciences ◽

10.3390/app9245447 ◽

2019 ◽

Vol 9 (24) ◽

pp. 5447 ◽

Cited By ~ 1

Author(s):

João Rala Cordeiro ◽

Octavian Postolache ◽

João C. Ferreira

Keyword(s):

Prediction Accuracy ◽

Population Studies ◽

Gradient Boosting ◽

Target Height ◽

New Approach ◽

Light Gradient ◽

Gradient Boosting Machine ◽

Extreme Gradient Boosting ◽

Height Prediction ◽

Growth Assessment

This study is a contribution to the improvement of healthcare for children and for society generally. It aims to predict children’s height when they become adults, also known as “target height”, to allow for a better growth assessment and more personalized healthcare. The existing literature describes prediction methods, based on longitudinal population studies and statistical techniques, which are able to produce acceptable results with few information resources. The challenge of this study is to use a new approach based on machine learning to forecast the target height for children and (eventually) improve the existing height prediction accuracy. The goals of the study were achieved. The extreme gradient boosting regression (XGB) and light gradient boosting machine regression (LightGBM) algorithms achieved considerably better results on the height prediction. The developed model can be usefully applied by pediatricians and other clinical professionals in growth assessment.


Forecasting the Walking Assistance Rehabilitation Level of Stroke Patients Using Artificial Intelligence

Diagnostics ◽

10.3390/diagnostics11061096 ◽

2021 ◽

Vol 11 (6) ◽

pp. 1096

Author(s):

Kanghyeon Seo ◽

Bokjin Chung ◽

Hamsa Priya Panchaseelan ◽

Taewoo Kim ◽

Hyejung Park ◽

...

Keyword(s):

Predictive Performance ◽

Medical Rehabilitation ◽

Gradient Boosting ◽

Support Vector ◽

Prescription Data ◽

Automated Classification ◽

Classification Models ◽

Light Gradient ◽

Gradient Boosting Machine ◽

Tree Models

Cerebrovascular accidents (CVA) cause a range of impairments in coordination, such as a spectrum of walking impairments ranging from mild gait imbalance to complete loss of mobility. Patients with CVA need personalized approaches tailored to their degree of walking impairment for effective rehabilitation. This paper aims to evaluate the validity of using various machine learning (ML) and deep learning (DL) classification models (support vector machine, Decision Tree, Perceptron, Light Gradient Boosting Machine, AutoGluon, SuperTML, and TabNet) for automated classification of walking assistant devices for CVA patients. We reviewed a total of 383 CVA patients’ (1623 observations) prescription data for eight different walking assistant devices from five hospitals. Among the classification models, the advanced tree-based classification models (LightGBM and tree models in AutoGluon) achieved classification results of over 90% accuracy, recall, precision, and F1-score. In particular, AutoGluon not only presented the highest predictive performance (almost 92% in accuracy, recall, precision, and F1-score, and 86.8% in balanced accuracy) but also demonstrated that the classification performances of the tree-based models were higher than that of the other models on its leaderboard. Therefore, we believe that tree-based classification models have potential as practical diagnosis tools for medical rehabilitation.


Interpretable Machine Learning Model to Predict Rupture of Small Intracranial Aneurysms and Facilitate Clinical Decision

10.21203/rs.3.rs-1015315/v1 ◽

2021 ◽

Author(s):

WeiGen Xiong ◽

TingTing Chen ◽

ZhiHong Zhao ◽

XueMei Li ◽

YaJie Shan ◽

...

Keyword(s):

Machine Learning ◽

Intracranial Aneurysms ◽

External Validation ◽

Maximum Size ◽

Clinical Decision ◽

Gradient Boosting ◽

Support Vector ◽

Rupture Risk ◽

Light Gradient ◽

Extreme Gradient Boosting

Abstract Estimating the rupture risk of small intracranial aneurysms (IAs) to determine whether to treat is difficult but crucial. We aimed to construct and externally validate a convenient machine learning (ML) model for assessing the rupture risk of small IAs. A total of 1004 patients with small IAs recruited from two hospitals were included in our retrospective research. The patients at hospital 1 were randomly stratified into a training set (70%) and an internal validation set (30%), and the patients at hospital 2 were used for external validation. We selected predictive features using the least absolute shrinkage and selection operator (LASSO) method, and constructed five ML models applying diverse algorithms including random forest classifier (RFC), categorical boosting (CatBoost), support vector machine (SVM) with linear kernel, light gradient boosting machine (LightGBM) and extreme gradient boosting (XGBoost). The Shapley Additive Explanations (SHAP) analysis provided interpretation for the best ML model. The training, internal and external validation cohorts included 658, 282, and 64 IAs, respectively. The best performance was presented by SVM, with an AUC of 0.817 in the internal [95% confidence interval (CI), 0.769-0.866] and 0.893 in the external (95% CI, 0.808-0.979) validation cohorts, significantly outperforming the PHASES score (all P < 0.001). SHAP analysis showed that maximum size, location and irregular shape were the top three features for predicting rupture. Our SVM model, based on readily accessible features, showed satisfactory discrimination in predicting rupture of small IAs. Morphological parameters made important contributions to the prediction result.
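
A 70/30 split that preserves class proportions can be sketched as follows. This assumes that "stratified" in the abstract means the ruptured/unruptured ratio was kept equal in both sets, which the abstract does not confirm; the function name, seed, and label counts are hypothetical.

```python
import random


def stratified_split(labels, train_frac=0.7, seed=0):
    """Split sample indices into train/validation, keeping class proportions."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    train, valid = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(train_frac * len(indices))
        train += indices[:cut]
        valid += indices[cut:]
    return sorted(train), sorted(valid)


# Hypothetical cohort: 30 ruptured (1) and 70 unruptured (0) aneurysms.
labels = [1] * 30 + [0] * 70
train, valid = stratified_split(labels)
print(len(train), len(valid))
```

External validation, as in the abstract, goes one step further by testing on patients from a different hospital that contributed nothing to training.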
