Predicting the Tool Wear of a Drilling Process Using Novel Machine Learning XGBoost-SDA

Tool wear negatively impacts the quality of workpieces produced by the drilling process. Accurate prediction of tool wear enables the operator to maintain the machine at the required level of performance. This research presents a novel hybrid machine learning approach for predicting tool wear in a drilling process. The proposed approach optimizes the hyperparameters of the extreme gradient boosting algorithm with a spiral dynamic optimization algorithm (XGBoost-SDA). Simulations were carried out on copper and cast-iron datasets with a high degree of accuracy. Further comparative analyses were performed with support vector machines (SVM) and multilayer perceptron artificial neural networks (MLP-ANN), where XGBoost-SDA showed superior performance. Simulations revealed that XGBoost-SDA accurately predicts flank wear in the drilling process, with mean absolute error (MAE) = 4.67%, MAE = 5.32%, and coefficient of determination R² = 0.9973 for the copper workpiece. Similarly, for the cast iron workpiece, XGBoost-SDA resulted in surface roughness predictions with MAE = 5.25%, root mean square error (RMSE) = 6.49%, and R² = 0.975, which closely agree with the measured values. Performance comparisons between SVM, MLP-ANN, and XGBoost-SDA show that XGBoost-SDA is an effective method that can ensure high predictive accuracy for flank wear values in a drilling process.
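For readers unfamiliar with the error measures quoted in this abstract, the following is a minimal sketch of how MAE, RMSE, and the coefficient of determination are commonly computed. The data values are hypothetical and are not taken from the paper, and the abstract does not say how its errors were normalized to percentages, so the sketch reports raw (unnormalized) values.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error: average magnitude of the prediction errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean square error: square root of the mean squared error."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical measured and predicted wear values (illustration only).
measured = [0.10, 0.20, 0.30, 0.40]
predicted = [0.12, 0.18, 0.33, 0.39]

print(mae(measured, predicted))
print(rmse(measured, predicted))
print(r_squared(measured, predicted))
```

RMSE penalizes large individual errors more heavily than MAE, which is why studies often report both alongside the coefficient of determination.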


iPMI: Machine Learning-Aided Identification of Parametrial Invasion in Women with Early-Stage Cervical Cancer

Diagnostics ◽

10.3390/diagnostics11081454 ◽

2021 ◽

Vol 11 (8) ◽

pp. 1454

Author(s):

Phasit Charoenkwan ◽

Watshara Shoombuatong ◽

Chalaithorn Nantasupha ◽

Tanarat Muangmool ◽

Prapaporn Suprasert ◽

...

Keyword(s):

Machine Learning ◽

Cervical Cancer ◽

Early Stage ◽

Superior Performance ◽

Gradient Boosting ◽

Support Vector ◽

Efficient System ◽

Extreme Gradient Boosting ◽

Independent Test ◽

Early Stage Cervical Cancer

Radical hysterectomy is a recommended treatment for early-stage cervical cancer. However, the procedure is associated with significant morbidities resulting from the removal of the parametrium. Parametrial cancer invasion (PMI) is found in a minority of patients, but an efficient system for predicting it is lacking. In this study, we develop a novel machine learning (ML)-based predictive model based on a random forest model (called iPMI) for the practical identification of PMI in women. Data from 1112 stage IA-IIA cervical cancer patients who underwent primary surgery were collected and used as the training dataset, while data from an independent cohort of 116 consecutive patients were used as the independent test dataset. Based on these datasets, iPMI-Econ was developed using basic clinicopathological data available prior to surgery, while iPMI-Power was introduced by adding pelvic node metastasis and uterine corpus invasion to iPMI-Econ. Both 10-fold cross-validation and independent test results showed that iPMI-Power outperformed other well-known ML classifiers (e.g., logistic regression, decision tree, k-nearest neighbor, multi-layer perceptron, naive Bayes, support vector machine, and extreme gradient boosting) in predicting PMI. It is anticipated that the proposed iPMI may serve as a cost-effective and rapid approach to guide important clinical decision-making.


Machine learning models to identify low adherence to influenza vaccination among Korean adults with cardiovascular disease

BMC Cardiovascular Disorders ◽

10.1186/s12872-021-01925-7 ◽

2021 ◽

Vol 21 (1) ◽

Author(s):

Moojung Kim ◽

Young Jae Kim ◽

Sung Jin Park ◽

Kwang Gi Kim ◽

Pyung Chun Oh ◽

...

Keyword(s):

Machine Learning ◽

Cardiovascular Disease ◽

Influenza Vaccination ◽

Machine Learning Techniques ◽

Gradient Boosting ◽

Support Vector ◽

Age Group ◽

Learning Models ◽

Extreme Gradient Boosting ◽

Machine Learning Models

Background: Annual influenza vaccination is an important public health measure to prevent influenza infections and is strongly recommended for cardiovascular disease (CVD) patients, especially in the current coronavirus disease 2019 (COVID-19) pandemic. The aim of this study is to develop a machine learning model to identify Korean adult CVD patients with low adherence to influenza vaccination. Methods: Adults with CVD (n = 815) from a nationally representative dataset of the Fifth Korea National Health and Nutrition Examination Survey (KNHANES V) were analyzed. Among these adults, 500 (61.4%) had answered "yes" to whether they had received seasonal influenza vaccinations in the past 12 months. The classification process was performed using the logistic regression (LR), random forest (RF), support vector machine (SVM), and extreme gradient boosting (XGB) machine learning techniques. Because the Ministry of Health and Welfare in Korea offers free influenza immunization for the elderly, separate models were developed for the < 65 and ≥ 65 age groups. Results: The accuracy of machine learning models using 16 variables as predictors of low influenza vaccination adherence was compared; for the ≥ 65 age group, XGB (84.7%) and RF (84.7%) have the best accuracies, followed by LR (82.7%) and SVM (77.6%). For the < 65 age group, SVM has the best accuracy (68.4%), followed by RF (64.9%), LR (63.2%), and XGB (61.4%). Conclusions: The machine learning models show comparable performance in classifying adult CVD patients with low adherence to influenza vaccination.


Explainable machine learning can outperform Cox regression predictions and provide insights in breast cancer survival

Scientific Reports ◽

10.1038/s41598-021-86327-7 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Arturo Moncada-Torres ◽

Marissa C. van Maaren ◽

Mathijs P. Hendriks ◽

Sabine Siesling ◽

Gijs Geleijnse

Keyword(s):

Breast Cancer ◽

Machine Learning ◽

Explicit Knowledge ◽

Cox Regression ◽

Metastatic Breast ◽

Gradient Boosting ◽

Support Vector ◽

Netherlands Cancer Registry ◽

Extreme Gradient Boosting ◽

The Impact

Cox Proportional Hazards (CPH) analysis is the standard for survival analysis in oncology. Recently, several machine learning (ML) techniques have been adapted for this task. Although they have been shown to yield results at least as good as classical methods, they are often disregarded because of their lack of transparency and little to no explainability, which are key for their adoption in clinical settings. In this paper, we used data from the Netherlands Cancer Registry of 36,658 non-metastatic breast cancer patients to compare the performance of CPH with ML techniques (Random Survival Forests, Survival Support Vector Machines, and Extreme Gradient Boosting [XGB]) in predicting survival using the c-index. We demonstrated that in our dataset, ML-based models can perform at least as well as the classical CPH regression (c-index ~0.63), and in the case of XGB even better (c-index ~0.73). Furthermore, we used Shapley Additive Explanation (SHAP) values to explain the models’ predictions. We concluded that the difference in performance can be attributed to XGB’s ability to model nonlinearities and complex interactions. We also investigated the impact of specific features on the models’ predictions as well as their corresponding insights. Lastly, we showed that explainable ML can generate explicit knowledge of how models make their predictions, which is crucial in increasing the trust and adoption of innovative ML techniques in oncology and healthcare overall.
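The c-index used to compare the survival models above can be illustrated with a small sketch. This is a simplified version of Harrell's concordance index, assuming right-censored data and ignoring pairs with tied survival times; the function name and the toy data are hypothetical and do not reproduce the paper's implementation.

```python
def concordance_index(times, events, risk_scores):
    """Simplified Harrell's c-index for right-censored survival data.

    A pair (i, j) is comparable when subject i had an observed event
    and a strictly shorter follow-up time than subject j. The pair is
    concordant when the model gives subject i the higher risk score.
    Tied risk scores count as half; tied times are skipped.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    return concordant / comparable


# Hypothetical toy cohort: follow-up time, event flag (1 = event observed,
# 0 = censored), and a model's predicted risk score.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
risk = [0.9, 0.4, 0.5, 0.1]

print(concordance_index(times, events, risk))  # 4 of 5 comparable pairs agree: 0.8
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which gives a sense of scale for the reported values of roughly 0.63 and 0.73.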


Exploring the Mechanism of Crashes with Autonomous Vehicles Using Machine Learning

Mathematical Problems in Engineering ◽

10.1155/2021/5524356 ◽

2021 ◽

Vol 2021 ◽

pp. 1-10

Author(s):

Hengrui Chen ◽

Hong Chen ◽

Ruiyu Zhou ◽

Zhizhen Liu ◽

Xiaoke Sun

Keyword(s):

Machine Learning ◽

Autonomous Vehicles ◽

Classification And Regression Tree ◽

Gradient Boosting ◽

Support Vector ◽

Crash Severity ◽

Apriori Algorithm ◽

Driving Mode ◽

Extreme Gradient Boosting ◽

The Impact

Safety has become a critical obstacle that cannot be ignored in the marketization of autonomous vehicles (AVs). The objective of this study is to explore the mechanism of AV-involved crashes and analyze the impact of each feature on crash severity. We use the Apriori algorithm to explore the causal relationship between multiple factors and thereby the mechanism of crashes. We use various machine learning models, including support vector machine (SVM), classification and regression tree (CART), and eXtreme Gradient Boosting (XGBoost), to analyze crash severity. In addition, we apply Shapley Additive Explanations (SHAP) to interpret the importance of each factor. The results indicate that XGBoost obtains the best result (recall = 75%; G-mean = 67.82%). Both XGBoost and the Apriori algorithm effectively provided meaningful insights about AV-involved crash characteristics and their relationship. Among all these features, vehicle damage, weather conditions, accident location, and driving mode are the most critical. We found that most rear-end crashes involve conventional vehicles bumping into the rear of AVs. Drivers should be extremely cautious when driving in fog, snow, and insufficient light. Drivers should also be careful when driving near intersections, especially in the autonomous driving mode.
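The G-mean reported above is commonly defined, for a two-class problem, as the geometric mean of recall (sensitivity) and specificity. The sketch below assumes that binary definition; the abstract does not state how many severity classes were used, so this is only an illustration with hypothetical confusion-matrix counts.

```python
import math


def g_mean(tp, fn, tn, fp):
    """Geometric mean of recall (sensitivity) and specificity."""
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return math.sqrt(recall * specificity)


def implied_specificity(recall, gmean):
    """Specificity implied by a given recall and G-mean (binary case only)."""
    return gmean ** 2 / recall


# Hypothetical confusion-matrix counts (not from the paper).
print(g_mean(tp=30, fn=10, tn=45, fp=15))

# Under the binary assumption, recall = 0.75 and G-mean = 0.6782 would
# imply a specificity of roughly 0.61.
print(implied_specificity(0.75, 0.6782))
```

Because G-mean collapses to zero when either class is entirely misclassified, it is often preferred over plain accuracy on imbalanced data such as crash-severity records.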


Machine learning as a successful approach for predicting complex spatio–temporal patterns in animal species abundance

Animal Biodiversity and Conservation ◽

10.32800/abc.2021.44.0289 ◽

2021 ◽

pp. 289-301

Author(s):

B. Martín ◽

J. González–Arias ◽

J. A. Vicente–Vírseda

Keyword(s):

Machine Learning ◽

Random Forest ◽

Animal Species ◽

Temporal Patterns ◽

Additive Models ◽

Gradient Boosting ◽

Support Vector ◽

Stochastic Gradient Boosting ◽

Extreme Gradient Boosting ◽

Spatio Temporal

Our aim was to identify an optimal analytical approach for accurately predicting complex spatio–temporal patterns in animal species distribution. We compared the performance of eight modelling techniques (generalized additive models, regression trees, bagged CART, k–nearest neighbors, stochastic gradient boosting, support vector machines, neural network, and random forest, an enhanced form of bootstrap aggregation). We also performed extreme gradient boosting, an enhanced form of gradient boosting, to predict spatial patterns in abundance of migrating Balearic shearwaters based on data gathered within eBird. Derived from open–source datasets, proxies of frontal systems and ocean productivity domains that have been previously used to characterize the oceanographic habitats of seabirds were quantified, and then used as predictors in the models. The random forest model showed the best performance according to the parameters assessed (RMSE value and R²). The correlation between observed and predicted abundance with this model was also considerably high. This study shows that the combination of machine learning techniques and massive data provided by open data sources is a useful approach for identifying the long–term spatial–temporal distribution of species at regional spatial scales.


Interpretable Machine Learning for Early Neurological Deterioration Prediction in Atrial Fibrillation-Related Stroke

10.21203/rs.3.rs-446890/v1 ◽

2021 ◽

Author(s):

Seong Hwan Kim ◽

Eun-Tae Jeon ◽

Sungwook Yu ◽

Kyungmi O ◽

Chi Kyung Kim ◽

...

Keyword(s):

Machine Learning ◽

Atrial Fibrillation ◽

Neurological Deterioration ◽

Gradient Boosting ◽

Support Vector ◽

Light Gradient ◽

Interpretable Machine Learning ◽

Extreme Gradient Boosting ◽

Early Neurological Deterioration ◽

Feature Importance

Abstract We aimed to develop a novel prediction model for early neurological deterioration (END) based on an interpretable machine learning (ML) algorithm for atrial fibrillation (AF)-related stroke and to evaluate the prediction accuracy and feature importance of ML models. Data from multi-center prospective stroke registries in South Korea were collected. After stepwise data preprocessing, we utilized logistic regression, support vector machine, extreme gradient boosting, light gradient boosting machine (LightGBM), and multilayer perceptron models. We used the Shapley additive explanations (SHAP) method to evaluate feature importance. Of the 3,623 stroke patients, the 2,363 who had arrived at the hospital within 24 hours of symptom onset and had available information regarding END were included. Of these, 318 (13.5%) had END. The LightGBM model showed the highest area under the receiver operating characteristic curve (0.778; 95% CI, 0.726-0.830). The feature importance analysis revealed that fasting glucose level and the National Institutes of Health Stroke Scale score were the most influential factors. Among ML algorithms, the LightGBM model was particularly useful for predicting END, as it revealed new and diverse predictors. Additionally, the SHAP method can be adjusted to individualize the features’ effects on the predictive power of the model.


Techniques for Detecting Malware Traffic: A Comprehensive Approach to Feature Selection and Classification

International Journal for Research in Applied Science and Engineering Technology ◽

10.22214/ijraset.2021.39088 ◽

2021 ◽

Vol 9 (12) ◽

pp. 1-10

Author(s):

Harsha A K

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Random Forest ◽

Learning Algorithms ◽

Malware Detection ◽

Machine Learning Algorithms ◽

Gradient Boosting ◽

Support Vector ◽

Steady Increase ◽

Extreme Gradient Boosting

Abstract: Since the advent of encryption, there has been a steady increase in malware being transmitted over encrypted networks. Traditional malware detection approaches such as packet content analysis are ineffective when dealing with encrypted data. In the absence of actual packet contents, we can make use of other features such as packet size, arrival time, source and destination addresses, and other such metadata to detect malware. Such information can be used to train machine learning classifiers to distinguish malicious from benign packets. In this paper, we offer an efficient malware detection approach using machine learning classification algorithms such as support vector machine, random forest, and extreme gradient boosting. We employ an extensive feature selection process to reduce the dimensionality of the chosen dataset. The dataset is then split into training and testing sets. Machine learning algorithms are trained using the training set. These models are then evaluated against the testing set in order to assess their respective performances. We further attempt to tune the hyperparameters of the algorithms in order to achieve better results. Random forest and extreme gradient boosting algorithms performed exceptionally well in our experiments, resulting in area under the curve values of 0.9928 and 0.9998 respectively. Our work demonstrates that malware traffic can be effectively classified using conventional machine learning algorithms and also shows the importance of dimensionality reduction in such classification problems. Keywords: Malware Detection, Extreme Gradient Boosting, Random Forest, Feature Selection.


Evaluating Variable Selection and Machine Learning Algorithms for Estimating Forest Heights by Combining Lidar and Hyperspectral Data

ISPRS International Journal of Geo-Information ◽

10.3390/ijgi9090507 ◽

2020 ◽

Vol 9 (9) ◽

pp. 507

Author(s):

Sanjiwana Arjasakusuma ◽

Sandiaga Swahyu Kusuma ◽

Stuart Phinn

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Learning Algorithms ◽

Principal Component ◽

Hyperspectral Data ◽

Machine Learning Algorithms ◽

Gradient Boosting ◽

Support Vector ◽

Forest Height ◽

Extreme Gradient Boosting

Machine learning has been employed for various mapping and modeling tasks using input variables from different sources of remote sensing data. For feature selection involving data with high spatial and spectral dimensionality, various methods have been developed and incorporated into the machine learning framework to ensure an efficient and optimal computational process. This research aims to assess the accuracy of various feature selection and machine learning methods for estimating forest height using AISA (airborne imaging spectrometer for applications) hyperspectral bands (479 bands) and airborne light detection and ranging (lidar) height metrics (36 metrics), alone and combined. Feature selection and dimensionality reduction using Boruta (BO), principal component analysis (PCA), simulated annealing (SA), and genetic algorithm (GA) in combination with machine learning algorithms such as multivariate adaptive regression spline (MARS), extra trees (ET), support vector regression (SVR) with radial basis function, and extreme gradient boosting (XGB) with trees (XGBtree and XGBdart) and linear (XGBlin) classifiers were evaluated. The results demonstrated that the combinations of BO-XGBdart and BO-SVR delivered the best model performance for estimating tropical forest height by combining lidar and hyperspectral data, with R² = 0.53 and RMSE = 1.7 m (18.4% of nRMSE and 0.046 m of bias) for BO-XGBdart and R² = 0.51 and RMSE = 1.8 m (15.8% of nRMSE and −0.244 m of bias) for BO-SVR. Our study also demonstrated the effectiveness of BO for variable selection; it could reduce 95% of the data to select the 29 most important variables from the initial 516 variables from lidar metrics and hyperspectral data.


Exploiting Rules to Enhance Machine Learning in Extracting Information From Multi-Institutional Prostate Pathology Reports

JCO Clinical Cancer Informatics ◽

10.1200/cci.20.00028 ◽

2020 ◽

pp. 865-874

Author(s):

Enrico Santus ◽

Tal Schuster ◽

Amir M. Tahmasebi ◽

Clara Li ◽

Adam Yala ◽

...

Keyword(s):

Machine Learning ◽

Hybrid Systems ◽

High Performance ◽

Feature Model ◽

Training Data ◽

Gradient Boosting ◽

Support Vector ◽

Data Set ◽

Extreme Gradient Boosting ◽

Pathology Reports

PURPOSE Literature on clinical note mining has highlighted the superiority of machine learning (ML) over hand-crafted rules. Nevertheless, most studies assume the availability of large training sets, which is rarely the case. For this reason, in the clinical setting, rules are still common. We suggest 2 methods to leverage the knowledge encoded in pre-existing rules to inform ML decisions and obtain high performance, even with scarce annotations. METHODS We collected 501 prostate pathology reports from 6 American hospitals. Reports were split into 2,711 core segments, annotated with 20 attributes describing the histology, grade, extension, and location of tumors. The data set was split by institutions to generate a cross-institutional evaluation setting. We assessed 4 systems, namely a rule-based approach, an ML model, and 2 hybrid systems integrating the previous methods: a Rule as Feature model and a Classifier Confidence model. Several ML algorithms were tested, including logistic regression (LR), support vector machine (SVM), and eXtreme gradient boosting (XGB). RESULTS When training on data from a single institution, LR lags behind the rules by 3.5% (F1 score: 92.2% v 95.7%). Hybrid models, instead, obtain competitive results, with Classifier Confidence outperforming the rules by +0.5% (96.2%). When a larger amount of data from multiple institutions is used, LR improves by +1.5% over the rules (97.2%), whereas hybrid systems obtain +2.2% for Rule as Feature (97.7%) and +2.6% for Classifier Confidence (98.3%). Replacing LR with SVM or XGB yielded similar performance gains. CONCLUSION We developed methods to use pre-existing handcrafted rules to inform ML algorithms. These hybrid systems obtain better performance than either rules or ML models alone, even when training data are limited.


Mapping of the Canopy Openings in Mixed Beech–Fir Forest at Sentinel-2 Subpixel Level Using UAV and Machine Learning Approach

Remote Sensing ◽

10.3390/rs12233925 ◽

2020 ◽

Vol 12 (23) ◽

pp. 3925

Author(s):

Ivan Pilaš ◽

Mateo Gašparović ◽

Alan Novkinić ◽

Damir Klobučar

Keyword(s):

Machine Learning ◽

Forest Canopy ◽

Vegetation Index ◽

Predictive Performance ◽

Spatial Extent ◽

Gradient Boosting ◽

Support Vector ◽

Stochastic Gradient Boosting ◽

Extreme Gradient Boosting ◽

Sentinel 2

The presented study demonstrates a bi-sensor approach suitable for rapid and precise up-to-date mapping of forest canopy gaps over a larger spatial extent. The approach makes use of Unmanned Aerial Vehicle (UAV) red, green and blue (RGB) images on smaller areas for highly precise forest canopy mask creation. Sentinel-2 was used as a scaling platform for transferring information from the UAV to a wider spatial extent. Various approaches to improving the predictive performance were examined: (I) the highest R² of the single satellite index was 0.57, (II) the highest R² using multiple features obtained from the single-date S-2 image was 0.624, and (III) the highest R² on the multitemporal set of S-2 images was 0.697. Satellite indices such as Atmospherically Resistant Vegetation Index (ARVI), Infrared Percentage Vegetation Index (IPVI), Normalized Difference Index (NDI45), Pigment-Specific Simple Ratio Index (PSSRa), Modified Chlorophyll Absorption Ratio Index (MCARI), Color Index (CI), Redness Index (RI), and Normalized Difference Turbidity Index (NDTI) were the dominant predictors in most of the Machine Learning (ML) algorithms. The more complex ML algorithms such as the Support Vector Machines (SVM), Random Forest (RF), Stochastic Gradient Boosting (GBM), Extreme Gradient Boosting (XGBoost), and Catboost that provided the best performance on the training set exhibited weaker generalization capabilities. Therefore, a simpler and more robust Elastic Net (ENET) algorithm was chosen for the final map creation.
