Comparison of machine learning methods in predicting binary and multi-class occupational accident severity

Future Events ◽

Taking Action ◽

Significant Factors

Although Machine Learning (ML) is widely used in many fields to examine hidden patterns in complex databases and learn from them to predict future events, its use for predicting the outcome of occupational accidents is relatively sparse. This study applied a diverse set of ML algorithms, namely Multinomial Logistic Regression (MLR), Support Vector Machines (SVM), Single C5.0 Tree (C5), Stochastic Gradient Boosting (SGB), and Neural Network (NN), to classify the severity of occupational accidents in binary (Fatal/NonFatal) and multi-class (Fatal/Major/Minor) outcomes. Comparison of model performance showed Balanced Accuracy to be best for the SVM and SGB methods in the 2-Class case and for SGB in the 3-Class case. Algorithms performed better at predicting fatal accidents than major and minor accidents. The results revealed that ML unveils factors contributing to severity, so that corrective actions can be better targeted. Furthermore, taking action on even some of the most significant factors in a complex accident database with many attributes can prevent the majority of severe accidents. Interpretation of the most significant factors identified for accident prediction suggests the following corrective measures: taking fall prevention actions, prioritizing workplace inspections based on the number of employees, and supplementing safety actions according to workers’ age and experience.
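The Balanced Accuracy metric named in the abstract above is commonly defined as the mean of the per-class recalls, so a rare class such as Fatal weighs as much as a frequent one. The following is a minimal sketch under that assumption; the function name and the toy labels are hypothetical and are not taken from the study.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Return the unweighted mean of per-class recall."""
    total = defaultdict(int)    # number of true cases per class
    correct = defaultdict(int)  # correctly predicted cases per class
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Hypothetical, imbalanced 3-class labels (illustration only).
y_true = ["Fatal"] * 2 + ["Major"] * 3 + ["Minor"] * 5
y_pred = ["Fatal", "Fatal",
          "Major", "Minor", "Minor",
          "Minor", "Minor", "Minor", "Minor", "Major"]

score = balanced_accuracy(y_true, y_pred)
plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"balanced accuracy = {score:.3f}, plain accuracy = {plain_accuracy:.3f}")
```

In this toy case the two scores differ because the per-class recalls (Fatal, Major, Minor) are averaged with equal weight instead of being weighted by class size.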

Machine learning as a successful approach for predicting complex spatio–temporal patterns in animal species abundance

Animal Biodiversity and Conservation ◽

10.32800/abc.2021.44.0289 ◽

2021 ◽

pp. 289-301

Author(s):

B. Martín ◽

J. González–Arias ◽

J. A. Vicente–Vírseda

Keyword(s):

Machine Learning ◽

Random Forest ◽

Animal Species ◽

Temporal Patterns ◽

Additive Models ◽

Gradient Boosting ◽

Support Vector ◽

Extreme Gradient Boosting ◽

Spatio Temporal

Our aim was to identify an optimal analytical approach for accurately predicting complex spatio–temporal patterns in animal species distribution. We compared the performance of eight modelling techniques (generalized additive models, regression trees, bagged CART, k–nearest neighbors, stochastic gradient boosting, support vector machines, neural network, and random forest, an enhanced form of bootstrap aggregating). We also performed extreme gradient boosting, an enhanced form of gradient boosting, to predict spatial patterns in abundance of migrating Balearic shearwaters based on data gathered within eBird. Derived from open–source datasets, proxies of frontal systems and ocean productivity domains that have been previously used to characterize the oceanographic habitats of seabirds were quantified, and then used as predictors in the models. The random forest model showed the best performance according to the parameters assessed (RMSE value and R²). The correlation between observed and predicted abundance with this model was also considerably high. This study shows that the combination of machine learning techniques and massive data provided by open data sources is a useful approach for identifying the long–term spatial–temporal distribution of species at regional spatial scales.
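The two assessment parameters mentioned above, RMSE and R², can be sketched as follows. The abstract does not say which definition of R² was used; this sketch assumes the coefficient of determination (one minus the ratio of residual to total sum of squares), and the sample values are hypothetical.

```python
import math


def rmse(observed, predicted):
    """Root mean squared error between observed and predicted values."""
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot


# Hypothetical abundance counts (illustration only).
observed = [2.0, 4.0, 6.0, 8.0]
predicted = [3.0, 4.0, 5.0, 8.0]

print(f"RMSE = {rmse(observed, predicted):.4f}")
print(f"R^2  = {r_squared(observed, predicted):.4f}")
```

A lower RMSE and a higher R² both indicate a closer fit, which is how the compared models would be ranked under these assumptions.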

Mapping of the Canopy Openings in Mixed Beech–Fir Forest at Sentinel-2 Subpixel Level Using UAV and Machine Learning Approach

Remote Sensing ◽

10.3390/rs12233925 ◽

2020 ◽

Vol 12 (23) ◽

pp. 3925

Author(s):

Ivan Pilaš ◽

Mateo Gašparović ◽

Alan Novkinić ◽

Damir Klobučar

Keyword(s):

Machine Learning ◽

Forest Canopy ◽

Vegetation Index ◽

Predictive Performance ◽

Spatial Extent ◽

Gradient Boosting ◽

Support Vector ◽

Extreme Gradient Boosting ◽

Sentinel 2

The presented study demonstrates a bi-sensor approach suitable for rapid, precise, and up-to-date mapping of forest canopy gaps over a larger spatial extent. The approach makes use of Unmanned Aerial Vehicle (UAV) red, green and blue (RGB) images on smaller areas for highly precise forest canopy mask creation. Sentinel-2 was used as a scaling platform for transferring information from the UAV to a wider spatial extent. Various approaches to improving the predictive performance were examined: (I) the highest R² of the single satellite index was 0.57, (II) the highest R² using multiple features obtained from the single-date S-2 image was 0.624, and (III) the highest R² on the multitemporal set of S-2 images was 0.697. Satellite indices such as Atmospherically Resistant Vegetation Index (ARVI), Infrared Percentage Vegetation Index (IPVI), Normalized Difference Index (NDI45), Pigment-Specific Simple Ratio Index (PSSRa), Modified Chlorophyll Absorption Ratio Index (MCARI), Color Index (CI), Redness Index (RI), and Normalized Difference Turbidity Index (NDTI) were the dominant predictors in most of the Machine Learning (ML) algorithms. The more complex ML algorithms, such as the Support Vector Machines (SVM), Random Forest (RF), Stochastic Gradient Boosting (GBM), Extreme Gradient Boosting (XGBoost), and CatBoost, provided the best performance on the training set but exhibited weaker generalization capabilities. Therefore, a simpler and more robust Elastic Net (ENET) algorithm was chosen for the final map creation.

Reliable photometric membership (RPM) of galaxies in clusters – I. A machine learning method and its performance in the local universe

Monthly Notices of the Royal Astronomical Society ◽

10.1093/mnras/staa486 ◽

2020 ◽

Vol 493 (3) ◽

pp. 3429-3441

Author(s):

Paulo A A Lopes ◽

André L B Ribeiro

Keyword(s):

Machine Learning ◽

Galaxy Evolution ◽

Large Scale ◽

Machine Learning Techniques ◽

Gradient Boosting ◽

Support Vector ◽

Validation Data ◽

Membership Probability ◽

Cluster Membership ◽

We introduce a new method to determine galaxy cluster membership based solely on photometric properties. We adopt a machine learning approach to recover a cluster membership probability from galaxy photometric parameters and finally derive a membership classification. After testing several machine learning techniques (such as stochastic gradient boosting, model averaged neural network and k-nearest neighbours), we found the support vector machine algorithm to perform better when applied to our data. Our training and validation data are from the Sloan Digital Sky Survey main sample. Hence, to be complete to $M_r^* + 3$, we limit our work to 30 clusters with $z_{\mathrm{phot\text{-}cl}} \le 0.045$. Masses ($M_{200}$) are larger than $\sim 0.6\times 10^{14} \, \mathrm{M}_{\odot}$ (most above $3\times 10^{14} \, \mathrm{M}_{\odot}$). Our results are derived taking into account all galaxies in the line of sight of each cluster, with no photometric redshift cuts or background corrections. Our method is non-parametric, making no assumptions on the number density or luminosity profiles of galaxies in clusters. Our approach delivers extremely accurate results (completeness, C $\sim 92$ per cent and purity, P $\sim 87$ per cent) within $R_{200}$, so that we named our code reliable photometric membership. We discuss possible dependencies on magnitude, colour, and cluster mass. Finally, we present some applications of our method, stressing its impact on galaxy evolution and cosmological studies based on future large-scale surveys, such as eROSITA, EUCLID, and LSST.
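Completeness and purity, as quoted in the abstract above, are usually understood as the fraction of true members that are recovered and the fraction of selected galaxies that are true members. The sketch below assumes those definitions; the function name and the boolean flags are hypothetical and do not come from the paper's code.

```python
def completeness_and_purity(is_member, is_selected):
    """Return (completeness, purity) for a membership classification.

    completeness = true members selected / all true members
    purity       = true members selected / all selected galaxies
    """
    true_selected = sum(1 for m, s in zip(is_member, is_selected) if m and s)
    n_members = sum(is_member)
    n_selected = sum(is_selected)
    return true_selected / n_members, true_selected / n_selected


# Hypothetical line-of-sight sample: 4 true members, 4 interlopers.
is_member = [True, True, True, True, False, False, False, False]
is_selected = [True, True, True, False, True, True, False, False]

completeness, purity = completeness_and_purity(is_member, is_selected)
print(f"C = {completeness:.2f}, P = {purity:.2f}")
```

Under these definitions a classifier can trade one quantity for the other, which is why both are reported together.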

Predict Health Insurance Cost by using Machine Learning and DNN Regression Models

International Journal of Innovative Technology and Exploring Engineering - Special Issue ◽

10.35940/ijitee.c8364.0110321 ◽

2021 ◽

Vol 10 (2) ◽

pp. 137-143

Author(s):

Mohamed Hanafy ◽

Omar M. A. Mahmoud

Keyword(s):

Machine Learning ◽

Insurance Industry ◽

Additive Model ◽

Policy Formulation ◽

Stochastic Gradient ◽

Gradient Boosting ◽

Support Vector ◽

K Nearest Neighbors ◽

Insurance Cost

Insurance is a policy that eliminates or reduces the cost of losses incurred from various risks. Various factors influence the cost of insurance, and these factors contribute to the formulation of insurance policies. Machine learning (ML) can make the wording of insurance policies in the insurance industry more efficient. This study demonstrates how different regression models can forecast insurance costs. We compare the results of several models: Multiple Linear Regression, Generalized Additive Model, Support Vector Machine, Random Forest Regressor, CART, XGBoost, k-Nearest Neighbors, Stochastic Gradient Boosting, and Deep Neural Network. The best-performing approach was the Stochastic Gradient Boosting model, with an MAE value of 0.17448, an RMSE value of 0.38018, and an R-squared value of 85.8295.

Classifier Selection for the Prediction of Dominant Transmission Mode of Coronavirus Within Localities

International Journal of E-Health and Medical Communications ◽

10.4018/ijehmc.20211101.oa1 ◽

2021 ◽

Vol 12 (6) ◽

pp. 1-12

Author(s):

Donald Douglas Atsa'am ◽

Ruth Wario

Keyword(s):

Predictive Accuracy ◽

Multinomial Logistic Regression ◽

Geographic Area ◽

Stochastic Gradient ◽

Transmission Mode ◽

Gradient Boosting ◽

Support Vector ◽

Linear Discriminant ◽

Classifier Selection ◽

The coronavirus disease-2019 (COVID-19) pandemic is an ongoing concern that requires research in all disciplines to tame its spread. Nine classification algorithms were evaluated to identify the most appropriate one for predicting the prevalent COVID-19 transmission mode in a geographic area: multinomial logistic regression, k-nearest neighbour, support vector machines, linear discriminant analysis, naïve Bayes, C5.0, bagged classification and regression trees, random forest, and stochastic gradient boosting. Five COVID-19 datasets were employed for classification. Predictive accuracy was determined using 10-fold cross validation with three repeats. Friedman’s test was conducted, and the outcome showed that the performance of the algorithms differed significantly. Stochastic gradient boosting yielded the highest predictive accuracy, 81%. This finding should be valuable to health informaticians, health analysts and others when deciding which machine learning tool to adopt in efforts to detect the dominant transmission mode of the virus within localities.
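The evaluation scheme described above, 10-fold cross validation with three repeats, can be sketched as an index generator. This is an illustrative sketch only: the function name, the seed, and the sample count of 50 are assumptions, and the study's own tooling is not stated in the abstract.

```python
import random


def repeated_kfold(n_samples, n_folds=10, n_repeats=3, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for repeated k-fold CV.

    Each repeat reshuffles the sample indices, then every fold serves
    exactly once as the held-out test set.
    """
    rng = random.Random(seed)
    for repeat in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        for fold in range(n_folds):
            test_idx = indices[fold::n_folds]  # every n_folds-th index
            held_out = set(test_idx)
            train_idx = [i for i in indices if i not in held_out]
            yield repeat, fold, train_idx, test_idx


# Hypothetical dataset of 50 records: 10 folds x 3 repeats = 30 splits.
splits = list(repeated_kfold(n_samples=50))
print(f"{len(splits)} train/test splits generated")
```

A classifier would be fitted on each `train_idx` and scored on the matching `test_idx`, and the resulting accuracy values per algorithm are what a test such as Friedman's would then compare.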

Predicting Safe Parking Spaces: A Machine Learning Approach to Geospatial Urban and Crime Data

Sustainability ◽

10.3390/su11102848 ◽

2019 ◽

Vol 11 (10) ◽

pp. 2848 ◽

Cited By ~ 1

Author(s):

Irina Matijosaitiene ◽

Anthony McDowald ◽

Vishal Juneja

Keyword(s):

Machine Learning ◽

Linear Regression ◽

Prediction Model ◽

Linear Models ◽

Hot Spot ◽

Elastic Net ◽

Motor Vehicles ◽

Gradient Boosting ◽

Support Vector ◽

This research aims to identify spatial and time patterns of theft in Manhattan, NY, to reveal urban factors that contribute to thefts from motor vehicles, and to build a prediction model for thefts. Methods include time series and hot spot analysis, linear regression, elastic-net, support vector machines (SVM) with radial and linear kernels, decision tree, bagged CART, random forest, and stochastic gradient boosting. Machine learning methods reveal that linear models (linear regression, elastic-net) perform better on our data, indicating that a higher number of subway entrances, graffiti, and restaurants on streets contribute to higher theft rates from motor vehicles. Although the prediction model for thefts meets almost all assumptions (five of six), its accuracy is 77%, suggesting that there are other undiscovered factors contributing to thefts. As an output demonstrating the final results, an application prototype for searching for safer parking in Manhattan, NY, based on the prediction model, has been developed.

A Study on Data Pre-Processing and Accident Prediction Modelling for Occupational Accident Analysis in the Construction Industry

Applied Sciences ◽

10.3390/app10217949 ◽

2020 ◽

Vol 10 (21) ◽

pp. 7949

Author(s):

Jae Yun Lee ◽

Young Geun Yoon ◽

Tae Keun Oh ◽

Seunghee Park ◽

Sang Il Ryu

Keyword(s):

Machine Learning ◽

Construction Industry ◽

Flow Diagram ◽

Occupational Accidents ◽

Support Vector ◽

Occupational Accident ◽

Accident Prediction ◽

Processing Procedure ◽

Accident Data ◽

Prediction Modelling

In the construction industry, it is difficult to predict occupational accidents because various accident characteristics arise simultaneously and organically in different types of work. Furthermore, even when analyzing occupational accident data, it is difficult to deduce meaningful results because the data recorded by the incident investigator are qualitative and include a wide variety of data types and categories. Recently, numerous studies have used machine learning to analyze the correlations in such complex construction accident data; however, heretofore the focus has been on predicting severity with various variables, and several limitations remain when deriving the correlations between features from various variables. Thus, this paper proposes a data processing procedure that can efficiently manipulate accident data using optimal machine learning techniques and derive and systematize meaningful variables to rationally approach such complex problems. In particular, among the various variables, the most influential variables are derived through methods such as clustering, chi-square, Cramer’s V, and predictor importance; then, the analysis is simplified by optimally grouping the variables. For accident data with optimal variables and elements, a predictive model is constructed between variables, using a support vector machine and decision-tree-based ensemble; then, the correlation between the dependent and independent variables is analyzed through an alluvial flow diagram for several cases. Therefore, a new processing procedure has been introduced in data preprocessing and accident prediction modelling to overcome difficulties from complex and diverse construction occupational accident data, and effective accident prevention is possible by deriving correlations of construction accidents using this process.

Classifier Selection for the Prediction of Dominant Transmission Mode of Coronavirus within Localities

International Journal of E-Health and Medical Communications ◽

10.4018/ijehmc.20211101oa02 ◽

2021 ◽

Vol 12 (6) ◽

pp. 0-0

Keyword(s):

Predictive Accuracy ◽

Multinomial Logistic Regression ◽

Geographic Area ◽

Stochastic Gradient ◽

Transmission Mode ◽

Gradient Boosting ◽

Support Vector ◽

Linear Discriminant ◽

Classifier Selection ◽

Prediction of E. coli Concentrations in Agricultural Pond Waters: Application and Comparison of Machine Learning Algorithms

Frontiers in Artificial Intelligence ◽

10.3389/frai.2021.768650 ◽

2022 ◽

Vol 4 ◽

Author(s):

Matthew D. Stocker ◽

Yakov A. Pachepsky ◽

Robert L. Hill

Keyword(s):

Machine Learning ◽

Water Quality ◽

Quality Parameters ◽

Machine Learning Algorithms ◽

Water Quality Parameters ◽

Gradient Boosting ◽

Support Vector ◽

E Coli ◽

Significant Difference

The microbial quality of irrigation water is an important issue as the use of contaminated waters has been linked to several foodborne outbreaks. To expedite microbial water quality determinations, many researchers estimate concentrations of the microbial contamination indicator Escherichia coli (E. coli) from the concentrations of physiochemical water quality parameters. However, these relationships are often non-linear and exhibit changes above or below certain threshold values. Machine learning (ML) algorithms have been shown to make accurate predictions in datasets with complex relationships. The purpose of this work was to evaluate several ML models for the prediction of E. coli in agricultural pond waters. Two ponds in Maryland were monitored from 2016 to 2018 during the irrigation season. E. coli concentrations along with 12 other water quality parameters were measured in water samples. The resulting datasets were used to predict E. coli using stochastic gradient boosting (SGB) machines, random forest (RF), support vector machines (SVM), and k-nearest neighbor (kNN) algorithms. The RF model provided the lowest RMSE value for predicted E. coli concentrations in both ponds in individual years and over consecutive years in almost all cases. For individual years, the RMSE of the predicted E. coli concentrations (log10 CFU 100 ml−1) ranged from 0.244 to 0.346 and 0.304 to 0.418 for Pond 1 and 2, respectively. For the 3-year datasets, these values were 0.334 and 0.381 for Pond 1 and 2, respectively. In most cases there was no significant difference (P > 0.05) between the RMSE of RF and other ML models when these RMSE were treated as statistics derived from 10-fold cross-validation performed with five repeats. Important E. coli predictors were turbidity, dissolved organic matter content, specific conductance, chlorophyll concentration, and temperature. Model predictive performance did not significantly differ when 5 predictors were used vs. 8 or 12, indicating that more tedious and costly measurements provide no substantial improvement in the predictive accuracy of the evaluated algorithms.

Machine learning in predicting immediate and long-term outcomes of myocardial revascularization: a systematic review

Russian Journal of Cardiology ◽

10.15829/1560-4071-2021-4505 ◽

2021 ◽

Vol 26 (8) ◽

pp. 4505

Author(s):

B. I. Geltser ◽

V. Yu. Rublev ◽

M. M. Tsivanyuk ◽

K. I. Shakhgeldyan

Keyword(s):

Machine Learning ◽

Systematic Review ◽

Artery Bypass ◽

Myocardial Revascularization ◽

Clinical Decision Support Systems ◽

Coronary Intervention ◽

Gradient Boosting ◽

Support Vector ◽

Machine learning (ML) is among the main tools of artificial intelligence and is increasingly used in population and clinical cardiology to stratify cardiovascular risk. The systematic review presents an analysis of the literature on using various ML methods (artificial neural networks, random forest, stochastic gradient boosting, support vector machines, etc.) to develop predictive models determining the immediate and long-term risk of adverse events after coronary artery bypass grafting and percutaneous coronary intervention. Most of the research on this issue is focused on the creation of novel forecast models with a higher predictive value. It is emphasized that the improvement of modeling technologies and the development of clinical decision support systems are among the most promising areas of healthcare digitalization and are in demand in everyday professional activities.