Machine Learning Classification Models for More Effective Mine Safety Inspections

Author(s):  
Jeremy M. Gernand

The safety of mining in the United States has improved significantly over the past few decades, although mining remains one of the more dangerous occupations. Following the Sago mine disaster in January 2006, federal legislation (the Mine Improvement and New Emergency Response [MINER] Act of 2006) tightened regulations and sought to strengthen the authority and safety-inspection practices of the Mine Safety and Health Administration (MSHA). While penalties and inspection frequency have increased, understanding of which types of inspection findings are most indicative of serious future incidents remains limited. The most effective safety management and oversight would require a thorough understanding of which infractions or safety inspection findings best predict serious future personnel injuries. However, given the large number of potentially unique inspection findings, varied mine characteristics, and specific incident types, the question involves a very large set of potentially relevant input parameters. New regulations rely on increasing the frequency and severity of infraction penalties to encourage mining operations to improve worker safety, but without knowledge of which specific infractions truly signal a dangerous work environment. This paper seeks to inform the question: What types of inspection findings are most indicative of serious future incidents for specific types of mining operations? The analysis utilizes publicly available MSHA databases of cited infractions and reportable incidents. These inspection results are used to train machine learning Classification and Regression Tree (CART) and Random Forest (RF) models that divide mines into peer groups based on their recent infractions and other defining characteristics, with the aim of predicting whether a fatal or serious disabling injury is likely to occur in the following 12-month period.
With these characteristics available, additional scrutiny may be appropriately directed at those mining operations at greatest risk of experiencing a worker fatality or disabling injury in the near future. Increased oversight and attention on these mines where workers are at greatest risk may more effectively reduce the likelihood of worker deaths and injuries than increased penalties and inspection frequency alone.
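As a rough sketch of the modeling setup described above, the snippet below trains a Random Forest classifier to flag higher-risk operations; the feature set, counts, and label rule are invented stand-ins for illustration, not the study's actual MSHA inputs.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
# Hypothetical per-mine-year features: counts of three infraction types
# plus employment size (all invented for illustration).
X = rng.poisson(lam=[3.0, 1.0, 2.0, 50.0], size=(n, 4)).astype(float)
# Synthetic label: serious injury in the following 12 months, loosely
# tied to the first two infraction counts.
y = (X[:, 0] + 2.0 * X[:, 1] + rng.normal(0.0, 2.0, n) > 6.0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
acc = rf.score(X_te, y_te)
# Importance scores suggest which infraction types drive predicted risk.
importances = rf.feature_importances_
```

In practice the importance scores, rather than raw accuracy, are what would direct additional inspection scrutiny toward specific infraction types.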



Author(s):  
Anurag Yedla, Fatemeh Davoudi Kakhki, Ali Jannesari

Mining is known to be one of the most hazardous occupations in the world, and many serious accidents have occurred at mining sites worldwide over the years. Although there have been efforts to create a safer work environment for miners, the number of accidents occurring at mining sites is still significant. Machine learning techniques and predictive analytics are becoming leading resources for creating safer work environments in the manufacturing and construction industries, where they are leveraged to generate actionable insights that improve decision-making. A large amount of mining safety-related data is available, and machine learning algorithms can be used to analyze it, which can significantly benefit the mining industry. In this study, decision tree, random forest, and artificial neural network models were implemented to analyze the outcomes of mining accidents and to predict days away from work. An accident dataset provided by the Mine Safety and Health Administration (MSHA) was used to train the models, which were trained separately on tabular data and on injury narratives. The use of a synthetic data augmentation technique based on word embeddings was also investigated to tackle the data imbalance problem. The performance of all models was compared with that of a traditional logistic regression model. The results show that models trained on narratives outperformed models trained on structured/tabular data in predicting the outcome of the accident; this higher predictive power suggests that the narratives carry information relevant to the injury outcome beyond what the tabular entries capture. The models trained on tabular data had a lower mean squared error than the models trained on narratives when predicting days away from work.
The results highlight the importance of predictors such as shift start time, accident time, and mining experience in predicting days away from work. The F1 score of all but one of the underrepresented classes improved after applying the data augmentation technique. This approach gave greater insight into the factors influencing the outcome of an accident and days away from work.
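A minimal sketch of the narrative-based classification idea, using toy narratives with a TF-IDF plus random forest pipeline; TF-IDF is a substitute here for whatever text features the study actually used, and the narratives and outcome labels are invented.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Toy injury narratives and accident outcomes (invented for illustration).
narratives = [
    "employee slipped on wet walkway and twisted ankle",
    "miner struck by falling rock while bolting roof",
    "operator caught hand in conveyor belt pinch point",
    "worker fell from ladder while changing a light",
    "rock fell from the rib and struck miner on the shoulder",
    "hand pulled into moving belt during maintenance",
]
outcomes = ["fall", "fall_of_ground", "machinery",
            "fall", "fall_of_ground", "machinery"]

# Narrative model: TF-IDF text features feeding a random forest.
text_model = make_pipeline(TfidfVectorizer(), RandomForestClassifier(random_state=0))
text_model.fit(narratives, outcomes)
pred = text_model.predict(["glove pulled into moving belt during repair"])[0]
```

The tabular counterpart would be the same classifier fit on coded fields (shift start time, accident time, experience) instead of the vectorized text.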


2020, pp. 009102602097756
Author(s):
In Gu Kang, Ben Croft, Barbara A. Bichelmeyer

This study aims to identify important predictors of turnover intention and to characterize subgroups of U.S. federal employees at high risk for turnover intention. Data were drawn from the 2018 Federal Employee Viewpoint Survey (FEVS, unweighted N = 598,003), a nationally representative sample of U.S. federal employees. Machine learning Classification and Regression Tree (CART) analyses, which accounted for sample weights, were conducted to predict turnover intention. The CART analyses identified six at-risk subgroups. Predictor importance scores showed that job satisfaction was the strongest predictor of turnover intention, followed by satisfaction with the organization, loyalty, accomplishment, involvement in decisions, liking of the job, satisfaction with promotion opportunities, skill development opportunities, organizational tenure, and pay satisfaction. Consequently, Human Resource (HR) departments should implement comprehensive HR practices to enhance employees' perceptions of job satisfaction, workplace environments and systems, and favorable organizational policies and supports, and should tailor interventions for the at-risk subgroups.
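The weighted-CART-with-importances setup can be sketched as follows, with invented survey items and synthetic responses standing in for FEVS data; only the mechanics (sample weights, importance scores) mirror the study.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
n = 2000
# Hypothetical 1-5 survey items: job satisfaction, organization
# satisfaction, loyalty (invented stand-ins for FEVS items).
X = rng.integers(1, 6, size=(n, 3)).astype(float)
# Synthetic turnover intention, driven mainly by the first item.
y = (X[:, 0] + rng.normal(0.0, 1.0, n) < 2.5).astype(int)
weights = rng.uniform(0.5, 1.5, n)  # stand-in for FEVS sample weights

tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X, y, sample_weight=weights)  # CART accounting for survey weights
imp = tree.feature_importances_  # predictor importance scores
```

The leaves of a shallow tree like this are exactly the kind of at-risk subgroups the study reports, each defined by a small set of item thresholds.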


Minerals, 2021, Vol 11 (7), pp. 776
Author(s):
Rajive Ganguli, Preston Miller, Rambabu Pothina

To achieve the goal of preventing serious injuries and fatalities, it is important for a mine site to analyze site-specific mine safety data. Advances in natural language processing (NLP) create an opportunity to develop machine learning (ML) tools that automate analysis of mine health and safety management system (HSMS) data without requiring experts at every mine site. As a demonstration, nine random forest (RF) models were developed to classify narratives from the Mine Safety and Health Administration (MSHA) database into nine accident types. MSHA accident categories are quite descriptive and thus serve as a proxy for a high-level understanding of the incidents. A model trained to recognize a single category (one binary model per category) was more effective than a single model that classified narratives into all nine categories. The developed models were then applied to narratives taken from a mine HSMS (non-MSHA) to classify them into MSHA accident categories. About two thirds of the non-MSHA narratives were automatically classified by the RF models, and manual evaluation of these automated classifications showed an accuracy of 96%. The near-perfect classification of non-MSHA narratives by MSHA-based machine learning models demonstrates that NLP can be a powerful tool for analyzing HSMS data.
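The one-binary-model-per-category scheme can be sketched as below; the narratives, category names, and confidence-based tie-breaking are illustrative assumptions, while the real work used MSHA narratives and nine accident types.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Toy narratives and accident types (invented for illustration).
narratives = [
    "employee slipped on wet walkway and twisted ankle",
    "worker fell from ladder while changing a light",
    "miner struck by falling rock while bolting roof",
    "rock fell from the rib and struck miner on the shoulder",
    "operator caught hand in conveyor belt pinch point",
    "hand pulled into moving belt during maintenance",
]
labels = ["slip_or_fall", "slip_or_fall", "fall_of_ground",
          "fall_of_ground", "machinery", "machinery"]
categories = sorted(set(labels))

# One binary yes/no random forest per accident category.
models = {}
for cat in categories:
    y = [int(lbl == cat) for lbl in labels]
    models[cat] = make_pipeline(
        TfidfVectorizer(), RandomForestClassifier(random_state=0)
    ).fit(narratives, y)

def classify(text):
    # Assign the category whose binary model is most confident.
    scores = {c: m.predict_proba([text])[0][-1] for c, m in models.items()}
    return max(scores, key=scores.get)
```

Leaving a narrative unassigned when no binary model is confident would reproduce the study's behavior of automatically classifying only about two thirds of the non-MSHA narratives.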


2021, Vol 11 (23), pp. 11227
Author(s):
Arnold Kamis, Yudan Ding, Zhenzhen Qu, Chenchen Zhang

The purpose of this paper is to model the cases of COVID-19 in the United States from 13 March 2020 to 31 May 2020. Our novel contribution is that we have obtained highly accurate models focused on two different regimes, lockdown and reopen, modeling each regime separately. The predictor variables include aggregated individual movement as well as state population density, health rank, climate temperature, and political color. We apply a variety of machine learning methods to each regime: Multiple Regression, Ridge Regression, Elastic Net Regression, Generalized Additive Model, Gradient Boosted Machine, Regression Tree, Neural Network, and Random Forest. We discover that Gradient Boosted Machines are the most accurate in both regimes. The best models achieve a variance explained of 95.2% in the lockdown regime and 99.2% in the reopen regime. We describe the influence of the predictor variables as they change from regime to regime. Notably, we identify individual person movement, as tracked by GPS data, to be an important predictor variable. We conclude that government lockdowns are an extremely important de-densification strategy. Implications and questions for future research are discussed.
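The per-regime gradient boosting fit can be sketched as follows, with synthetic predictors standing in for the paper's mobility and state-level variables; reading variance explained as in-sample R² is a simplification of the paper's evaluation.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(2)

def make_regime(n, slope):
    # Hypothetical standardized predictors: movement index, population
    # density, temperature (all invented).
    X = rng.normal(size=(n, 3))
    y = slope * X[:, 0] + X[:, 1] + rng.normal(0.0, 0.3, n)
    return X, y

# Fit one model per regime, mirroring the lockdown/reopen split.
models, r2 = {}, {}
for regime, slope in [("lockdown", 3.0), ("reopen", 1.5)]:
    X, y = make_regime(400, slope)
    models[regime] = GradientBoostingRegressor(random_state=0).fit(X, y)
    r2[regime] = models[regime].score(X, y)  # variance explained
```

Fitting each regime separately lets the boosted trees learn regime-specific predictor effects rather than averaging across the regime change.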


2021, Vol 9
Author(s):
Manish Pandey, Aman Arora, Alireza Arabameri, Romulus Costache, Naveen Kumar, et al.

This study developed a new ensemble model and tested another ensemble model for flood susceptibility mapping in the Middle Ganga Plain (MGP). The results of the two models were quantitatively compared for performance in zoning flood-susceptible areas of the low-altitude, humid subtropical fluvial floodplain environment of the MGP. This part of the MGP, located in the central Ganga River Basin (GRB), is experiencing worsening floods under a changing climate, causing growing losses of life and property. With its monsoonal humid subtropical climate, ground subsidence induced by active tectonics, increasing population, and shifting land-use/land-cover trends and patterns, the MGP is a natural laboratory for testing susceptibility prediction models, with the goal of model universality: identifying the best-performing model for this type of topoclimatic setting given a constant number and type of input variables. Based on a highly accurate flood inventory and using 12 flood predictors (FPs), selected from field experience of the study area and a literature survey, two machine learning (ML) ensemble models built by bagging frequency ratio (FR) and evidential belief function (EBF) with classification and regression tree (CART), namely CART-FR and CART-EBF, were applied for flood susceptibility zonation mapping. Flood and non-flood points randomly generated from the flood inventory were apportioned in a 70:30 ratio for training and validation of the ensembles.
Based on evaluation using a threshold-independent statistic, the area under the receiver operating characteristic (AUROC) curve, 14 threshold-dependent evaluation metrics, and the seed cell area index (SCAI), each assessing different aspects of the ensembles, the study suggests that CART-EBF (AUCSR = 0.843; AUCPR = 0.819) performed better than CART-FR (AUCSR = 0.828; AUCPR = 0.802). The variability in the performance of these novel ensembles, and their comparison with results of other published models, supports the need to test these and other genres of susceptibility models in other topoclimatic environments. The results of this study are important for natural hazard managers and can be used to compute damages through risk analysis.
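The 70:30 apportionment and AUROC evaluation can be sketched with a generic bagged-CART ensemble on synthetic points; the FR and EBF conditioning steps of the actual CART-FR and CART-EBF ensembles are omitted here, and the predictors are invented.

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(3)
n = 600
# Hypothetical flood predictors (e.g., elevation, slope, distance to
# river, rainfall), all synthetic.
X = rng.normal(size=(n, 4))
# Synthetic flood / non-flood labels.
y = (X[:, 0] - X[:, 1] + rng.normal(0.0, 0.5, n) > 0).astype(int)

# 70:30 apportionment for training and validation, as in the study.
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0)
ens = BaggingClassifier(DecisionTreeClassifier(max_depth=4),
                        n_estimators=50, random_state=0).fit(X_tr, y_tr)
# Threshold-independent evaluation via AUROC on the validation split.
auc = roc_auc_score(y_va, ens.predict_proba(X_va)[:, 1])
```

The validation-split AUROC here plays the role of the study's AUCPR (prediction-rate) figure; the training-split counterpart would correspond to AUCSR.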


2019, Vol 18 (05), pp. 1579-1603
Author(s):
Zhijiang Wan, Hao Zhang, Jiajin Huang, Haiyan Zhou, Jie Yang, et al.

Many studies have developed machine learning methods for discriminating between Major Depressive Disorder (MDD) patients and normal controls based on multi-channel electroencephalogram (EEG) data, but few have considered using single-channel EEG collected from the forehead scalp to discriminate MDD. Here, an EEG dataset was collected from the Fp1 and Fp2 electrodes of a 32-channel EEG system. The results demonstrate that classification performance based on EEG from the Fp1 location exceeds that based on the Fp2 location, and show that single-channel EEG analysis can discriminate MDD at a level comparable to multi-channel EEG analysis. Furthermore, a portable EEG device collecting the signal from the Fp1 location was used to collect a second dataset. A Classification and Regression Tree (CART) combined with a genetic algorithm (GA) achieved the highest accuracy of 86.67% under leave-one-participant-out cross-validation, which shows that single-channel EEG-based machine learning is promising for supporting MDD prescreening applications.
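Leave-one-participant-out evaluation can be sketched with scikit-learn's LeaveOneGroupOut splitter; the features below are synthetic stand-ins for single-channel EEG measures, and the GA feature-selection step is omitted.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(4)
n_participants, trials = 15, 20
# Synthetic per-trial features standing in for single-channel EEG
# measures (e.g., band powers from Fp1).
X = rng.normal(size=(n_participants * trials, 5))
groups = np.repeat(np.arange(n_participants), trials)
# Synthetic MDD / control labels tied to the first feature.
y = (X[:, 0] + rng.normal(0.0, 1.0, len(X)) > 0).astype(int)

cart = DecisionTreeClassifier(random_state=0)
# One fold per held-out participant, so no subject leaks between
# training and test sets.
scores = cross_val_score(cart, X, y, groups=groups, cv=LeaveOneGroupOut())
mean_acc = scores.mean()
```

Grouping folds by participant, rather than by trial, is what makes the reported accuracy an estimate of performance on unseen people.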


10.2196/18910, 2020, Vol 8 (7), pp. e18910
Author(s):
Debbie Rankin, Michaela Black, Raymond Bond, Jonathan Wallace, Maurice Mulvenna, et al.

Background: The exploitation of synthetic data in health care is at an early stage. Synthetic data could unlock the potential within health care datasets that are too sensitive for release. Several synthetic data generators have been developed to date; however, studies evaluating their efficacy and generalizability are scarce.
Objective: This work sets out to understand the difference in performance of supervised machine learning models trained on synthetic data compared with those trained on real data.
Methods: A total of 19 open health datasets were selected for experimental work. Synthetic data were generated using three synthetic data generators that apply classification and regression tree, parametric, and Bayesian network approaches. Real and synthetic data were used (separately) to train five supervised machine learning models: stochastic gradient descent, decision tree, k-nearest neighbors, random forest, and support vector machine. Models were tested only on real data to determine whether a model developed by training on synthetic data can be used to accurately classify new, real examples. The impact of statistical disclosure control on model performance was also assessed.
Results: A total of 92% of models trained on synthetic data have lower accuracy than those trained on real data. Tree-based models trained on synthetic data have deviations in accuracy from models trained on real data of 0.177 (18%) to 0.193 (19%), while other models have lower deviations of 0.058 (6%) to 0.072 (7%). The winning classifier when trained and tested on real data versus models trained on synthetic data and tested on real data is the same in 26% (5/19) of cases for classification and regression tree and parametric synthetic data, and in 21% (4/19) of cases for Bayesian network-generated synthetic data. Tree-based models perform best with real data and are the winning classifier in 95% (18/19) of cases; this is not the case for models trained on synthetic data. When tree-based models are not considered, the winning classifier for real and synthetic data is matched in 74% (14/19), 53% (10/19), and 68% (13/19) of cases for classification and regression tree, parametric, and Bayesian network synthetic data, respectively. Statistical disclosure control methods did not have a notable impact on data utility.
Conclusions: The results of this study are promising, with small decreases in accuracy observed in models trained with synthetic data compared with models trained with real data, where both are tested on real data. Such deviations are expected and manageable. Tree-based classifiers have some sensitivity to synthetic data, and the underlying cause requires further investigation. This study highlights the potential of synthetic data and the need for further evaluation of their robustness. Synthetic data must ensure individual privacy and data utility are preserved in order to instill confidence in health care departments when using such data to inform policy decision-making.
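The train-on-synthetic, test-on-real protocol can be sketched as follows; a simple per-class Gaussian resampler stands in for the study's synthetic data generators, and the data and accuracy gap are illustrative only.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(5)
# "Real" data with simple structure: the label is the sign of feature 0.
X_real = rng.normal(size=(500, 3))
y_real = (X_real[:, 0] > 0).astype(int)

# Crude per-class Gaussian resampler standing in for a synthetic data
# generator: sample each class from its fitted mean and spread.
X_syn_parts, y_syn_parts = [], []
for cls in (0, 1):
    Xc = X_real[y_real == cls]
    X_syn_parts.append(rng.normal(Xc.mean(axis=0), Xc.std(axis=0), size=Xc.shape))
    y_syn_parts.append(np.full(len(Xc), cls))
X_syn = np.vstack(X_syn_parts)
y_syn = np.concatenate(y_syn_parts)

# Train on synthetic data, test on real data -- the paper's protocol --
# and compare against training on the real data itself.
acc_syn = DecisionTreeClassifier(random_state=0).fit(X_syn, y_syn).score(X_real, y_real)
acc_real = DecisionTreeClassifier(random_state=0).fit(X_real, y_real).score(X_real, y_real)
```

The difference between acc_real and acc_syn is the kind of accuracy deviation the study tabulates across its 19 datasets and three generators.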


2021
Author(s):
Younes Shekarian, Elham Rahimi, Naser Shekarian, Mohammad Rezaee, Pedram Roghanchi

In the United States, an unexpected and severe increase in coal miners' lung diseases in the late 1990s prompted researchers to investigate the causes of the disease resurgence. This study aims to scrutinize the effects of various mining parameters, including coal rank, mine size, mining method, coal seam height, and geographical location, on the prevalence of coal workers' pneumoconiosis (CWP) in surface and underground coal mines. A comprehensive dataset was created using the U.S. Mine Safety and Health Administration (MSHA) Employment and Accident/Injury databases, merged on mine ID using SQL data management software. A total of 123,643 mine-year observations were included in the statistical analysis. A generalized estimating equation (GEE) model was used to conduct statistical analyses on 29,707 and 32,643 mine-year observations for underground and surface coal mines, respectively. The results of this econometric approach revealed that workers in underground coal mines are at greater risk of CWP than those in surface coal operations. Furthermore, underground coal mines in the Appalachian and Interior regions are at higher risk of CWP prevalence than those in the Western region, and surface coal miners in the Appalachian region are more susceptible to CWP than miners in the Western region. The analysis also indicated that coal workers in smaller mines are more vulnerable to CWP than those in larger mines, and that workers in thin-seam underground mine operations are more likely to develop CWP.

