Predicting employee attrition using tree-based models

2020 ◽  
Vol 28 (6) ◽  
pp. 1273-1291
Author(s):  
Nesreen El-Rayes ◽  
Ming Fang ◽  
Michael Smith ◽  
Stephen M. Taylor

Purpose The purpose of this study is to develop tree-based binary classification models to predict the likelihood of employee attrition based on firm cultural and management attributes. Design/methodology/approach A data set of resumes anonymously submitted through Glassdoor’s online portal is used in tandem with public company review information to fit decision tree, random forest and gradient boosted tree models to predict the probability of an employee leaving a firm during a job transition. Findings Random forest and decision tree methods are found to be the strongest attrition prediction models. In addition, compensation, company culture and senior management performance play a primary role in an employee’s decision to leave a firm. Practical implications This study may be used by human resources staff to better understand factors which influence employee attrition. In addition, techniques developed in this study may be applied to company-specific data sets to construct customized attrition models. Originality/value This study contains several novel contributions, including exploratory studies such as industry job transition percentages, distributional comparisons of factors strongly contributing to employee attrition between employees who left and those who stayed with the firm, and the first comprehensive search over binary classification models to identify which provides the strongest predictive performance of employee attrition.
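
The modeling approach described, fitting decision tree, random forest and gradient boosted tree classifiers to tabular firm-attribute data, might be sketched as below; the feature names and synthetic data are illustrative assumptions, not the authors' Glassdoor data set.

```python
# Minimal sketch of the tree-based attrition classifiers described above.
# The synthetic features (rating, culture, management, salary change) are
# illustrative assumptions, not the paper's Glassdoor data.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))          # e.g. overall rating, culture, senior mgmt, salary change
y = (X[:, 0] + 0.5 * X[:, 3] + rng.normal(size=1000) < 0).astype(int)  # 1 = left the firm

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
models = {
    "decision tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "gradient boosting": GradientBoostingClassifier(random_state=0),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(name, "test accuracy:", round(model.score(X_te, y_te), 3))
```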

2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  

Purpose The purpose of this study is to develop tree-based binary classification models to predict the likelihood of employee attrition based on firm cultural and management attributes. Design/methodology/approach After preliminary analysis, the authors tested three hypotheses. The first was: “Conditioned on an increase in salary, the magnitude of increase enhances the likelihood that an employee will leave their current firm despite differences between the old and new firm culture.” The second was: “Employees whose original firm has an overall rating greater than the 75th percentile that was also founded before 1900 are more likely to stay…” The third was: “Employees that maintain a low overall original firm rating are more likely to leave their firm upon a job transition, whereas those with higher overall ratings have a greater chance of remaining.” Findings After analyzing thousands of online resumes submitted to Glassdoor’s portal, the authors found that the scale of financial compensation, the company culture and senior management performance all played a major role in influencing decisions to move on. Originality/value The authors offered three concrete recommendations based on the study. First, they said it was vital for companies to maintain strong Glassdoor.com ratings. The results revealed that firms in the top 10% of ratings were over 30% more likely to retain employees during a job transition than companies in the lowest 10%. Second, providing competitive salaries was necessary. Finally, the data showed a large discrepancy between senior management and CEO Glassdoor ratings. The researchers advised HR departments to closely monitor the impact of senior management behaviour.


2021 ◽  
pp. 1-13
Author(s):  
Hany Gamal ◽  
Ahmed Alsaihati ◽  
Salaheldin Elkatatny

Abstract Sonic data provides significant rock properties that are commonly used for designing the operational programs for drilling, rock fracturing, and development operations. The conventional methods for acquiring rock sonic data in terms of compressional and shear slowness (ΔTc and ΔTs) are considered costly and time-consuming. The target of this paper is to propose machine learning models for predicting the sonic logs from the drilling data in real time. Decision tree (DT) and random forest (RF) were employed as tree-based algorithms for building the sonic prediction models for wells drilled through a complex lithology of limestone, sandstone, shale, and carbonate formations. The input data for the models comprise the surface drilling parameters used to predict the shear and compressional slowness. The study employed a data set of 2,888 data points for building and testing the models, while a separately collected set of 2,863 data points was used for further validation of the sonic models. Sensitivity investigations were performed for the DT and RF models to confirm optimal accuracy. The correlation coefficient (R) and average absolute percentage error (AAPE) were used to check the models' accuracy between the actual values and the models' outputs, in addition to the predicted sonic log profiles. The results indicated that the developed sonic models have a high capability for sonic prediction from the drilling data: the DT model recorded R higher than 0.967 and AAPE less than 2.76% for the ΔTc and ΔTs models, while RF showed R higher than 0.991 with AAPE less than 1.07%. The further validation process indicated strong results for sonic prediction, with the RF model outperforming the DT models: RF showed R higher than 0.986 with AAPE less than 1.12%, while the DT prediction recorded R greater than 0.93 with AAPE less than 1.95%. Sonic prediction through the developed models will save the cost and time of acquiring sonic data through conventional methods and will provide real-time estimation from the drilling parameters.
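
As a rough illustration of the workflow described above, the sketch below fits a random forest regressor to synthetic surface-drilling features and reports the correlation coefficient (R) and average absolute percentage error (AAPE); the feature set and data are assumptions, not the study's field data.

```python
# Sketch of regression-based sonic prediction with R and AAPE metrics.
# Synthetic drilling features stand in for the real surface parameters.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.uniform(size=(2888, 6))                     # e.g. WOB, RPM, torque, SPP, flow rate, ROP
dtc = 60 + 40 * X[:, 0] + 10 * X[:, 5] + rng.normal(scale=2, size=2888)  # compressional slowness

X_tr, X_te, y_tr, y_te = train_test_split(X, dtc, test_size=0.3, random_state=1)
rf = RandomForestRegressor(n_estimators=300, random_state=1).fit(X_tr, y_tr)
pred = rf.predict(X_te)

r = np.corrcoef(y_te, pred)[0, 1]                   # correlation coefficient
aape = np.mean(np.abs((y_te - pred) / y_te)) * 100  # average absolute percentage error
print(f"R = {r:.3f}, AAPE = {aape:.2f}%")
```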


2020 ◽  
Vol 4 (Supplement_1) ◽  
pp. 268-269
Author(s):  
Jaime Speiser ◽  
Kathryn Callahan ◽  
Jason Fanning ◽  
Thomas Gill ◽  
Anne Newman ◽  
...  

Abstract Advances in computational algorithms and the availability of large datasets with clinically relevant characteristics provide an opportunity to develop machine learning prediction models to aid in diagnosis, prognosis, and treatment of older adults. Some studies have employed machine learning methods for prediction modeling, but skepticism of these methods remains due to lack of reproducibility and difficulty understanding the complex algorithms behind models. We aim to provide an overview of two common machine learning methods: decision tree and random forest. We focus on these methods because they provide a high degree of interpretability. We discuss the underlying algorithms of decision tree and random forest methods and present a tutorial for developing prediction models for serious fall injury using data from the Lifestyle Interventions and Independence for Elders (LIFE) study. Decision tree is a machine learning method that produces a model resembling a flow chart. Random forest consists of a collection of many decision trees whose results are aggregated. In the tutorial example, we discuss evaluation metrics and interpretation for these models. Illustrated with data from the LIFE study, prediction models for serious fall injury were moderate at best (area under the receiver operating characteristic curve of 0.54 for decision tree and 0.66 for random forest). Machine learning methods may offer improved performance compared to traditional models for modeling outcomes in aging, but their use should be justified and output should be carefully described. Models should be assessed by clinical experts to ensure compatibility with clinical practice.
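
A minimal sketch of the evaluation step described, computing the area under the ROC curve for a decision tree and a random forest, is shown below; the synthetic binary outcome merely stands in for the serious fall injury label in the LIFE data.

```python
# Sketch: AUC comparison of a decision tree and a random forest classifier.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(1500, 8))                       # synthetic participant characteristics
y = (X[:, 0] - X[:, 1] + rng.normal(scale=2, size=1500) > 0).astype(int)  # e.g. serious fall injury

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=2)
for name, model in [("decision tree", DecisionTreeClassifier(max_depth=4, random_state=2)),
                    ("random forest", RandomForestClassifier(n_estimators=500, random_state=2))]:
    model.fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(name, "AUC:", round(auc, 3))
```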


2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Rajit Nair ◽  
Santosh Vishwakarma ◽  
Mukesh Soni ◽  
Tejas Patel ◽  
Shubham Joshi

Purpose The novel coronavirus (COVID-19), which first appeared in December 2019 in the city of Wuhan, China, rapidly spread around the world and became a pandemic. It has had a devastating impact on daily lives, public health and the global economy. Positive cases must be identified as soon as possible to avoid further dissemination of this disease and to provide swift care to affected patients. The need for supportive diagnostic instruments has increased, as no specific automated toolkits are available. The latest results from radiology imaging techniques indicate that these images provide valuable details on COVID-19. Advanced artificial intelligence (AI) technologies combined with radiological imagery can help diagnose this condition accurately and help compensate for the lack of specialist doctors in isolated areas. In this research, a new model for the automatic detection of COVID-19 from raw chest X-ray images is presented. The proposed model, DarkCovidNet, is designed to provide accurate diagnostics for binary classification (COVID vs no findings) and multi-class classification (COVID vs no findings vs pneumonia). The implemented model achieved an average precision of 98.46% and 91.352% for binary and multi-class classification, respectively, and an average accuracy of 98.97% and 87.868%. The DarkNet model, used as the classifier in the you-only-look-once (YOLO) real-time object detection method, served as the starting point for this research. A total of 17 convolutional layers with different filters on each layer have been implemented. This platform can be used by radiologists to verify their initial screening and can also be used to screen patients through the cloud. Design/methodology/approach This study uses the CNN-based model named Darknet-19, which acts as the platform for a real-time object detection system; its architecture is designed for real-time object detection. The DarkCovidNet model was developed based on the Darknet architecture with fewer layers and filters. Before discussing the DarkCovidNet model, the Darknet architecture and its functionality are outlined: typically, the DarkNet architecture consists of 19 convolution layers and 5 max-pooling layers, with each convolution layer and each pooling layer treated as a separate block. Findings The work discussed in this paper is used to diagnose various radiology images and to develop a model that can accurately predict or classify the disease. The data set used in this work consists of COVID-19 and non-COVID-19 chest X-ray images taken from various sources. The deep learning model named DarkCovidNet is applied to the data set and has shown significant performance for both binary and multi-class classification. For binary classification, the model achieved an average accuracy of 98.97% for the detection of COVID-19, whereas the multi-class model achieved an average accuracy of 87.868% when classifying COVID-19, no findings and pneumonia. Research limitations/implications One of the significant limitations of this work is that a limited number of chest X-ray images were used. It is observed that the number of patients with COVID-19 is increasing rapidly. In the future, the model will be trained on larger data sets generated from local hospitals, and its performance on those data will be evaluated.
Originality/value Deep learning technology has made significant changes in the field of AI by generating good results, especially in pattern recognition. A typical CNN structure includes a convolution layer that extracts features from the input with the filters it applies, a pooling layer that reduces the size for computational performance and a fully connected layer, which is a neural network. A CNN model is created by combining one or more such layers, and its internal parameters are adjusted to accomplish a particular task, such as classification or object recognition.
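
The generic convolution–pooling–fully-connected structure described above can be sketched in PyTorch as follows; this is a toy block under assumed input size and layer counts, not the 17-layer DarkCovidNet or the 19-layer DarkNet architecture.

```python
# Toy CNN sketch: convolution layers extract features, max pooling reduces
# spatial size, and a fully connected layer produces the class scores.
# This is not the DarkCovidNet/DarkNet-19 architecture, only its basic pattern.
import torch
import torch.nn as nn

class TinyCovidCNN(nn.Module):
    def __init__(self, num_classes=3):          # e.g. COVID / no findings / pneumonia
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.LeakyReLU(0.1),
            nn.MaxPool2d(2),
            nn.Conv2d(8, 16, kernel_size=3, padding=1), nn.LeakyReLU(0.1),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(16 * 56 * 56, num_classes)  # assumes 224x224 input

    def forward(self, x):
        x = self.features(x)
        return self.classifier(torch.flatten(x, 1))

logits = TinyCovidCNN()(torch.randn(2, 1, 224, 224))  # two grayscale chest X-rays
print(logits.shape)                                    # torch.Size([2, 3])
```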


2015 ◽  
Vol 54 (06) ◽  
pp. 560-567 ◽  
Author(s):  
K. Zhu ◽  
Z. Lou ◽  
J. Zhou ◽  
N. Ballester ◽  
P. Parikh ◽  
...  

Summary Introduction: This article is part of the Focus Theme of Methods of Information in Medicine on “Big Data and Analytics in Healthcare”. Background: Hospital readmissions raise healthcare costs and cause significant distress to providers and patients. It is, therefore, of great interest to healthcare organizations to predict which patients are at risk of being readmitted to their hospitals. However, current logistic regression based risk prediction models have limited prediction power when applied to hospital administrative data. Meanwhile, although decision trees and random forests have been applied, they tend to be too complex for hospital practitioners to understand. Objectives: Explore the use of conditional logistic regression to increase prediction accuracy. Methods: We analyzed an HCUP statewide in-patient discharge record dataset, which includes patient demographics, clinical and care utilization data from California. We extracted records of heart failure Medicare beneficiaries who had inpatient experience during an 11-month period. We corrected the data imbalance issue with under-sampling. In our study, we first applied standard logistic regression and decision tree models to obtain influential variables and derive practically meaningful decision rules. We then stratified the original data set accordingly and applied logistic regression to each data stratum. We further explored the effect of interacting variables in the logistic regression modeling. We conducted cross validation to assess the overall prediction performance of conditional logistic regression (CLR) and compared it with standard classification models. Results: The developed CLR models outperformed several standard classification models (e.g., straightforward logistic regression, stepwise logistic regression, random forest, support vector machine). For example, the best CLR model improved the classification accuracy by nearly 20% over the straightforward logistic regression model. Furthermore, the developed CLR models tend to achieve better sensitivity, by more than 10%, than the standard classification models, which can be translated to the correct labeling of an additional 400–500 readmissions for heart failure patients in the state of California over a year. Lastly, several key predictors identified from the HCUP data include the disposition location at discharge, the number of chronic conditions, and the number of acute procedures. Conclusions: It would be beneficial to apply simple decision rules obtained from the decision tree in an ad-hoc manner to guide the cohort stratification. It could be potentially beneficial to explore the effect of pairwise interactions between influential predictors when building the logistic regression models for different data strata. Judicious use of the ad-hoc CLR models developed offers insights into future development of prediction models for hospital readmissions, which can lead to better intuition in identifying high-risk patients and developing effective post-discharge care strategies. Lastly, this paper is expected to raise awareness of collecting data on additional markers and developing the necessary database infrastructure for larger-scale exploratory studies on readmission risk prediction.
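
The ad-hoc conditional logistic regression idea described, using simple decision-tree-derived rules to stratify the cohort and then fitting a logistic regression within each stratum, might be sketched as below; the splitting rule, predictors and synthetic data are illustrative assumptions, not the HCUP data.

```python
# Sketch of "conditional" logistic regression: stratify the data with a simple
# decision-tree-style rule, then fit a separate logistic regression per stratum.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
n_chronic = rng.integers(0, 10, size=3000)           # e.g. number of chronic conditions
other = rng.normal(size=(3000, 3))                   # other administrative predictors
X = np.column_stack([n_chronic, other])
y = (0.3 * n_chronic + other[:, 0] + rng.normal(size=3000) > 2).astype(int)  # readmitted?

strata = {"few chronic conditions": X[:, 0] < 4,     # illustrative tree-derived rule
          "many chronic conditions": X[:, 0] >= 4}
for name, mask in strata.items():
    clf = LogisticRegression(max_iter=1000).fit(X[mask], y[mask])
    print(name, "in-sample accuracy:", round(clf.score(X[mask], y[mask]), 3))
```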


2021 ◽  
pp. 1-21
Author(s):  
Hany Gamal ◽  
Ahmed Alsaihati ◽  
Salaheldin Elkatatny ◽  
Saleh Haidary ◽  
Abdulazeez Abdulraheem

Abstract The rock unconfined compressive strength (UCS) is one of the key parameters for geomechanical and reservoir modeling in the petroleum industry. Obtaining the UCS by conventional methods such as experimental work or empirical correlations from logging data is time-consuming and costly. To overcome these drawbacks, this paper utilized artificial intelligence (AI) to predict, in real time, the rock strength from the drilling parameters using two AI tools. Random forest (RF) based on principal component analysis (PCA) and functional network (FN) techniques were employed to build two UCS prediction models based on drilling data such as weight on bit (WOB), drill string rotating speed (RS), drilling torque (T), stand-pipe pressure (SPP), mud pumping rate (Q) and the rate of penetration (ROP). The models were built using 2,333 data points from Well (A) with a 70:30 training-to-testing ratio. The models were validated using an unseen data set (1,300 data points) from Well (B), which is located in the same field and drilled across the same complex lithology. The PCA-based RF model outperformed the FN model in terms of correlation coefficient (R) and average absolute percentage error (AAPE). The overall accuracy for the PCA-based RF model was an R of 0.99 and an AAPE of 4.3%, while the FN model yielded an R of 0.97 and an AAPE of 8.5%. The validation results showed that R was 0.99 for RF and 0.96 for FN, while the AAPE was 4% and 7.9% for the RF and FN models, respectively. The developed PCA-based RF and FN models provide an accurate UCS estimation in real time from the drilling data, saving time and cost and enhancing well stability by generating a UCS log from the rig drilling data.
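
A minimal sketch of the PCA-based random forest pipeline described above, with drilling parameters as inputs and UCS as the target, is given below; the synthetic data and the number of principal components are assumptions, not the study's field values.

```python
# Sketch of the PCA-based random forest regressor for UCS prediction.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(4)
X = rng.uniform(size=(2333, 6))                      # WOB, RS, T, SPP, Q, ROP (synthetic)
ucs = 30 + 120 * X[:, 0] * X[:, 5] + rng.normal(scale=5, size=2333)

X_tr, X_te, y_tr, y_te = train_test_split(X, ucs, test_size=0.3, random_state=4)
model = make_pipeline(PCA(n_components=4),           # illustrative choice of components
                      RandomForestRegressor(n_estimators=300, random_state=4))
model.fit(X_tr, y_tr)
pred = model.predict(X_te)
aape = np.mean(np.abs((y_te - pred) / y_te)) * 100
print(f"R = {np.corrcoef(y_te, pred)[0, 1]:.3f}, AAPE = {aape:.2f}%")
```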


2020 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Arpita Gupta ◽  
Saloni Priyani ◽  
Ramadoss Balakrishnan

Purpose In this study, the authors have used customer reviews of books and movies in natural language for the purpose of sentiment analysis and reputation generation on the reviews. Most of the existing work has performed sentiment analysis and reputation generation on reviews using single classification models and has considered other attributes for reputation generation. Design/methodology/approach The authors have taken review, helpfulness and rating into consideration. In this paper, the authors have performed sentiment analysis to extract the probability of a review belonging to a class, which is further used for generating the sentiment score and reputation of the review. The authors have used pre-trained BERT fine-tuned for sentiment analysis on movie and book reviews separately. Findings In this study, the authors have also combined the three models (BERT, Naïve Bayes and SVM) for more accurate sentiment classification and reputation generation, which has outperformed the best BERT model in this study. They have achieved the best accuracy of 91.2% for the movie review data set and 89.4% for the book review data set, which is better than the existing state-of-the-art methods. They have used the transfer learning concept in deep learning, in which knowledge gained from one problem is applied to a similar problem. Originality/value The authors have proposed a novel model based on the combination of three classification models, which has outperformed the existing state-of-the-art methods. To the best of the authors’ knowledge, there is no existing model which combines three models for sentiment score calculation and reputation generation for the book review data set.
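
The model combination described, merging class probabilities from several classifiers, can be sketched as below with Naïve Bayes and a linear SVM over TF-IDF features; the fine-tuned BERT component is replaced here by a placeholder probability array, and the tiny review list is an illustrative assumption.

```python
# Sketch: combine class probabilities from several classifiers by averaging.
# A fixed placeholder stands in for the fine-tuned BERT probabilities.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

reviews = ["great book, loved it", "wonderful plot and characters",
           "an excellent, moving film", "highly recommended read",
           "terrible plot and acting", "boring and far too long",
           "a dull, forgettable movie", "poorly written book"]
labels = np.array([1, 1, 1, 1, 0, 0, 0, 0])          # 1 = positive sentiment

X = TfidfVectorizer().fit_transform(reviews)
nb = MultinomialNB().fit(X, labels)
svm = SVC(kernel="linear", probability=True).fit(X, labels)

bert_proba = np.full((len(labels), 2), 0.5)          # placeholder for BERT's probabilities
combined = (nb.predict_proba(X) + svm.predict_proba(X) + bert_proba) / 3
print("combined positive-class scores:", combined[:, 1].round(2))
```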


2018 ◽  
Vol 41 (1) ◽  
pp. 96-112 ◽  
Author(s):  
Evy Rombaut ◽  
Marie-Anne Guerry

Purpose This paper aims to question whether the available data in the human resources (HR) system could result in reliable turnover predictions without supplementary survey information. Design/methodology/approach A decision tree approach and a logistic regression model for analysing turnover were introduced. The methodology is illustrated on a real-life data set from a Belgian branch of a private company. The model performance is evaluated by the area under the ROC curve (AUC) measure. Findings It was concluded that data in the personnel system indeed lead to valuable predictions of turnover. Practical implications The presented approach brings determinants of voluntary turnover to the surface. The results yield useful information for HR departments. Where the logistic regression results in a turnover probability at the individual level, the decision tree makes it possible to ascertain employee groups that are at risk for turnover. With the data set-based approach, each company can immediately ascertain its own turnover risk. Originality/value A data-driven approach to turnover investigation of this kind has not been studied so far.
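
The two complementary outputs described, an individual turnover probability from the logistic regression and group-level rules from the decision tree, might look as follows on synthetic HR-system data; the feature names and splitting structure are illustrative assumptions.

```python
# Sketch: logistic regression gives an individual turnover probability,
# while the decision tree exposes group-level rules for at-risk employees.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(5)
X = np.column_stack([rng.integers(0, 30, 800),       # tenure in years (synthetic)
                     rng.integers(20, 65, 800)])     # age (synthetic)
y = ((X[:, 0] < 3) & (rng.random(800) < 0.6)).astype(int)  # voluntary turnover

logit = LogisticRegression(max_iter=1000).fit(X, y)
print("turnover probability for an employee with 1 year tenure, aged 25:",
      round(logit.predict_proba([[1, 25]])[0, 1], 2))

tree = DecisionTreeClassifier(max_depth=2, random_state=5).fit(X, y)
print(export_text(tree, feature_names=["tenure", "age"]))  # group-level rules
```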


Author(s):  
Mochammad Agus Afrianto ◽  
Meditya Wasesa

Background: Literature on peer-to-peer accommodation has placed a substantial focus on the price determinants of accommodation listings. Developing prediction models related to the demand for accommodation listings is vital in revenue management because accurate price and demand forecasts help determine the best revenue management responses. Objective: This study aims to develop prediction models to determine the booking likelihood of accommodation listings. Methods: Using an Airbnb dataset, we developed four machine learning models, namely Logistic Regression, Decision Tree, K-Nearest Neighbor (KNN) and Random Forest classifiers. We assessed the models using the AUC-ROC score and the model development time, using the ten-fold three-way split and the ten-fold cross-validation procedures. Results: In terms of average AUC-ROC score, the Random Forest classifier outperformed the other evaluated models. In the three-way split procedure, it had a 15.03% higher AUC-ROC score than the Decision Tree, 2.93% higher than KNN and 2.38% higher than Logistic Regression. In the cross-validation procedure, it had a 26.99% higher AUC-ROC score than the Decision Tree, 4.41% higher than KNN and 3.31% higher than Logistic Regression. It should be noted that the Decision Tree model has the lowest AUC-ROC score but the smallest model development time. Conclusion: Random forest models deliver the best performance in predicting the booking likelihood of accommodation listings. The model can be used by peer-to-peer accommodation owners to improve their revenue management responses.
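
The cross-validation comparison described, scoring the four classifier families by AUC-ROC over ten folds while noting development time, can be sketched as below; the synthetic listing features are assumptions, not the study's Airbnb data set.

```python
# Sketch: ten-fold cross-validated AUC-ROC (and timing) for the four
# classifier families compared in the study, on synthetic listing features.
import time
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(6)
X = rng.normal(size=(2000, 5))                        # e.g. price, reviews, rating, beds, distance
y = (X[:, 1] - 0.5 * X[:, 0] + rng.normal(size=2000) > 0).astype(int)  # booked or not

models = {"logistic regression": LogisticRegression(max_iter=1000),
          "decision tree": DecisionTreeClassifier(max_depth=5),
          "knn": KNeighborsClassifier(),
          "random forest": RandomForestClassifier(n_estimators=200)}
for name, model in models.items():
    start = time.perf_counter()
    scores = cross_val_score(model, X, y, cv=10, scoring="roc_auc")
    print(f"{name}: mean AUC-ROC = {scores.mean():.3f}, "
          f"time = {time.perf_counter() - start:.1f}s")
```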

