scholarly journals Prediction of Employee Attrition Using Machine Learning and Ensemble Methods

2021 ◽  
Vol 11 (2) ◽  
pp. 110-114
Author(s):  
Aseel Qutub ◽  
◽  
Asmaa Al-Mehmadi ◽  
Munirah Al-Hssan ◽  
Ruyan Aljohani ◽  
...  

Employees are the most valuable resources for any organization. The cost associated with professional training, the developed loyalty over the years and the sensitivity of some organizational positions, all make it very essential to identify who might leave the organization. Many reasons can lead to employee attrition. In this paper, several machine learning models are developed to automatically and accurately predict employee attrition. IBM attrition dataset is used in this work to train and evaluate machine learning models; namely Decision Tree, Random Forest Regressor, Logistic Regressor, Adaboost Model, and Gradient Boosting Classifier models. The ultimate goal is to accurately detect attrition to help any company to improve different retention strategies on crucial employees and boost those employee satisfactions.

Author(s):  
Nelson Yego ◽  
Juma Kasozi ◽  
Joseph Nkrunziza

The role of insurance in financial inclusion as well as in economic growth is immense. However, low uptake seems to impede the growth of the sector hence the need for a model that robustly predicts uptake of insurance among potential clients. In this research, we compared the performances of eight (8) machine learning models in predicting the uptake of insurance. The classifiers considered were Logistic Regression, Gaussian Naive Bayes, Support Vector Machines, K Nearest Neighbors, Decision Tree, Random Forest, Gradient Boosting Machines and Extreme Gradient boosting. The data used in the classification was from the 2016 Kenya FinAccess Household Survey. Comparison of performance was done for both upsampled and downsampled data due to data imbalance. For upsampled data, Random Forest classifier showed highest accuracy and precision compared to other classifiers but for down sampled data, gradient boosting was optimal. It is noteworthy that for both upsampled and downsampled data, tree-based classifiers were more robust than others in insurance uptake prediction. However, in spite of hyper-parameter optimization, the area under receiver operating characteristic curve remained highest for Random Forest as compared to other tree-based models. Also, the confusion matrix for Random Forest showed least false positives, and highest true positives hence could be construed as the most robust model for predicting the insurance uptake. Finally, the most important feature in predicting uptake was having a bank product hence bancassurance could be said to be a plausible channel of distribution of insurance products.


2020 ◽  
Author(s):  
Qing Wu ◽  
Fatma Nasoz ◽  
Jongyun Jung ◽  
Bibek Bhattarai

AbstractBone mineral density (BMD) is a highly heritable trait with heritability ranging from 50% to 80%. Numerous BMD-associated Single Nucleotide Polymorphisms (SNPs) were discovered by GWAS and GWAS meta-analysis. However, several studies found that combining these highly significant SNPs together only explained a small percentage of BMD variance. This inconsistency may be caused by limitations of the linear regression approaches employed because these traditional approaches lack the flexibility and the adequacy to model complex gene interactions and regulations. Hence, we developed various machine learning models of genomic data and ran experiments to identify the best machine learning model for BMD prediction at three different sites. We used genomic data of Osteoporotic Fractures in Men (MrOS) cohort Study (N=5,133) for analysis. Genotype imputation was conducted at the Sanger Imputation Server. A total of 1,103 BMD-associated SNPs were identified and corresponding weighted genetic risk scores were calculated. Genetic variants, as well as age and other traditional BMD predictors, were included for modeling. Data were normalized and were split into a training set (80%) and a test set (20%). BMD prediction models were built separately by random forest, gradient boosting, and neural network algorithms. Linear regression was used as a reference model. We applied the non-parametric Wilcoxon signed-rank tests for the measurement of MSE in each model for the pair-wise model comparison. We found that gradient boosting shows the lowest MSE for each BMD site and a prediction model built using the machine learning models achieves improved performance when a large number of SNPs are included in the models. With the predictors of phenotype covariate + 1,103 SNPs, all of the models were statistically significant except neural network vs. random forest at femoral neck BMD and gradient boosting vs. random forest at total hip BMD.


The purpose of the research described in this article is a comparative analysis of the predictive qualities of some models of machine learning and regression. The factors for models are the consumer characteristics of a used car: brand, transmission type, drive type, engine type, mileage, body type, year of manufacture, seller's region in Ukraine, condition of the car, information about accident, average price for analogue in Ukraine, engine volume, quantity of doors, availability of extra equipment, quantity of passenger’s seats, the first registration of a car, car was driven from abroad or not. Qualitative variables has been encoded as binary variables or by mean target encoding. The information about more than 200 thousand cars have been used for modeling. All models have been evaluated in the Python Software using Sklearn, Catboost, StatModels and Keras libraries. The following regression models and machine learning models were considered in the course of the study: linear regression; polynomial regression; decision tree; neural network; models based on "k-nearest neighbors", "random forest", "gradient boosting" algorithms; ensemble of models. The article presents the best in terms of quality (according to the criteria R2, MAE, MAD, MAPE) options from each class of models. It has been found that the best way to predict the price of a passenger car is through non-linear models. The results of the modeling show that the dependence between the price of a car and its characteristics is best described by the ensemble of models, which includes a neural network, models using "random forest" and "gradient boosting" algorithms. The ensemble of models showed an average relative approximation error of 11.2% and an average relative forecast error of 14.34%. All nonlinear models for car price have approximately the same predictive qualities (the difference between the MAPE within 2%) in this research.


2020 ◽  
Author(s):  
Albert Morera ◽  
Juan Martínez de Aragón ◽  
José Antonio Bonet ◽  
Jingjing Liang ◽  
Sergio de-Miguel

Abstract BackgroundThe prediction of biogeographical patterns from a large number of driving factors with complex interactions, correlations and non-linear dependences require advanced analytical methods and modelling tools. This study compares different statistical and machine learning models for predicting fungal productivity biogeographical patterns as a case study for the thorough assessment of the performance of alternative modelling approaches to provide accurate and ecologically-consistent predictions.MethodsWe evaluated and compared the performance of two statistical modelling techniques, namely, generalized linear mixed models and geographically weighted regression, and four machine learning models, namely, random forest, extreme gradient boosting, support vector machine and deep learning to predict fungal productivity. We used a systematic methodology based on substitution, random, spatial and climatic blocking combined with principal component analysis, together with an evaluation of the ecological consistency of spatially-explicit model predictions.ResultsFungal productivity predictions were sensitive to the modelling approach and complexity. Moreover, the importance assigned to different predictors varied between machine learning modelling approaches. Decision tree-based models increased prediction accuracy by ~7% compared to other machine learning approaches and by more than 25% compared to statistical ones, and resulted in higher ecological consistence at the landscape level.ConclusionsWhereas a large number of predictors are often used in machine learning algorithms, in this study we show that proper variable selection is crucial to create robust models for extrapolation in biophysically differentiated areas. When dealing with spatial-temporal data in the analysis of biogeographical patterns, climatic blocking is postulated as a highly informative technique to be used in cross-validation to assess the prediction error over larger scales. Random forest was the best approach for prediction both in sampling-like environments as well as in extrapolation beyond the spatial and climatic range of the modelling data.


Author(s):  
Farrikh Alzami ◽  
Erika Devi Udayanti ◽  
Dwi Puji Prabowo ◽  
Rama Aria Megantara

Sentiment analysis in terms of polarity classification is very important in everyday life, with the existence of polarity, many people can find out whether the respected document has positive or negative sentiment so that it can help in choosing and making decisions. Sentiment analysis usually done manually. Therefore, an automatic sentiment analysis classification process is needed. However, it is rare to find studies that discuss extraction features and which learning models are suitable for unstructured sentiment analysis types with the Amazon food review case. This research explores some extraction features such as Word Bags, TF-IDF, Word2Vector, as well as a combination of TF-IDF and Word2Vector with several machine learning models such as Random Forest, SVM, KNN and Naïve Bayes to find out a combination of feature extraction and learning models that can help add variety to the analysis of polarity sentiments. By assisting with document preparation such as html tags and punctuation and special characters, using snowball stemming, TF-IDF results obtained with SVM are suitable for obtaining a polarity classification in unstructured sentiment analysis for the case of Amazon food review with a performance result of 87,3 percent.


2021 ◽  
Vol 21 (1) ◽  
Author(s):  
Moojung Kim ◽  
Young Jae Kim ◽  
Sung Jin Park ◽  
Kwang Gi Kim ◽  
Pyung Chun Oh ◽  
...  

Abstract Background Annual influenza vaccination is an important public health measure to prevent influenza infections and is strongly recommended for cardiovascular disease (CVD) patients, especially in the current coronavirus disease 2019 (COVID-19) pandemic. The aim of this study is to develop a machine learning model to identify Korean adult CVD patients with low adherence to influenza vaccination Methods Adults with CVD (n = 815) from a nationally representative dataset of the Fifth Korea National Health and Nutrition Examination Survey (KNHANES V) were analyzed. Among these adults, 500 (61.4%) had answered "yes" to whether they had received seasonal influenza vaccinations in the past 12 months. The classification process was performed using the logistic regression (LR), random forest (RF), support vector machine (SVM), and extreme gradient boosting (XGB) machine learning techniques. Because the Ministry of Health and Welfare in Korea offers free influenza immunization for the elderly, separate models were developed for the < 65 and ≥ 65 age groups. Results The accuracy of machine learning models using 16 variables as predictors of low influenza vaccination adherence was compared; for the ≥ 65 age group, XGB (84.7%) and RF (84.7%) have the best accuracies, followed by LR (82.7%) and SVM (77.6%). For the < 65 age group, SVM has the best accuracy (68.4%), followed by RF (64.9%), LR (63.2%), and XGB (61.4%). Conclusions The machine leaning models show comparable performance in classifying adult CVD patients with low adherence to influenza vaccination.


Energies ◽  
2021 ◽  
Vol 14 (23) ◽  
pp. 7834
Author(s):  
Christopher Hecht ◽  
Jan Figgener ◽  
Dirk Uwe Sauer

Electric vehicles may reduce greenhouse gas emissions from individual mobility. Due to the long charging times, accurate planning is necessary, for which the availability of charging infrastructure must be known. In this paper, we show how the occupation status of charging infrastructure can be predicted for the next day using machine learning models— Gradient Boosting Classifier and Random Forest Classifier. Since both are ensemble models, binary training data (occupied vs. available) can be used to provide a certainty measure for predictions. The prediction may be used to adapt prices in a high-load scenario, predict grid stress, or forecast available power for smart or bidirectional charging. The models were chosen based on an evaluation of 13 different, typically used machine learning models. We show that it is necessary to know past charging station usage in order to predict future usage. Other features such as traffic density or weather have a limited effect. We show that a Gradient Boosting Classifier achieves 94.8% accuracy and a Matthews correlation coefficient of 0.838, making ensemble models a suitable tool. We further demonstrate how a model trained on binary data can perform non-binary predictions to give predictions in the categories “low likelihood” to “high likelihood”.


2021 ◽  
Vol 9 ◽  
Author(s):  
Daniel Lowell Weller ◽  
Tanzy M. T. Love ◽  
Martin Wiedmann

Recent studies have shown that predictive models can supplement or provide alternatives to E. coli-testing for assessing the potential presence of food safety hazards in water used for produce production. However, these studies used balanced training data and focused on enteric pathogens. As such, research is needed to determine 1) if predictive models can be used to assess Listeria contamination of agricultural water, and 2) how resampling (to deal with imbalanced data) affects performance of these models. To address these knowledge gaps, this study developed models that predict nonpathogenic Listeria spp. (excluding L. monocytogenes) and L. monocytogenes presence in agricultural water using various combinations of learner (e.g., random forest, regression), feature type, and resampling method (none, oversampling, SMOTE). Four feature types were used in model training: microbial, physicochemical, spatial, and weather. “Full models” were trained using all four feature types, while “nested models” used between one and three types. In total, 45 full (15 learners*3 resampling approaches) and 108 nested (5 learners*9 feature sets*3 resampling approaches) models were trained per outcome. Model performance was compared against baseline models where E. coli concentration was the sole predictor. Overall, the machine learning models outperformed the baseline E. coli models, with random forests outperforming models built using other learners (e.g., rule-based learners). Resampling produced more accurate models than not resampling, with SMOTE models outperforming, on average, oversampling models. Regardless of resampling method, spatial and physicochemical water quality features drove accurate predictions for the nonpathogenic Listeria spp. and L. monocytogenes models, respectively. Overall, these findings 1) illustrate the need for alternatives to existing E. coli-based monitoring programs for assessing agricultural water for the presence of potential food safety hazards, and 2) suggest that predictive models may be one such alternative. Moreover, these findings provide a conceptual framework for how such models can be developed in the future with the ultimate aim of developing models that can be integrated into on-farm risk management programs. For example, future studies should consider using random forest learners, SMOTE resampling, and spatial features to develop models to predict the presence of foodborne pathogens, such as L. monocytogenes, in agricultural water when the training data is imbalanced.


Sign in / Sign up

Export Citation Format

Share Document