Implementation Of Machine Learning To Determine The Best Employees Using Random Forest Method

Putri Taqwa Prasetyaningrun; Irfan Pratama; Albert Yakobus Chandra

doi:10.33005/ijconsist.v2i02.43

IJCONSIST JOURNALS ◽

10.33005/ijconsist.v2i02.43 ◽

2021 ◽

Vol 2 (02) ◽

pp. 53-59

Author(s):

Putri Taqwa Prasetyaningrun ◽

Irfan Pratama ◽

Albert Yakobus Chandra

Keyword(s):

Random Forest ◽

Real Data ◽

Random Forest Classifier ◽

Data Sets ◽

Ensemble Technique ◽

Random Forest Method ◽

The World ◽

Build System ◽

Bagging Method ◽

Employee Data

In the world of work, the presence of outstanding employees is a benchmark of a company's progress. The best employees are usually identified by their performance, for example their diligence, discipline, and other achievements. The goal is to optimize decision making when selecting the best employees. The models obtained for employee prediction were tested on a real data set provided by IBM Analytics, which includes 29 features and about 22005 samples. In this paper we build a system that predicts employee attrition based on a collection of employee data from the Kaggle website. We used four machine learning algorithms, KNN (K-Nearest Neighbors), Naïve Bayes, Decision Tree, and Random Forest, plus two ensemble techniques, namely stacking and bagging. Results are expressed in terms of classic metrics, and the algorithm that produces the best result for the available data set is the Random Forest classifier. It achieves the best recall (0.88), matched by the stacking and bagging methods with the same value.
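
The abstract does not list which "classic metrics" were used. The following is a minimal sketch, assuming they are the usual precision, recall, and F1 score for a binary label. The function name and the toy labels are hypothetical and are not taken from the paper or its data set.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return precision, recall and F1 for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against empty denominators so the sketch never divides by zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical toy labels (1 = positive class), for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Recall is the share of true positives that the classifier recovers, so it depends only on the true-positive and false-negative counts.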

A Dirichlet-Multinomial Bayes Classifier for Disease Diagnosis with Microbial Compositions

mSphere ◽

10.1128/mspheredirect.00536-17 ◽

2017 ◽

Vol 2 (6) ◽

Cited By ~ 3

Author(s):

Xiang Gao ◽

Huaiying Lin ◽

Qunfeng Dong

Keyword(s):

Random Forest ◽

Classification Accuracy ◽

Multinomial Distribution ◽

Disease Diagnosis ◽

Disease Prevalence ◽

Data Sets ◽

Bayes Classifier ◽

Data Set ◽

Random Forest Method ◽

Microbiome Data

ABSTRACT Dysbiosis of microbial communities is associated with various human diseases, raising the possibility of using microbial compositions as biomarkers for disease diagnosis. We have developed a Bayes classifier by modeling microbial compositions with Dirichlet-multinomial distributions, which are widely used to model multicategorical count data with extra variation. The parameters of the Dirichlet-multinomial distributions are estimated from training microbiome data sets based on maximum likelihood. The posterior probability of a microbiome sample belonging to a disease or healthy category is calculated based on Bayes’ theorem, using the likelihood values computed from the estimated Dirichlet-multinomial distribution, as well as a prior probability estimated from the training microbiome data set or previously published information on disease prevalence. When tested on real-world microbiome data sets, our method, called DMBC (for Dirichlet-multinomial Bayes classifier), shows better classification accuracy than the only existing Bayesian microbiome classifier based on a Dirichlet-multinomial mixture model and the popular random forest method. The advantage of DMBC is its built-in automatic feature selection, capable of identifying a subset of microbial taxa with the best classification accuracy between different classes of samples based on cross-validation. This unique ability enables DMBC to maintain and even improve its accuracy at modeling species-level taxa. The R package for DMBC is freely available at https://github.com/qunfengdong/DMBC. IMPORTANCE By incorporating prior information on disease prevalence, Bayes classifiers have the potential to estimate disease probability better than other common machine-learning methods. Thus, it is important to develop Bayes classifiers specifically tailored for microbiome data. 
Our method shows higher classification accuracy than the only existing Bayesian classifier and the popular random forest method, and thus provides an alternative option for using microbial compositions for disease diagnosis.
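
The classification step described above can be sketched as follows. This is a minimal illustration, not the DMBC R package: the Dirichlet-multinomial parameters (`class_alphas`), the priors, and the sample counts are hypothetical fixed values, whereas the abstract says the parameters are estimated from training data by maximum likelihood. Feature selection is not shown.

```python
import math


def dm_log_likelihood(counts, alpha):
    """Log probability of a count vector under a Dirichlet-multinomial."""
    n = sum(counts)
    a = sum(alpha)
    ll = math.lgamma(n + 1) + math.lgamma(a) - math.lgamma(n + a)
    for x, al in zip(counts, alpha):
        ll += math.lgamma(x + al) - math.lgamma(x + 1) - math.lgamma(al)
    return ll


def posterior(counts, class_alphas, priors):
    """Posterior class probabilities from Bayes' theorem."""
    log_joint = {
        c: math.log(priors[c]) + dm_log_likelihood(counts, class_alphas[c])
        for c in class_alphas
    }
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(log_joint.values())
    total = sum(math.exp(v - m) for v in log_joint.values())
    return {c: math.exp(v - m) / total for c, v in log_joint.items()}


# Hypothetical parameters for three taxa; not estimated from real data.
class_alphas = {"healthy": [8.0, 1.0, 1.0], "disease": [1.0, 1.0, 8.0]}
priors = {"healthy": 0.9, "disease": 0.1}  # assumed disease prevalence of 10%
sample = [2, 3, 45]  # read counts per taxon in one sample
post = posterior(sample, class_alphas, priors)
print(post)
```

In this toy case the sample is dominated by the third taxon, so the likelihood under the "disease" parameters outweighs the low prior for that class.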

Churn Prediction and Fraud Detection in Dairy Sector Using Machine Learning

Advances in Library and Information Science - Handbook of Research on Records and Information Management Strategies for Enhanced Knowledge Coordination ◽

10.4018/978-1-7998-6618-3.ch023 ◽

2021 ◽

pp. 391-406

Author(s):

Hitarth Deepak Shah ◽

Chintan M. Bhatt ◽

Shubham Mitul Patel ◽

Jayshil Bhavin Khajanchi ◽

Jaimin Narendrakumar Makwana

Keyword(s):

Machine Learning ◽

Random Forest ◽

Research Work ◽

Dairy Industry ◽

Fraud Detection ◽

Random Forest Classifier ◽

Machine Learning Algorithms ◽

Financial Capital ◽

Churn Prediction ◽

The World

India has been the largest milk-producing country in the world for two decades. About 400 million litres of milk are produced every day. It is the responsibility of the dairy sector to look after the farmers by providing them with various services for their livelihood. The growing financial capital of the dairy industry has attracted various kinds of fraudulent behaviour. Most suspicious activity is seen during collection at local collection centres: fake farmer entries, manually tampered quantity and fat entries, and adulteration are the most prominent malpractices exercised by farmers. In this research work, the authors present an in-depth study of the most popular machine learning methods applied to the problems of farmer churn prediction and fraud detection in dairies. They applied a wide range of machine learning algorithms to get accurate results for churn and fraud detection. The XGBoost classifier was the best for churn prediction with 93% accuracy, while the random forest classifier proved effective for fraud detection with 94% accuracy.

Weather Regime Prediction Using Statistical Learning

Journal of the Atmospheric Sciences ◽

10.1175/jas3918.1 ◽

2007 ◽

Vol 64 (5) ◽

pp. 1619-1635 ◽

Cited By ~ 29

Author(s):

A. Deloncle ◽

R. Berk ◽

F. D’Andrea ◽

M. Ghil

Keyword(s):

Random Forest ◽

Nearest Neighbor ◽

Real Data ◽

Empirical Orthogonal Functions ◽

False Alarms ◽

K Nearest Neighbor ◽

Orthogonal Functions ◽

Weather Regimes ◽

Random Forest Method ◽

Sensitivity Studies

Abstract Two novel statistical methods are applied to the prediction of transitions between weather regimes. The methods are tested using a long, 6000-day simulation of a three-layer, quasigeostrophic (QG3) model on the sphere at T21 resolution. The two methods are the k nearest neighbor classifier and the random forest method. Both methods are widely used in statistical classification and machine learning; they are applied here to forecast the break of a regime and subsequent onset of another one. The QG3 model has been previously shown to possess realistic weather regimes in its northern hemisphere and preferred transitions between these have been determined. The two methods are applied to the three more robust transitions; they both demonstrate a skill of 35%–40% better than random and are thus encouraging for use on real data. Moreover, the random forest method allows one, while keeping the overall skill unchanged, to efficiently adjust the ratio of correctly predicted transitions to false alarms. A long-standing conjecture has associated regime breaks and preferred transitions with distinct directions in the reduced model phase space spanned by a few leading empirical orthogonal functions of its variability. Sensitivity studies for several predictors confirm the crucial influence of the exit angle on a preferred transition path. The present results thus support the paradigm of multiple weather regimes and their association with unstable fixed points of atmospheric dynamics.
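
The abstract does not say how the ratio of correctly predicted transitions to false alarms is adjusted. One common way, assumed here purely for illustration, is to change the fraction of trees that must vote "transition" before a transition is forecast. The vote fractions and outcomes below are made up and are not results from the QG3 experiments.

```python
def hits_and_false_alarms(vote_fractions, is_transition, threshold):
    """Count correct transition forecasts and false alarms at a vote threshold."""
    hits = 0
    false_alarms = 0
    for votes, actual in zip(vote_fractions, is_transition):
        if votes >= threshold:  # the forest forecasts a transition
            if actual:
                hits += 1
            else:
                false_alarms += 1
    return hits, false_alarms


# Hypothetical fraction of trees voting "transition" on eight days,
# with whether a transition actually occurred.
votes = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
truth = [True, True, False, True, False, True, False, False]

table = {t: hits_and_false_alarms(votes, truth, t) for t in (0.25, 0.5, 0.75)}
for threshold, (hits, false_alarms) in table.items():
    print(threshold, hits, false_alarms)
```

Raising the threshold in this toy example removes false alarms at the cost of missing some transitions, which is the kind of trade-off the abstract refers to.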

Adaptation of Error Adjusted Bagging Method for Prediction

International Journal of Data Warehousing and Mining ◽

10.4018/ijdwm.2019070102 ◽

2019 ◽

Vol 15 (3) ◽

pp. 28-45

Author(s):

Selen Yilmaz Isikhan ◽

Erdem Karabulut ◽

Afshin Samadi ◽

Saadettin Kılıçkap

Keyword(s):

Support Vector Regression ◽

Simulation Study ◽

Regression Tree ◽

Real Data ◽

Support Vector ◽

Data Sets ◽

Training Set ◽

Simulation Data ◽

Bias Variance ◽

Bagging Method

In this study, the error-adjusted bagging technique is adapted to support vector regression (SVR) and regression tree (RT) methods to obtain more accurate predictions, and the performance of the methods is then evaluated with real data sets and a simulation study. For this purpose, the prediction performances of single models, classical bagging models, and error-adjusted bagging models obtained via complementary versions of the above-mentioned methods are compared. The comparison is mainly based on a real data set of 295 patients with Hodgkin's lymphoma (HL). The effect of several parameters on model performance, such as the training set ratio and the number of influential predictors, is examined with 500 repetitions of simulation data. The results reveal that the error-adjusted bagging method provides the best performance compared with both the single and the classical bagging versions of the methods. Furthermore, the bias-variance analysis confirms the success of this technique in reducing both bias and variance.

Fake User Detection in Twitter using Random forest algorithm with Python

International Journal of Innovative Technology and Exploring Engineering - Special Issue ◽

10.35940/ijitee.g5919.059720 ◽

2020 ◽

Vol 9 (7) ◽

pp. 1293-1295

Keyword(s):

Social Networks ◽

Random Forest ◽

Social Networking ◽

Online Social Networks ◽

Social Networking Sites ◽

Major Effect ◽

Random Forest Classifier ◽

Random Forest Algorithm ◽

The World ◽

Target Platform

Millions of users around the world are engaged with social networking sites. Social sites such as Twitter and Facebook have a large impact on everyday life, and user interactions on them occasionally cause unwanted consequences. Spammers target social networking sites to spread large amounts of inappropriate and harmful data. Twitter is a prime example: it has become one of the main platforms for an unreasonable amount of spam, with fake users tweeting to promote websites or services, which seriously affects legitimate users and wastes resources. This opening for unusual and harmful information has increased the number of fake identities, which in turn spread invalid data. Detecting fake users on Twitter is therefore a natural topic for research on current online social networks (OSNs). This paper uses a random forest classifier and the ROC curve to detect fake users.
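
The abstract gives no detail on how the ROC curve is built. As a minimal sketch, assuming the classifier outputs a score per account (for example the fraction of trees voting "fake"), the curve can be traced by sweeping a threshold over the scores. The scores and labels below are hypothetical, and the code assumes at least one fake and one genuine account.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) points of the ROC curve."""
    pos = sum(labels)            # number of fake accounts (label 1)
    neg = len(labels) - pos      # number of genuine accounts (label 0)
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        # Accounts with equal scores cross the threshold together.
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1]:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / neg, tp / pos))
        i = j
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


# Hypothetical classifier scores and true labels (1 = fake user).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]
curve = roc_points(scores, labels)
area = auc(curve)
print(curve, area)
```

An area of 1.0 would mean every fake account is scored above every genuine one, while 0.5 corresponds to random ranking.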

Using Lexicon and Random Forest Classifier for Twitter Sentiment Analysis

International Journal of Computer Sciences and Engineering ◽

10.26438/ijcse/v7i6.591594 ◽

2019 ◽

Vol 7 (6) ◽

pp. 591-594

Author(s):

M. Thenmozhi ◽

R. Indira ◽

R. Dharani

Keyword(s):

Random Forest ◽

Sentiment Analysis ◽

Random Forest Classifier

The Performance Improvement on 4-FB Face using Random Forest Classifier

Scientific Visualization ◽

10.26583/sv.10.5.01 ◽

2018 ◽

Vol 10 (5) ◽

pp. 1-12

Author(s):

B. Nassih ◽

A. Amine ◽

M. Ngadi ◽

D. Naji ◽

N. Hmina

Keyword(s):

Random Forest ◽

Performance Improvement ◽

Random Forest Classifier

The Random Forest Method in Research of Impact of Macroeconomic Indicators of Regional Development on Informal Employment Rate

Voprosy statistiki ◽

10.34023/2313-6383-2020-27-6-37-55 ◽

2020 ◽

Vol 27 (6) ◽

pp. 37-55

Author(s):

E. V. Zarova ◽

E. I. Dubravskaya

Keyword(s):

Random Forest ◽

Regional Development ◽

Russian Federation ◽

Regional Level ◽

Informal Employment ◽

Macroeconomic Factors ◽

Random Forest Method ◽

Macroeconomic Indicators ◽

The Russian Federation ◽

The Impact

The topic of quantitative research on informal employment has a consistently high relevance both in the Russian Federation and in other countries due to its high dependence on cyclicality and crisis stages in economic dynamics of countries with any level of economic development. Developing effective government policy measures to overcome the negative impact of informal employment requires special attention in theoretical and applied research to assessing the factors and conditions of informal employment in the Russian Federation, including at the regional level. Such effects of informal employment as a shortfall in taxes, potential losses in production efficiency, and negative social consequences are a concern for the authorities of the federal and regional levels. The development of quantitative indicators to determine the level of informal employment in the regions, taking into account their specifics in the general spatial and economic system of Russia, is necessary to overcome these negative effects. The article proposes and tests methods for solving the problem of assessing the impact of hierarchical relationships on macroeconomic factors at the regional level of informal employment in constituent entities of the Russian Federation. The majority of works on the study of informal employment are based on basic statistical methods of spatial-dynamic analysis, as well as on the now «traditional» methods of cluster and correlation-regression analysis. Without diminishing the merits of these methods, it should be noted that they are somewhat limited in identifying hidden structural connections and interdependencies in such a complex multidimensional phenomenon as informal employment. In order to substantiate the possibility of overcoming these limitations, the article proposes indicators of regional statistics that directly and indirectly characterize informal employment and also presents the possibilities of using the «random forest» method to identify groups of constituent entities of the Russian Federation that have similar macroeconomic factors of informal employment. The novelty of this method in terms of research objectives is that it allows one to assess the impact of macroeconomic indicators of regional development on the level of informal employment, taking into account the implicit hierarchical relationships of factor indicators that are not predetermined by the initial hypotheses. Based on the generalization of the studies presented in the literature, as well as the authors’ statistical calculations using Rosstat data, the authors came to the conclusion about the high importance of macroeconomic parameters of regional development and systemic relationships of macroeconomic indicators in substantiating the differentiation of the level of informal employment across the constituent entities of the Russian Federation.

Nglyc: A Random Forest Method for Prediction of N-Glycosylation Sites in Eukaryotic Protein Sequence

Protein and Peptide Letters ◽

10.2174/0929866526666191002111404 ◽

2020 ◽

Vol 27 (3) ◽

pp. 178-186 ◽

Cited By ~ 2

Author(s):

Ganesan Pugalenthi ◽

Varadharaju Nithya ◽

Kuo-Chen Chou ◽

Govindaraju Archunan

Keyword(s):

Random Forest ◽

Protein Sequence ◽

Glycosylation Site ◽

Computational Method ◽

The Other ◽

Eukaryotic Protein ◽

Random Forest Method ◽

Glycosylation Sites ◽

Human And Mouse ◽

Better Than

Background: N-Glycosylation is one of the most important post-translational mechanisms in eukaryotes. N-glycosylation predominantly occurs in N-X-[S/T] sequon where X is any amino acid other than proline. However, not all N-X-[S/T] sequons in proteins are glycosylated. Therefore, accurate prediction of N-glycosylation sites is essential to understand N-glycosylation mechanism. Objective: In this article, our motivation is to develop a computational method to predict N-glycosylation sites in eukaryotic protein sequences. Methods: In this article, we report a random forest method, Nglyc, to predict N-glycosylation site from protein sequence, using 315 sequence features. The method was trained using a dataset of 600 N-glycosylation sites and 600 non-glycosylation sites and tested on the dataset containing 295 N-glycosylation sites and 253 non-glycosylation sites. Nglyc prediction was compared with NetNGlyc, EnsembleGly and GPP methods. Further, the performance of Nglyc was evaluated using human and mouse N-glycosylation sites. Results: Nglyc method achieved an overall training accuracy of 0.8033 with all 315 features. Performance comparison with NetNGlyc, EnsembleGly and GPP methods shows that Nglyc performs better than the other methods with high sensitivity and specificity rate. Conclusion: Our method achieved an overall accuracy of 0.8248 with 0.8305 sensitivity and 0.8182 specificity. Comparison study shows that our method performs better than the other methods. Applicability and success of our method was further evaluated using human and mouse N-glycosylation sites. Nglyc method is freely available at https://github.com/bioinformaticsML/Ngly.
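
The N-X-[S/T] sequon rule stated in the background can be sketched with a regular expression. This only locates candidate sequons, which is the step before any prediction. It is not the Nglyc method and does not compute its 315 sequence features. The function name and the example sequence are hypothetical.

```python
import re

# N, followed by any residue except proline (P), followed by S or T.
# The lookahead keeps matches zero-width after N so overlapping sequons are found.
SEQUON = re.compile(r"N(?=[^P][ST])")


def find_sequons(sequence):
    """Return 0-based positions of the N in every N-X-[S/T] sequon (X is not P)."""
    return [match.start() for match in SEQUON.finditer(sequence.upper())]


# Hypothetical peptide: N-G-S at index 1, N-P-T at index 5 (excluded),
# N-N-T at index 9 and N-T-S at index 10 (overlapping).
peptide = "MNGSANPTQNNTSK"
positions = find_sequons(peptide)
print(positions)  # [1, 9, 10]
```

Since not all sequons are glycosylated, each position returned here is only a candidate site that a classifier would still have to accept or reject.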

Methodology for Malware Classification using a Random Forest Classifier

2018 IEEE International Autumn Meeting on Power, Electronics and Computing (ROPEC) ◽

10.1109/ropec.2018.8661441 ◽

2018 ◽

Cited By ~ 4

Author(s):

Carlos Domenick Morales-Molina ◽

Diego Santamaria-Guerrero ◽

Gabriel Sanchez-Perez ◽

Hector Perez-Meana ◽

Aldo Hernandez-Suarez

Keyword(s):

Random Forest ◽

Random Forest Classifier ◽

Malware Classification
