Implementation Of Machine Learning To Determine The Best Employees Using Random Forest Method

2021 ◽  
Vol 2 (02) ◽  
pp. 53-59
Author(s):  
Putri Taqwa Prasetyaningrun ◽  
Irfan Pratama ◽  
Albert Yakobus Chandra

In the workplace, the presence of high-performing employees is a benchmark of a company's progress. The best employees are usually determined by looking at performance, e.g., diligence, discipline, and other achievements. The goal is to optimize decision making when selecting the best employees. The models obtained for employee prediction were tested on a real data set provided by IBM analytics, which includes 29 features and about 22005 samples. In this paper we try to build a system that predicts employee attrition based on a collection of employee data from the Kaggle website. We used four machine learning algorithms, namely KNN (K-Nearest Neighbors), Naïve Bayes, Decision Tree, and Random Forest, plus two ensemble techniques, namely stacking and bagging. Results are expressed in terms of classic metrics, and the algorithm that produces the best result for the available data set is the Random Forest classifier. It achieves the best recall (0.88), matching the stacking and bagging methods, which reach the same value.

mSphere ◽  
2017 ◽  
Vol 2 (6) ◽  
Author(s):  
Xiang Gao ◽  
Huaiying Lin ◽  
Qunfeng Dong

ABSTRACT Dysbiosis of microbial communities is associated with various human diseases, raising the possibility of using microbial compositions as biomarkers for disease diagnosis. We have developed a Bayes classifier by modeling microbial compositions with Dirichlet-multinomial distributions, which are widely used to model multicategorical count data with extra variation. The parameters of the Dirichlet-multinomial distributions are estimated from training microbiome data sets based on maximum likelihood. The posterior probability of a microbiome sample belonging to a disease or healthy category is calculated based on Bayes’ theorem, using the likelihood values computed from the estimated Dirichlet-multinomial distribution, as well as a prior probability estimated from the training microbiome data set or previously published information on disease prevalence. When tested on real-world microbiome data sets, our method, called DMBC (for Dirichlet-multinomial Bayes classifier), shows better classification accuracy than the only existing Bayesian microbiome classifier based on a Dirichlet-multinomial mixture model and the popular random forest method. The advantage of DMBC is its built-in automatic feature selection, capable of identifying a subset of microbial taxa with the best classification accuracy between different classes of samples based on cross-validation. This unique ability enables DMBC to maintain and even improve its accuracy at modeling species-level taxa. The R package for DMBC is freely available at https://github.com/qunfengdong/DMBC. IMPORTANCE By incorporating prior information on disease prevalence, Bayes classifiers have the potential to estimate disease probability better than other common machine-learning methods. Thus, it is important to develop Bayes classifiers specifically tailored for microbiome data. 
Our method shows higher classification accuracy than the only existing Bayesian classifier and the popular random forest method, and thus provides an alternative option for using microbial compositions for disease diagnosis.
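
The classification rule described in this abstract can be illustrated with a minimal sketch. It is not the DMBC R package and omits its feature selection; it only assumes that the Dirichlet-multinomial parameters of each class have already been estimated. The function names, the three-taxon parameter vectors, and the prior values are hypothetical and chosen purely for illustration.

```python
from math import lgamma, exp, log


def dm_log_likelihood(counts, alpha):
    """Log-probability of a count vector under a Dirichlet-multinomial
    distribution with concentration parameters alpha."""
    n = sum(counts)
    a = sum(alpha)
    ll = lgamma(n + 1) + lgamma(a) - lgamma(n + a)
    for x, al in zip(counts, alpha):
        ll += lgamma(x + al) - lgamma(al) - lgamma(x + 1)
    return ll


def posterior(counts, alphas, priors):
    """Bayes' theorem: posterior over classes from per-class
    Dirichlet-multinomial likelihoods and class priors."""
    log_joint = {
        c: log(priors[c]) + dm_log_likelihood(counts, alphas[c])
        for c in alphas
    }
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(log_joint.values())
    z = sum(exp(v - m) for v in log_joint.values())
    return {c: exp(v - m) / z for c, v in log_joint.items()}


# Hypothetical parameters for three taxa (assumed, not from the paper).
alphas = {"healthy": [8.0, 1.0, 1.0], "disease": [1.0, 1.0, 8.0]}
# Hypothetical prior, e.g. taken from published disease prevalence.
priors = {"healthy": 0.9, "disease": 0.1}

sample = [2, 3, 40]  # read counts per taxon for one sample
post = posterior(sample, alphas, priors)
print(max(post, key=post.get))
```

Working in log space with `lgamma` avoids overflow for realistic read counts, and the prior enters only as an additive term, which is where information on disease prevalence would be supplied.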


Author(s):  
Hitarth Deepak Shah ◽  
Chintan M. Bhatt ◽  
Shubham Mitul Patel ◽  
Jayshil Bhavin Khajanchi ◽  
Jaimin Narendrakumar Makwana

India has been the largest milk-producing country in the world for two decades. About 400 million litres of milk are produced every day. It is the responsibility of the dairy sector to look after the farmers by providing them with various services for their livelihood. The growing financial capital of the dairy industry has attracted various kinds of fraudulent behaviour. The majority of suspicious activities are seen during collection at local collection centres: fake farmer entries, manually tampered quantity and fat entries, and adulteration are the most prominent malpractices exercised by farmers. In this research work, the authors present an in-depth study of the most popular machine learning methods applied to the problems of farmer churn prediction and fraud detection in dairies. They applied a range of machine learning algorithms to get accurate results for churn and fraud detection. The XGBoost classifier was the best for churn prediction with 93% accuracy, while the random forest classifier turned out to be effective for fraud detection with 94% accuracy.


2007 ◽  
Vol 64 (5) ◽  
pp. 1619-1635 ◽  
Author(s):  
A. Deloncle ◽  
R. Berk ◽  
F. D’Andrea ◽  
M. Ghil

Abstract Two novel statistical methods are applied to the prediction of transitions between weather regimes. The methods are tested using a long, 6000-day simulation of a three-layer, quasigeostrophic (QG3) model on the sphere at T21 resolution. The two methods are the k nearest neighbor classifier and the random forest method. Both methods are widely used in statistical classification and machine learning; they are applied here to forecast the break of a regime and subsequent onset of another one. The QG3 model has been previously shown to possess realistic weather regimes in its northern hemisphere and preferred transitions between these have been determined. The two methods are applied to the three more robust transitions; they both demonstrate a skill of 35%–40% better than random and are thus encouraging for use on real data. Moreover, the random forest method allows one, while keeping the overall skill unchanged, to efficiently adjust the ratio of correctly predicted transitions to false alarms. A long-standing conjecture has associated regime breaks and preferred transitions with distinct directions in the reduced model phase space spanned by a few leading empirical orthogonal functions of its variability. Sensitivity studies for several predictors confirm the crucial influence of the exit angle on a preferred transition path. The present results thus support the paradigm of multiple weather regimes and their association with unstable fixed points of atmospheric dynamics.


2019 ◽  
Vol 15 (3) ◽  
pp. 28-45
Author(s):  
Selen Yilmaz Isikhan ◽  
Erdem Karabulut ◽  
Afshin Samadi ◽  
Saadettin Kılıçkap

In this study, the error-adjusted bagging technique is adapted to support vector regression (SVR) and regression tree (RT) methods to obtain more accurate predictions, and the performance of the methods is then evaluated with real data sets and a simulation study. For this purpose, the prediction performances of single models, classical bagging models, and error-adjusted bagging models obtained via complementary versions of the above-mentioned methods are computed. The comparison is mainly based on a real dataset of 295 patients with Hodgkin's lymphoma (HL). The effect of several parameters on model performance, such as the training set ratio and the number of influential predictors, is examined with 500 repetitions of simulation data. The results reveal that the error-adjusted bagging method provides the best performance compared to both the single and the classical bagging versions of the methods. Furthermore, the bias-variance analysis confirms the success of this technique in reducing both bias and variance.


Millions of users around the world are engaged with social networking sites. Social sites such as Twitter and Facebook have a large impact on everyday life, and user interactions on them occasionally have unwanted consequences. Because they can be used to spread large amounts of inappropriate and harmful content, social networking sites have become a target platform for spammers. Twitter is a prime example: it has become one of the most important platforms for an unreasonable amount of spam, with fake users tweeting to promote websites or services, which seriously affects legitimate users and also wastes resources. As the opportunities to spread unusual and harmful information grow, so does the number of fake identities that propagate invalid data. Detecting fake users on Twitter is therefore a natural topic for research on current online social networks (OSN). This paper uses a random forest classifier and the ROC curve to detect fake users.


2018 ◽  
Vol 10 (5) ◽  
pp. 1-12
Author(s):  
B. Nassih ◽  
A. Amine ◽  
M. Ngadi ◽  
D. Naji ◽  
N. Hmina

2020 ◽  
Vol 27 (6) ◽  
pp. 37-55
Author(s):  
E. V. Zarova ◽  
E. I. Dubravskaya

The topic of quantitative research on informal employment has a consistently high relevance both in the Russian Federation and in other countries due to its high dependence on cyclicality and crisis stages in the economic dynamics of countries with any level of economic development. Developing effective government policy measures to overcome the negative impact of informal employment requires special attention in theoretical and applied research to assessing the factors and conditions of informal employment in the Russian Federation, including at the regional level. Such effects of informal employment as a shortfall in taxes, potential losses in production efficiency, and negative social consequences are a concern for the authorities at the federal and regional levels. The development of quantitative indicators to determine the level of informal employment in the regions, taking into account their specifics in the general spatial and economic system of Russia, is necessary to overcome these negative effects. The article proposes and tests methods for solving the problem of assessing the impact of hierarchical relationships on macroeconomic factors at the regional level of informal employment in constituent entities of the Russian Federation. The majority of works on informal employment are based on basic statistical methods of spatial-dynamic analysis, as well as on the now «traditional» methods of cluster and correlation-regression analysis. Without diminishing the merits of these methods, it should be noted that they are somewhat limited in identifying hidden structural connections and interdependencies in such a complex multidimensional phenomenon as informal employment.
In order to substantiate the possibility of overcoming these limitations, the article proposes indicators of regional statistics that directly and indirectly characterize informal employment and also presents the possibilities of using the «random forest» method to identify groups of constituent entities of the Russian Federation that have similar macroeconomic factors of informal employment. The novelty of this method in terms of the research objectives is that it allows one to assess the impact of macroeconomic indicators of regional development on the level of informal employment, taking into account implicit hierarchical relationships of factor indicators that are not predetermined by the initial hypotheses. Based on a generalization of the studies presented in the literature, as well as their own statistical calculations using Rosstat data, the authors conclude that macroeconomic parameters of regional development and systemic relationships of macroeconomic indicators are highly important in explaining the differentiation of the level of informal employment across the constituent entities of the Russian Federation.


2020 ◽  
Vol 27 (3) ◽  
pp. 178-186 ◽  
Author(s):  
Ganesan Pugalenthi ◽  
Varadharaju Nithya ◽  
Kuo-Chen Chou ◽  
Govindaraju Archunan

Background: N-glycosylation is one of the most important post-translational mechanisms in eukaryotes. N-glycosylation predominantly occurs in the N-X-[S/T] sequon, where X is any amino acid other than proline. However, not all N-X-[S/T] sequons in proteins are glycosylated. Therefore, accurate prediction of N-glycosylation sites is essential to understand the N-glycosylation mechanism. Objective: In this article, our motivation is to develop a computational method to predict N-glycosylation sites in eukaryotic protein sequences. Methods: In this article, we report a random forest method, Nglyc, to predict N-glycosylation sites from protein sequence, using 315 sequence features. The method was trained using a dataset of 600 N-glycosylation sites and 600 non-glycosylation sites and tested on a dataset containing 295 N-glycosylation sites and 253 non-glycosylation sites. Nglyc prediction was compared with the NetNGlyc, EnsembleGly and GPP methods. Further, the performance of Nglyc was evaluated using human and mouse N-glycosylation sites. Results: The Nglyc method achieved an overall training accuracy of 0.8033 with all 315 features. Performance comparison with the NetNGlyc, EnsembleGly and GPP methods shows that Nglyc performs better than the other methods, with high sensitivity and specificity rates. Conclusion: Our method achieved an overall accuracy of 0.8248 with 0.8305 sensitivity and 0.8182 specificity. The comparison study shows that our method performs better than the other methods. The applicability and success of our method were further evaluated using human and mouse N-glycosylation sites. The Nglyc method is freely available at https://github.com/bioinformaticsML/Ngly.
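
To make the relation between the reported accuracy, sensitivity, and specificity concrete, the short sketch below computes the three metrics from confusion-matrix counts. The abstract does not give these counts: the values used here (245 true positives among the 295 test sites, 207 true negatives among the 253 non-sites) are an assumption, back-calculated so that they reproduce the reported rates. The function name is likewise hypothetical.

```python
def classification_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) from confusion counts."""
    sensitivity = tp / (tp + fn)  # fraction of true sites recovered
    specificity = tn / (tn + fp)  # fraction of non-sites correctly rejected
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy


# Assumed counts (not stated in the abstract), consistent with a test set of
# 295 N-glycosylation sites and 253 non-glycosylation sites.
tp, fn = 245, 50
tn, fp = 207, 46

sens, spec, acc = classification_metrics(tp, fn, tn, fp)
print(f"{sens:.4f} {spec:.4f} {acc:.4f}")  # prints "0.8305 0.8182 0.8248"
```

Under these assumed counts the three reported figures are mutually consistent: accuracy is the average of sensitivity and specificity weighted by the sizes of the positive and negative test sets.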


Author(s):  
Carlos Domenick Morales-Molina ◽  
Diego Santamaria-Guerrero ◽  
Gabriel Sanchez-Perez ◽  
Hector Perez-Meana ◽  
Aldo Hernandez-Suarez
