Feature Selection for Text Classification Using Machine Learning Approaches

Author(s):  
K. Thirumoorthy ◽  
K. Muneeswaran
Mathematics ◽  
2021 ◽  
Vol 9 (11) ◽  
pp. 1226
Author(s):  
Saeed Najafi-Zangeneh ◽  
Naser Shams-Gharneh ◽  
Ali Arjomandi-Nezhad ◽  
Sarfaraz Hashemkhani Zolfani

Companies always seek ways to make their professional employees stay with them to reduce extra recruiting and training costs. Predicting whether a particular employee may leave or not will help the company to make preventive decisions. Unlike physical systems, human resource problems cannot be described by a scientific-analytical formula. Therefore, machine learning approaches are the best tools for this aim. This paper presents a three-stage (pre-processing, processing, post-processing) framework for attrition prediction. An IBM HR dataset is chosen as the case study. Since there are several features in the dataset, the “max-out” feature selection method is proposed for dimension reduction in the pre-processing stage. This method is implemented for the IBM HR dataset. The coefficient of each feature in the logistic regression model shows the importance of the feature in attrition prediction. The results show improvement in the F1-score performance measure due to the “max-out” feature selection method. Finally, the validity of parameters is checked by training the model for multiple bootstrap datasets. Then, the average and standard deviation of parameters are analyzed to check the confidence value of the model’s parameters and their stability. The small standard deviation of parameters indicates that the model is stable and is more likely to generalize well.


2011 ◽  
Vol 268-270 ◽  
pp. 697-700
Author(s):  
Rui Xue Duan ◽  
Xiao Jie Wang ◽  
Wen Feng Li

As the volume of online short text documents grow tremendously on the Internet, it is much more urgent to solve the task of organizing the short texts well. However, the traditional feature selection methods cannot suitable for the short text. In this paper, we proposed a method to incorporate syntactic information for the short text. It emphasizes the feature which has more dependency relations with other words. The classifier SVM and machine learning environment Weka are involved in our experiments. The experiment results show that incorporate syntactic information in the short text, we can get more powerful features than traditional feature selection methods, such as DF, CHI. The precision of short text classification improved from 86.2% to 90.8%.


2015 ◽  
Vol 8 (7) ◽  
pp. 5419-5435 ◽  
Author(s):  
W. Paja ◽  
M. Wrzesień ◽  
R. Niemiec ◽  
W. R. Rudnicki

Abstract. The climate models are extremely complex pieces of software. They reflect best knowledge on physical components of the climate, nevertheless, they contain several parameters, which are too weakly constrained by observations, and can potentially lead to a crash of simulation. Recently a study by Lucas et al. (2013) has shown that machine learning methods can be used for predicting which combinations of parameters can lead to crash of simulation, and hence which processes described by these parameters need refined analyses. In the current study we reanalyse the dataset used in this research using different methodology. We confirm the main conclusion of the original study concerning suitability of machine learning for prediction of crashes. We show, that only three of the eight parameters indicated in the original study as relevant for prediction of the crash are indeed strongly relevant, three other are relevant but redundant, and two are not relevant at all. We also show that the variance due to split of data between training and validation sets has large influence both on accuracy of predictions and relative importance of variables, hence only cross-validated approach can deliver robust prediction of performance and relevance of variables.


2022 ◽  
pp. 171-195
Author(s):  
Jale Bektaş

Conducting NLP for Turkish is a lot harder than other Latin-based languages such as English. In this study, by using text mining techniques, a pre-processing frame is conducted in which TF-IDF values are calculated in accordance with a linguistic approach on 7,731 tweets shared by 13 famous economists in Turkey, retrieved from Twitter. Then, the classification results are compared with four common machine learning methods (SVM, Naive Bayes, LR, and integration LR with SVM). The features represented by the TF-IDF are experimented in different N-grams. The findings show the success of a text classification problem is relative with the feature representation methods, and the performance superiority of SVM is better compared to other ML methods with unigram feature representation. The best results are obtained via the integration method of SVM with LR with the Acc of 82.9%. These results show that these methodologies are satisfying for the Turkish language.


2019 ◽  
Vol 26 (3) ◽  
pp. 1810-1826 ◽  
Author(s):  
Behnaz Raef ◽  
Masoud Maleki ◽  
Reza Ferdousi

The aim of this study is to develop a computational prediction model for implantation outcome after an embryo transfer cycle. In this study, information of 500 patients and 1360 transferred embryos, including cleavage and blastocyst stages and fresh or frozen embryos, from April 2016 to February 2018, were collected. The dataset containing 82 attributes and a target label (indicating positive and negative implantation outcomes) was constructed. Six dominant machine learning approaches were examined based on their performance to predict embryo transfer outcomes. Also, feature selection procedures were used to identify effective predictive factors and recruited to determine the optimum number of features based on classifiers performance. The results revealed that random forest was the best classifier (accuracy = 90.40% and area under the curve = 93.74%) with optimum features based on a 10-fold cross-validation test. According to the Support Vector Machine-Feature Selection algorithm, the ideal numbers of features are 78. Follicle stimulating hormone/human menopausal gonadotropin dosage for ovarian stimulation was the most important predictive factor across all examined embryo transfer features. The proposed machine learning-based prediction model could predict embryo transfer outcome and implantation of embryos with high accuracy, before the start of an embryo transfer cycle.


Sign in / Sign up

Export Citation Format

Share Document