A Comparative Study of Classification of Occupational Stress in the Insurance Sector Using Machine Learning and Filter Feature Selection Techniques

Revista Gestão Inovação e Tecnologias ◽

10.47059/revistageintec.v11i4.2623 ◽

2021 ◽

pp. 5781-5801

Author(s):

Arshad Hashmi

Keyword(s):

Feature Selection ◽

Occupational Stress ◽

Information Gain ◽

Principal Component ◽

Classification Model ◽

Support Vector ◽

Svm Classifier ◽

Selection Methods ◽

Data Set ◽

Insurance Sector

In recent years, occupational stress mining has attracted wide research interest. The primary purpose of this study is to analyze filter feature selection methods for an efficient occupational stress classification model. We propose and examine seven different filter feature selection techniques, such as Chi-Square, Information Gain, Information Gain Ratio, Correlation, Principal Component Analysis, and Relief. The selected features are then used with popular classifiers, namely Naive Bayes (NB), Random Forest (RF), Support Vector Machine (SVM), Artificial Neural Network (ANN), and Gradient Boosted Trees (GBT), to detect occupational stress in the insurance sector. A survey-based primary psychological occupational stress data set is used to evaluate the relative performance of these methods. The study demonstrates the significance of filter feature selection methods and shows how accurately they can help classify stress levels. Correlation-based feature selection with the SVM classifier obtained the best performance compared to the other filter feature selection methods and classification models.
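
As an illustration of one of the filter methods named above, the following is a minimal sketch of Information Gain scoring for categorical features. The toy survey items (`workload`, `shift`) and the `stress` labels are hypothetical and are not taken from the study's data set; the code only shows how a filter ranks features before any classifier is trained.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Label entropy minus the expected label entropy after splitting on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy survey data: two categorical items and a stress label.
workload = ["high", "high", "low", "low"]
shift = ["day", "night", "day", "night"]
stress = ["yes", "yes", "no", "no"]

scores = {
    "workload": information_gain(workload, stress),
    "shift": information_gain(shift, stress),
}
# Features sorted from most to least informative.
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

In this toy case `workload` separates the classes perfectly and `shift` carries no information, so a filter keeping the top-ranked feature would retain only `workload`.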


A fuzzy gaussian rank aggregation ensemble feature selection method for microarray data

International Journal of Knowledge-based and Intelligent Engineering Systems ◽

10.3233/kes-190134 ◽

2021 ◽

Vol 24 (4) ◽

pp. 289-301

Author(s):

B. Venkatesh ◽

J. Anuradha

Keyword(s):

Feature Selection ◽

Microarray Data ◽

Classification Accuracy ◽

Performance Metrics ◽

Feature Selection Method ◽

Selection Method ◽

Support Vector ◽

Svm Classifier ◽

Binary Particle Swarm Optimization ◽

Selection Methods

In microarray data, high classification accuracy is difficult to achieve because of high dimensionality and irrelevant, noisy data. Such data also contain many gene expression values but few samples. To increase the classification accuracy and the processing speed of the model, an optimal number of features needs to be extracted, which can be achieved by applying a feature selection method. In this paper, we propose a hybrid ensemble feature selection method. The proposed method has two phases, a filter phase and a wrapper phase. In the filter phase, an ensemble technique aggregates the feature ranks of the Relief, minimum Redundancy Maximum Relevance (mRMR), and Feature Correlation (FC) filter feature selection methods. This paper uses Fuzzy Gaussian membership function ordering to aggregate the ranks. In the wrapper phase, Improved Binary Particle Swarm Optimization (IBPSO) is used to select the optimal features, and an RBF kernel-based Support Vector Machine (SVM) classifier is used as the evaluator. The performance of the proposed model is compared with state-of-the-art feature selection methods on five benchmark datasets. Accuracy, Recall, Precision, and F1-Score are used as evaluation metrics. The experimental results show that the proposed method outperforms the other feature selection methods.
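
The filter phase described above combines several ranked lists into one. The abstract does not give the exact aggregation formula, so the sketch below is only one plausible reading: each rank is mapped to a Gaussian membership value (an assumed form, with an assumed `sigma`), and memberships are averaged across filters. The gene names and rankings are hypothetical.

```python
import math


def gaussian_membership(rank, sigma):
    """Map a 1-based rank to (0, 1]; rank 1 gets membership 1.0 (assumed form)."""
    return math.exp(-((rank - 1) ** 2) / (2 * sigma ** 2))


def aggregate_ranks(rankings, sigma=2.0):
    """Average the Gaussian memberships of each feature across several ranked lists."""
    totals = {feature: 0.0 for feature in rankings[0]}
    for ranking in rankings:
        for position, feature in enumerate(ranking, start=1):
            totals[feature] += gaussian_membership(position, sigma)
    averaged = {f: total / len(rankings) for f, total in totals.items()}
    return sorted(averaged, key=averaged.get, reverse=True)


# Hypothetical rankings of four genes by three filters (best first).
relief = ["g1", "g2", "g3", "g4"]
mrmr = ["g2", "g1", "g4", "g3"]
fc = ["g1", "g3", "g2", "g4"]

combined = aggregate_ranks([relief, mrmr, fc])
print(combined)
```

The aggregated order would then be passed to the wrapper phase, which is not sketched here.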


Email Worm Detection Using Data Mining

Techniques and Applications for Advanced Information Privacy and Security ◽

10.4018/978-1-60566-210-7.ch002 ◽

2011 ◽

pp. 20-34

Author(s):

Mohammad M. Masud ◽

Latifur Khan ◽

Bhavani Thuraisingham

Keyword(s):

Data Mining ◽

Feature Selection ◽

Principal Component ◽

Classification Model ◽

Support Vector ◽

Two Phase ◽

Feature Selection Technique ◽

Worm Detection ◽

Phase Selection ◽

Using Data

This chapter applies data mining techniques to detect email worms. Email messages contain a number of different features, such as the total number of words in the message body or subject, the presence or absence of binary attachments, the type of attachments, and so on. The goal is to obtain an efficient classification model based on these features. The solution consists of several steps. First, the number of features is reduced using two different approaches: feature selection and dimension reduction. This step is necessary to reduce noise and redundancy in the data. The feature-selection technique is called Two-phase Selection (TPS), which is a novel combination of a decision tree and a greedy selection algorithm. The dimension reduction is performed by Principal Component Analysis. Second, the reduced data is used to train a classifier. Different classification techniques have been used, such as Support Vector Machine (SVM), Naïve Bayes, and their combination. Finally, the trained classifiers are tested on a dataset containing both known and unknown types of worms. These results have been compared with published results. The proposed TPS selection combined with SVM classification achieves the best accuracy in detecting both known and unknown types of worms.


To use or not to use: Feature selection for sentiment analysis of highly imbalanced data

Natural Language Engineering ◽

10.1017/s1351324917000298 ◽

2017 ◽

Vol 24 (1) ◽

pp. 3-37 ◽

Cited By ~ 5

Author(s):

SANDRA KÜBLER ◽

CAN LIU ◽

ZEESHAN ALI SAYYED

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Sentiment Analysis ◽

Information Gain ◽

Binary Classification ◽

Small Subset ◽

Large Set ◽

Learning Approaches ◽

Selection Methods ◽

Data Set

We investigate feature selection methods for machine learning approaches in sentiment analysis. More specifically, we use data from the cooking platform Epicurious and attempt to predict ratings for recipes based on user reviews. In machine learning approaches to such tasks, it is common to use word or part-of-speech n-grams. This results in a large set of features, out of which only a small subset may be good indicators for the sentiment. One of the questions we investigate concerns the extension of feature selection methods from a binary classification setting to a multi-class problem. We show that an inherently multi-class approach, multi-class information gain, outperforms ensembles of binary methods. We also investigate how to mitigate the effects of extreme skewing in our data set by making our features more robust and by using review and recipe sampling. We show that over-sampling is the best method for boosting performance on the minority classes, but it also results in a severe drop in overall accuracy of at least 6 percentage points.
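
To make the over-sampling idea concrete, here is a minimal sketch of plain random over-sampling, in which minority-class examples are duplicated until every class is as large as the biggest one. This is a generic technique and an assumption on our part; the abstract does not specify how the authors' review and recipe sampling works. The reviews and ratings below are hypothetical.

```python
import random
from collections import Counter


def oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority-class samples until all classes match the largest."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    target = max(len(items) for items in by_class.values())
    out_samples, out_labels = [], []
    for label, items in by_class.items():
        # Draw extra copies only for classes smaller than the target size.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for sample in items + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


# Hypothetical skewed data: most reviews carry the top rating.
reviews = ["great", "loved it", "tasty", "perfect", "bland", "burnt"]
ratings = [4, 4, 4, 4, 1, 2]
new_reviews, new_ratings = oversample(reviews, ratings)
print(Counter(new_ratings))
```

Over-sampling should be applied to the training split only; duplicating examples before splitting would leak copies into the test set.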


Feature Selection Algorithm for Hyperlipidemia Classification

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.701-702.110 ◽

2014 ◽

Vol 701-702 ◽

pp. 110-113

Author(s):

Qi Rui Zhang ◽

He Xian Wang ◽

Jiang Wei Qin

Keyword(s):

Feature Selection ◽

Nearest Neighbor ◽

Information Gain ◽

Classification Systems ◽

Support Vector ◽

K Nearest Neighbor ◽

Data Set ◽

Document Frequency ◽

Selection Algorithms ◽

Term Weights

This paper reports a comparative study of feature selection algorithms on a hyperlipidemia data set. Three feature selection methods were evaluated: document frequency (DF), information gain (IG), and the χ2 statistic (CHI). The classification systems use a vector to represent a document and use tfidfie (term frequency, inverted document frequency, and inverted entropy) to compute term weights. To compare the effectiveness of feature selection, we used three classification methods: Naïve Bayes (NB), k Nearest Neighbor (kNN), and Support Vector Machines (SVM). The experimental results show that IG and CHI significantly outperform DF, and that SVM and NB are more effective than kNN when the macro-averaged F1 measure is used. DF is suitable for large text classification tasks.
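
For reference, the χ2 statistic for a single term and a single class can be computed from a 2x2 contingency table of document counts. The sketch below uses the standard closed form for such a table; the counts are hypothetical and are not drawn from the paper's data set.

```python
def chi_square(a, b, c, d):
    """Chi-square statistic for a term/class 2x2 table.

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class without the term
    d: documents outside the class without the term
    """
    n = a + b + c + d
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        return 0.0  # a degenerate table carries no evidence of association
    return n * (a * d - c * b) ** 2 / denominator


# Hypothetical counts: the term appears mostly inside the class.
associated = chi_square(40, 10, 10, 40)
# Hypothetical counts: the term is spread evenly, so it is independent of the class.
independent = chi_square(25, 25, 25, 25)
print(associated, independent)
```

Document frequency, by comparison, is simply `a + b` in the same notation, which is why it is cheaper to compute but ignores class information.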


Comparative Analysis of Selected Heterogeneous Classifiers for Software Defects Prediction Using Filter-Based Feature Selection Methods

FUOYE Journal of Engineering and Technology ◽

10.46792/fuoyejet.v3i1.178 ◽

2018 ◽

Vol 3 (1) ◽

Cited By ~ 3

Author(s):

Abimbola G Akintola ◽

Abdullateef Balogun ◽

Fatimah B Lafenwa-Balogun ◽

Hameed A Mojeed

Keyword(s):

Feature Selection ◽

Multilayer Perceptron ◽

Principal Component ◽

Classification Model ◽

Selection Methods ◽

Software Defects ◽

Classification Techniques ◽

Irrelevant Attributes ◽

Neural Network Classifiers ◽

Data Program

Classification is a popular approach to predicting software defects. It involves categorizing modules, each represented by a set of metrics or code attributes, into fault-prone (FP) and non-fault-prone (NFP) by means of a classification model. However, low-quality, unreliable, redundant, and noisy data negatively affect the process of discovering knowledge and useful patterns. Researchers therefore need to retrieve relevant data from huge records using feature selection methods. Feature selection is the process of identifying the most relevant attributes and removing the redundant and irrelevant ones. In this study, the researchers investigated the effect of filter feature selection on classification techniques in software defect prediction. Ten publicly available datasets from the NASA Metric Data Program software repository were used. The most discriminatory attributes of each dataset were evaluated using Principal Component Analysis (PCA), CFS, and FilterSubsetEval. The datasets were classified by classifiers selected for their heterogeneity: Naïve Bayes from the Bayes category, KNN from the instance-based learner category, the J48 decision tree from the tree classifiers, and the Multilayer Perceptron from the neural network classifiers. The experimental results revealed that applying feature selection to datasets before classification improves software defect prediction and should be encouraged, and that the Multilayer Perceptron with FilterSubsetEval had the best accuracy. It can be concluded that feature selection methods are capable of improving the performance of learning algorithms in software defect prediction.


Ensemble swarm behaviour based feature selection and support vector machine classifier for chronic kidney disease prediction

International Journal of Engineering & Technology ◽

10.14419/ijet.v7i2.31.13438 ◽

2018 ◽

Vol 7 (2.31) ◽

pp. 190 ◽

Cited By ~ 1

Author(s):

S Belina V.J. Sara ◽

K Kalaiselvi

Keyword(s):

Chronic Kidney Disease ◽

Support Vector Machine ◽

Feature Selection ◽

Kidney Disease ◽

Information Gain ◽

Clonal Selection ◽

Prediction Algorithm ◽

Support Vector ◽

Svm Classifier ◽

Classification Algorithms

Kidney disease and kidney failure are among the most complicated and challenging issues in human health. Because some diseases show no symptoms, they are detected only in later stages, which results in dialysis. Advanced data mining technologies can offer ways to deal with this situation by identifying important relations and associations in health-related data. The prediction accuracy of classification algorithms depends on appropriate Feature Selection (FS) algorithms, which decrease the number of features in the collected data. FS is the procedure of choosing the most relevant features and removing irrelevant ones. To identify Chronic Kidney Disease (CKD), a Hybrid Wrapper and Filter based FS (HWFFS) algorithm is proposed to reduce the dimension of the CKD dataset. The filter-based FS algorithm relies on three major functions: Information Gain (IG), Correlation Based Feature Selection (CFS), and Consistency Based Subset Evaluation (CS). The wrapper-based FS algorithm relies on the Enhanced Immune Clonal Selection (EICS) algorithm to choose the most important features from the CKD dataset. The results from these FS algorithms are combined by the new HWFFS algorithm using a classification threshold value. Finally, a Support Vector Machine (SVM) based prediction algorithm is proposed to predict CKD and is evaluated on the MATLAB platform. The results demonstrate that the SVM classifier using the HWFFS algorithm provides a higher prediction rate in the diagnosis of CKD than other classification algorithms.


Classifying Lithofacies from Textural Features in Whole Core CT-Scan Images

SPE Reservoir Evaluation & Engineering ◽

10.2118/205354-pa ◽

2021 ◽

pp. 1-17

Author(s):

Kurdistan Chawshin ◽

Carl F. Berg ◽

Damiano Varagnolo ◽

Andres Gonzalez ◽

Zoya Heidari ◽

...

Keyword(s):

Principal Component ◽

Ct Images ◽

Nondestructive Method ◽

Training Data ◽

Support Vector ◽

Svm Classifier ◽

Statistical Features ◽

Data Set ◽

Unseen Data ◽

Core Description

X-ray computerized tomography (CT) is a nondestructive method of providing information about the internal composition and structure of whole core reservoir samples. In this study we propose a method to classify lithology. The novelty of this method is that it uses statistical and textural information extracted from whole core CT images in a supervised learning environment. In the proposed approaches, first-order statistical features and textural grey-level co-occurrence matrix (GLCM) features are extracted from whole core CT images. Here, two workflows are considered. In the first workflow, the extracted features are used to train a support vector machine (SVM) to classify lithofacies. In the second workflow, a principal component analysis (PCA) step is added before training with two purposes: first, to eliminate collinearity among the features and second, to investigate the amount of information needed to differentiate the analyzed images. Before extracting the statistical features, the images are preprocessed and decomposed using Haar mother wavelet decomposition schemes to enhance the texture and to acquire a set of detail images that are then used to compute the statistical features. The training data set includes lithological information obtained from core description. The approach is validated using the trained SVM and hybrid (PCA + SVM) classifiers to predict lithofacies in a set of unseen data. The obtained results show that the SVM classifier can predict some of the lithofacies with high accuracy (up to 91% recall), but it misclassifies, to some extent, lithofacies with similar grain size, texture, and transport properties. The SVM classifier captures the heterogeneity in the whole core CT images more accurately compared with the core description, indicating that the CT images provide additional high-resolution information not observed by manual core description. Further, the obtained prediction results add information on the similarity of the lithofacies classes. The prediction results using the hybrid classifier are worse than those of the SVM classifier, indicating that low-power components may contain information that is required to differentiate among various lithofacies.
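
To show what a GLCM feature looks like in practice, the following minimal sketch builds a normalized grey-level co-occurrence matrix for a tiny quantized patch and derives the contrast feature from it. The 3x3 patch, the three grey levels, and the horizontal offset are hypothetical choices for illustration; the abstract does not state the offsets, quantization, or feature set the authors used, and this version is non-symmetric (it counts each pixel pair in one direction only).

```python
def glcm(image, levels, dx=1, dy=0):
    """Normalized co-occurrence matrix for pixel pairs offset by (dx, dy)."""
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                counts[image[r][c]][image[r2][c2]] += 1
    total = sum(sum(row) for row in counts)
    return [[n / total for n in row] for row in counts]


def contrast(matrix):
    """GLCM contrast: co-occurrence probabilities weighted by squared grey-level difference."""
    return sum(p * (i - j) ** 2 for i, row in enumerate(matrix) for j, p in enumerate(row))


# Hypothetical 3x3 patch quantized to grey levels 0..2.
patch = [
    [0, 0, 1],
    [0, 0, 1],
    [2, 2, 2],
]
matrix = glcm(patch, levels=3)
print(contrast(matrix))
```

Features such as contrast, computed per image or per detail image, would form the input vector to the SVM in a workflow like the one described.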


Improved Feature-Selection Method Considering the Imbalance Problem in Text Categorization

The Scientific World JOURNAL ◽

10.1155/2014/625342 ◽

2014 ◽

Vol 2014 ◽

pp. 1-17 ◽

Cited By ~ 9

Author(s):

Jieming Yang ◽

Zhaoyang Qu ◽

Zhiying Liu

Keyword(s):

Feature Selection ◽

Text Categorization ◽

Information Gain ◽

Feature Selection Method ◽

Support Vector ◽

Selection Methods ◽

Document Collections ◽

Imbalance Problem ◽

Important Approach ◽

Selection Algorithms

Filtering feature-selection algorithms are an important approach to dimensionality reduction in text categorization. Most filtering feature-selection algorithms evaluate the significance of a feature for a category on the assumption of a balanced dataset and do not consider the imbalance factor of the dataset. In this paper, a new scheme is proposed that can weaken the adverse effect caused by the imbalance factor in the corpus. We evaluated the improved versions of nine well-known feature-selection methods (Information Gain, Chi statistic, Document Frequency, Orthogonal Centroid Feature Selection, DIA association factor, Comprehensive Measurement Feature Selection, Deviation from Poisson Feature Selection, improved Gini index, and Mutual Information) using naïve Bayes and support vector machines on three benchmark document collections (20-Newsgroups, Reuters-21578, and WebKB). The experimental results show that the improved scheme can significantly enhance the performance of the feature-selection methods.


Sentiment classification using hybrid feature selection and ensemble classifier

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-189738 ◽

2021 ◽

pp. 1-10 ◽

Cited By ~ 1

Author(s):

Achin Jain ◽

Vanita Jain

Keyword(s):

Genetic Algorithm ◽

Feature Selection ◽

Information Gain ◽

Statistical Significance ◽

Sentiment Classification ◽

Ensemble Classification ◽

Support Vector ◽

Svm Classifier ◽

Feature Selection Technique ◽

Restaurant Reviews

This paper presents a Hybrid Feature Selection Technique for Sentiment Classification. We use a Genetic Algorithm and a combination of existing Feature Selection methods, namely Information Gain (IG), CHI Square (CHI), and GINI Index (GINI). First, we obtain features from the three selection approaches mentioned above and then perform a set union operation to extract the reduced feature set. Then, a Genetic Algorithm is applied to optimize the feature set further. This paper also presents an Ensemble Approach based on the error rate obtained on different domain datasets. To test our proposed Hybrid Feature Selection and Ensemble Classification approach, we consider four Support Vector Machine (SVM) classifier variants. We use UCI ML datasets from three domains: IMDB Movie Review, Amazon Product Review, and Yelp Restaurant Reviews. The experimental results show that our proposed approach performed best on all three domain datasets. Further, we present a T-Test for statistical significance between classifiers, and a comparison is also made based on Precision, Recall, F1-Score, AUC, and model execution time.
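
The union step described above can be sketched as follows. The per-word scores, the cut-off `k`, and the helper `top_k` are hypothetical and serve only to show how the three filter outputs are merged; the subsequent Genetic Algorithm optimization is not sketched.

```python
def top_k(scores, k):
    """Return the k highest-scoring features as a set (hypothetical helper)."""
    return set(sorted(scores, key=scores.get, reverse=True)[:k])


# Hypothetical scores for five words under three filter criteria.
ig_scores = {"good": 0.9, "bad": 0.8, "movie": 0.1, "plot": 0.3, "the": 0.05}
chi_scores = {"good": 12.0, "bad": 3.0, "movie": 1.0, "plot": 9.0, "the": 0.5}
gini_scores = {"good": 0.4, "bad": 0.5, "movie": 0.45, "plot": 0.1, "the": 0.05}

# Union of the features each filter selects on its own.
reduced = top_k(ig_scores, 2) | top_k(chi_scores, 2) | top_k(gini_scores, 2)
print(sorted(reduced))
```

Taking the union keeps any word that at least one filter considers useful, so the reduced set is larger than any single filter's selection but still drops words that no filter ranks highly (here, `the`).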


Text Classification of Cornell Movie Data using Data Mining with Feature Selection

International Journal of Engineering and Advanced Technology - Regular Issue ◽

10.35940/ijeat.b2329.129219 ◽

2019 ◽

Vol 9 (2) ◽

pp. 2950-2955

Keyword(s):

Feature Selection ◽

Text Mining ◽

Text Classification ◽

Feature Subset Selection ◽

Support Vector ◽

Svm Classifier ◽

Feature Subset ◽

Chi Square ◽

Feature Selection Technique ◽

Data Set

Text classification is a branch of text mining through which we can analyze the sentiment of movie data. In this research paper we apply different preprocessing techniques to reduce the features of the Cornell movie data set. We also apply Correlation-based feature subset selection and the chi-square feature selection technique to gather the most valuable words of each category. A new Cornell movie data set is formed after applying the preprocessing steps and feature selection techniques. We classify the Cornell movie data as positive or negative using various classifiers: Support Vector Machine (SVM), Multilayer Perceptron (MLP), Naive Bayes (NB), Bayes Net (BN), and Random Forest (RF). We also compare the classification accuracy among classifiers and achieve better accuracy, i.e. 87%, with the SVM classifier using a reduced number of features. The suggested classifier can be useful for opinion mining of movie reviews and for the analysis of blogs and other documents.
