Clustering as feature selection method in spam classification: uncovering sick-leave sellers

2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Mariam Elhussein ◽  
Samiha Brahimi

Purpose: This paper proposes a novel way of using textual clustering as a feature selection method. It is applied to identify the most important keywords in profile classification. The method is demonstrated on the problem of sick-leave promoters on Twitter. Design/methodology/approach: Four machine learning classifiers were used on a total of 35,578 tweets posted on Twitter. The data were manually labeled into two categories: promoter and nonpromoter. Classification performance was compared when the proposed clustering feature selection approach and the standard feature selection were applied. Findings: Random forest achieved the highest accuracy, 95.91%, which is higher than that reported in comparable work. Furthermore, using clustering as a feature selection method improved the sensitivity of the model from 73.83% to 98.79%. Sensitivity (recall) is the most important measure of classifier performance when detecting promoters’ accounts that have spam-like behavior. Research limitations/implications: The method applied is novel; more testing is needed on other datasets before its results can be generalized. Practical implications: The model can be used by Saudi authorities to report accounts that sell sick leave online. Originality/value: The research proposes a new way in which textual clustering can be used in feature selection.
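
Because the abstract singles out sensitivity (recall) over accuracy, the following minimal sketch shows how the two measures can diverge on an imbalanced promoter/nonpromoter task. The labels below are hypothetical toy data, not the paper's tweets, and the function names are illustrative.

```python
def confusion_counts(y_true, y_pred, positive="promoter"):
    """Count true positives, false negatives, false positives, true negatives."""
    tp = fn = fp = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        elif pred == positive:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn


def sensitivity(y_true, y_pred, positive="promoter"):
    """Sensitivity (recall) = TP / (TP + FN)."""
    tp, fn, _, _ = confusion_counts(y_true, y_pred, positive)
    return tp / (tp + fn) if (tp + fn) else 0.0


def accuracy(y_true, y_pred):
    """Accuracy = fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical toy labels: 3 promoters, 5 nonpromoters.
y_true = ["promoter"] * 3 + ["nonpromoter"] * 5
# A classifier that misses two of the three promoters.
y_pred = ["promoter", "nonpromoter", "nonpromoter"] + ["nonpromoter"] * 5

print(f"accuracy    = {accuracy(y_true, y_pred):.2f}")
print(f"sensitivity = {sensitivity(y_true, y_pred):.2f}")
```

In this toy case accuracy looks acceptable while most promoters go undetected, which is the situation that makes sensitivity the more informative measure.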

Sentiment analysis plays a major role in e-commerce and social media. With the rapid growth of social media, a huge number of users post their reviews on the Internet and through other channels, and analyzing these data is challenging. In this paper, a new normalization-based feature selection method is proposed; the goal is to select the relevant features, classify the data and measure the accuracy. Stability of the data is considered the most important challenge in analyzing sentiments, and selecting the relevant features from the data set plays a major role. The aim is to work with vector-based feature selection and check the classification performance using recurrent networks. Text mining here depends on feature retrieval methods to improve accuracy, and a single matrix normalization method is proposed to reduce the dimensions. The proposed method performs data preprocessing, sentiment classification and feature reduction to improve accuracy. The experimental results show that the proposed method achieves better accuracy than the N-gram feature selection method and other traditional feature selection approaches, and that it can decrease the implementation time.


2020 ◽  
Author(s):  
Esra Sarac Essiz ◽  
Murat Oturakci

Artificial bee colony (ABC) is a nature-inspired optimization algorithm based on the search behaviour of honey bees. The main aim of this study is to examine the effects of an ABC-based feature selection algorithm on classification performance for cyberbullying, which has become a significant worldwide social issue in recent years. To this end, the classification performance of the proposed ABC-based feature selection method is compared with three traditional methods: information gain, ReliefF and chi-square. Experimental results show that the ABC-based feature selection method outperforms the three traditional methods for the detection of cyberbullying. The macro-averaged F-measure on the data set increased from 0.659 to 0.8 using the proposed ABC-based feature selection method.
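
As an illustration of one of the traditional baselines named above, here is a minimal sketch of chi-square term scoring for a binary text task, using the standard 2x2 contingency formula. It is not the study's ABC method, and the documents, labels and function names are hypothetical.

```python
def chi_square(docs, labels, term, positive):
    """Chi-square statistic of a term against a binary class (2x2 table).

    a: docs with the term in the positive class
    b: docs with the term in the negative class
    c: docs without the term in the positive class
    d: docs without the term in the negative class
    """
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        is_pos = label == positive
        if has_term and is_pos:
            a += 1
        elif has_term:
            b += 1
        elif is_pos:
            c += 1
        else:
            d += 1
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def top_terms(docs, labels, positive, k):
    """Return the k terms with the highest chi-square score (ties broken alphabetically)."""
    vocab = set().union(*docs)
    scored = {t: chi_square(docs, labels, t, positive) for t in vocab}
    return sorted(scored, key=lambda t: (-scored[t], t))[:k]


# Hypothetical toy corpus: each document is a set of tokens.
docs = [{"you", "idiot"}, {"stupid", "idiot"}, {"nice", "day"}, {"you", "nice"}]
labels = ["bully", "bully", "ok", "ok"]

print(top_terms(docs, labels, positive="bully", k=2))
```

Terms that occur equally often in both classes, such as "you" here, score zero and would be filtered out first.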


2015 ◽  
Vol 77 (7) ◽  
Author(s):  
Syamimi Mardiah Shaharum ◽  
Kenneth Sundaraj ◽  
Khaled Helmy

In this work, we show that classification performance on high-dimensional feature data can be improved by applying a feature selection method. One-way ANOVA was used for feature selection, and an artificial neural network (ANN) was used to evaluate its performance. The results show that the ANN trained on features that underwent feature selection achieves better classification accuracy than the ANN trained on features that did not: 93.33% against 80.00%. It can therefore be concluded that feature selection is a crucial step for achieving good performance.
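
A minimal sketch of how one-way ANOVA can rank features: each feature gets an F-statistic (between-class variance over within-class variance), and features with larger F are kept. The feature names and values are hypothetical, and the selection threshold used in the study is not stated in the abstract.

```python
def anova_f(values, labels):
    """One-way ANOVA F-statistic of a single feature across class labels."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)
    n, k = len(values), len(groups)
    grand_mean = sum(values) / n
    # Between-group sum of squares.
    ssb = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values())
    # Within-group sum of squares.
    ssw = sum((v - sum(g) / len(g)) ** 2 for g in groups.values() for v in g)
    if ssw == 0:
        return float("inf") if ssb > 0 else 0.0
    return (ssb / (k - 1)) / (ssw / (n - k))


def rank_features(columns, labels):
    """Return feature names sorted by decreasing F-statistic."""
    scores = {name: anova_f(col, labels) for name, col in columns.items()}
    return sorted(scores, key=lambda name: (-scores[name], name))


# Hypothetical toy data: two features, two classes of three samples each.
labels = [0, 0, 0, 1, 1, 1]
columns = {
    "feature_a": [1, 2, 3, 7, 8, 9],  # class means differ strongly
    "feature_b": [1, 5, 9, 2, 5, 8],  # class means are identical
}

print(rank_features(columns, labels))
```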


2020 ◽  
Vol 38 (4) ◽  
pp. 835-858
Author(s):  
Jiaming Liu ◽  
Liuan Wang ◽  
Linan Zhang ◽  
Zeming Zhang ◽  
Sicheng Zhang

Purpose: The primary objective of this study was to identify critical indicators for predicting blood glucose (BG) through data-driven methods and to compare the prediction performance of four tree-based ensemble models, i.e. bagging with tree regressors (bagging-decision tree [Bagging-DT]), AdaBoost with tree regressors (Adaboost-DT), random forest (RF) and gradient boosting decision tree (GBDT). Design/methodology/approach: This study proposed a majority voting feature selection method that combines lasso regression with the Akaike information criterion (AIC) (LR-AIC), lasso regression with the Bayesian information criterion (BIC) (LR-BIC) and RF to select indicators with excellent predictive performance from an initial 38 indicators in 5,642 samples. The selected features were used to build the tree-based ensemble models. The 10-fold cross-validation (CV) method was used to evaluate the performance of each ensemble model. Findings: The results of feature selection indicated that age, corpuscular hemoglobin concentration (CHC), red blood cell volume distribution width (RBCVDW), red blood cell volume and leucocyte count are the five most important clinical/physical indicators in BG prediction. Furthermore, this study found that the GBDT ensemble model combined with the proposed majority voting feature selection method is better than the other three models with respect to prediction performance and stability. Practical implications: This study proposed a novel BG prediction framework for better predictive analytics in health care. Social implications: This study incorporated medical background and machine learning technology to reduce diabetes morbidity and formulate precise medical schemes. Originality/value: The majority voting feature selection method combined with the GBDT ensemble model provides an effective decision-making tool for predicting BG and detecting diabetes risk in advance.
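
The majority voting step can be sketched as follows, assuming each of the three selectors (LR-AIC, LR-BIC, RF) has already produced its own set of chosen indicators and that "majority" means chosen by more than half of the selectors. The abstract does not spell out the voting rule, and the sets below are hypothetical, not the study's actual selections.

```python
from collections import Counter


def majority_vote(selections, min_votes=None):
    """Keep the features chosen by at least min_votes selectors.

    By default min_votes is a strict majority of the selectors.
    """
    if min_votes is None:
        min_votes = len(selections) // 2 + 1
    votes = Counter(feature for chosen in selections for feature in set(chosen))
    return sorted(f for f, count in votes.items() if count >= min_votes)


# Hypothetical outputs of the three selectors (illustrative only).
lr_aic = {"age", "CHC", "RBCVDW", "height"}
lr_bic = {"age", "CHC"}
rf = {"age", "RBCVDW", "leucocyte count"}

print(majority_vote([lr_aic, lr_bic, rf]))
```

Features picked by only one selector ("height" and "leucocyte count" in this toy input) are dropped, which is what makes the vote more conservative than any single selector.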


2014 ◽  
Vol 2014 ◽  
pp. 1-8 ◽  
Author(s):  
Jianzhong Wang ◽  
Shuang Zhou ◽  
Yugen Yi ◽  
Jun Kong

Feature selection is a key issue in the domain of machine learning and related fields. The results of feature selection can directly affect the classifier’s classification accuracy and generalization performance. Recently, a statistical feature selection method named effective range based gene selection (ERGS) was proposed. However, ERGS only considers the overlapping area (OA) among effective ranges of each class for every feature; it fails to handle the problem of the inclusion relation of effective ranges. To overcome this limitation, a novel efficient statistical feature selection approach called improved feature selection based on effective range (IFSER) is proposed in this paper. In IFSER, an including area (IA) is introduced to characterize the inclusion relation of effective ranges. Moreover, the samples’ proportion for each feature of every class in both OA and IA is also taken into consideration. As a result, IFSER outperforms the original ERGS and some other state-of-the-art algorithms. Experiments on several well-known databases are performed to demonstrate the effectiveness of the proposed method.


Symmetry ◽  
2020 ◽  
Vol 12 (12) ◽  
pp. 1995
Author(s):  
Chunlei Shi ◽  
Jiacai Zhang ◽  
Xia Wu

Autism spectrum disorder (ASD) is a neurodevelopmental disorder originating in infancy and childhood that may cause language barriers and social difficulties. However, in the diagnosis of ASD, current machine learning methods still face many challenges in determining the location of biomarkers. Here, we proposed a novel feature selection method based on the minimum spanning tree (MST) to seek neuromarkers for ASD. First, we constructed an undirected graph whose nodes are the candidate features, and introduced a weight calculation method that considers both feature redundancy and discriminant ability. Second, we utilized the Prim algorithm to construct the MST from the initial graph structure. Third, for each node in the MST, the edge weights of all connected nodes were summed, and the sums were sorted. Then, the N features corresponding to the nodes with the N smallest sums were selected as classification features. Finally, the support vector machine (SVM) algorithm was used to evaluate the discriminant performance of the feature selection method. Comparative experimental results show that the proposed method improved ASD classification performance: the accuracy, sensitivity, and specificity were 86.7%, 87.5%, and 85.7%, respectively.
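
The steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the weight matrix is a hypothetical placeholder (the abstract does not give the formula that combines redundancy and discriminant ability), and the function names are illustrative.

```python
import heapq


def prim_mst(weights):
    """Build a minimum spanning tree with Prim's algorithm.

    weights: symmetric matrix (list of lists) of edge weights for a complete graph.
    Returns the MST as a list of (u, v, weight) edges.
    """
    n = len(weights)
    visited = [False] * n
    visited[0] = True
    heap = [(weights[0][j], 0, j) for j in range(1, n)]
    heapq.heapify(heap)
    edges = []
    while heap and len(edges) < n - 1:
        w, u, v = heapq.heappop(heap)
        if visited[v]:
            continue
        visited[v] = True
        edges.append((u, v, w))
        for j in range(n):
            if not visited[j]:
                heapq.heappush(heap, (weights[v][j], v, j))
    return edges


def select_features(weights, n_select):
    """Select the n_select nodes with the smallest sum of incident MST edge weights."""
    sums = [0.0] * len(weights)
    for u, v, w in prim_mst(weights):
        sums[u] += w
        sums[v] += w
    order = sorted(range(len(weights)), key=lambda i: (sums[i], i))
    return order[:n_select]


# Hypothetical 4-feature weight matrix (placeholder values only).
W = [
    [0, 1, 4, 3],
    [1, 0, 2, 5],
    [4, 2, 0, 6],
    [3, 5, 6, 0],
]

print(prim_mst(W))            # MST edges
print(select_features(W, 2))  # indices of the selected features
```

The selected indices would then be used to slice the feature matrix before training the SVM.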


2020 ◽  
Vol 10 (2) ◽  
pp. 588
Author(s):  
Sang Hoon Lee ◽  
Kwang-Yul Kim ◽  
Yoan Shin

Recently, automatic modulation classification (AMC) schemes have been considered in order to satisfy the requirements of commercial and military communication systems. As a result, various artificial intelligence algorithms such as the deep neural network (DNN), the convolutional neural network (CNN), and the recurrent neural network (RNN) have been studied to improve AMC performance. However, since the AMC process must operate in real time, its computational complexity must be kept low enough. Furthermore, there is a lack of research on the complexity of the AMC process using data-mining methods. In this paper, we propose a correlation coefficient-based effective feature selection method that can maintain the classification performance while reducing the computational complexity of the AMC process. The proposed method calculates the correlation coefficients of second-, fourth-, and sixth-order cumulants with the proposed formula and selects effective features according to the calculated values. A deep learning-based AMC method is used to measure and compare the classification performance. The simulation results indicate that the AMC performance of the proposed method is superior to that of conventional methods even though it uses a small number of features.
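
The abstract does not give the proposed formula, so the sketch below only illustrates the general idea of correlation-based feature reduction: compute the Pearson correlation coefficient between candidate features and drop any feature that is nearly redundant with one already kept. The cumulant names, values and threshold are all hypothetical assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def drop_redundant(features, threshold=0.95):
    """Greedily keep features whose |correlation| with every kept feature is below threshold."""
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical cumulant features measured over four signal samples.
features = {
    "C20": [1.0, 2.0, 3.0, 4.0],
    "C40": [2.0, 4.0, 6.0, 8.0],  # perfectly correlated with C20, so redundant
    "C63": [4.0, 1.0, 3.0, 2.0],
}

print(drop_redundant(features))
```

Fewer input features mean a smaller network input layer, which is one way such a selection step could lower the cost of a real-time classifier.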


Author(s):  
Mehdi Rahnama ◽  
Abolfazl Vahedi ◽  
Arta Mohammad-Alikhani ◽  
Noureddine Takorabet

Purpose: Timely fault diagnosis in electrical machines is a critical issue, as it can prevent a fault from developing and reduce repair time and cost. In brushless synchronous generators, fault diagnosis is even more significant because these machines are widely used to generate electrical power all around the world. Therefore, this study aims to propose a fault detection approach for the brushless synchronous generator. In this approach, a novel extension of the Relief feature selection method is developed. Design/methodology/approach: In this paper, a brushless synchronous machine is modeled with the finite element method (FEM) to evaluate the machine performance under two conditions: the normal condition of the machine and an open-circuit fault in one diode of the rotating rectifier. The harmonic behavior of the terminal voltage of the machine is obtained under both conditions. Then, the harmonic components are ranked by using the extension of Relief to extract the most appropriate components for fault detection. A fault detection approach is then proposed based on the ranked harmonic components and a support vector machine classifier. Findings: The proposed diagnosis approach is verified by an experimental test. Results show that with this approach an open-circuit fault on the diode rectifier can be detected effectively, with an accuracy of 98.5%, using five harmonic components of the terminal voltage [1]. Originality/value: In this paper, a novel feature selection method based on an extension of the Relief method is proposed to select the most effective FFT components, together with FEM modeling of a brushless synchronous generator under the normal condition and a one-diode open-circuit fault.
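
The paper's extension of Relief is not described in the abstract, so the following is only a minimal sketch of the basic Relief weighting scheme it builds on: a feature gains weight when it separates a sample from its nearest neighbour of the other class (the miss) and loses weight when it differs from its nearest neighbour of the same class (the hit). The harmonic amplitudes below are hypothetical, and every sample is visited once instead of being drawn at random.

```python
def relief(X, y):
    """Basic Relief feature weights for a two-class problem.

    X: list of feature vectors, y: list of class labels.
    Each class must contain at least two samples.
    """
    n, d = len(X), len(X[0])
    lo = [min(row[j] for row in X) for j in range(d)]
    hi = [max(row[j] for row in X) for j in range(d)]
    span = [(hi[j] - lo[j]) or 1.0 for j in range(d)]  # avoid division by zero

    def diff(j, a, b):
        # Range-normalized difference of feature j between two samples.
        return abs(a[j] - b[j]) / span[j]

    def dist(a, b):
        return sum(diff(j, a, b) for j in range(d))

    weights = [0.0] * d
    for i in range(n):
        hits = [X[k] for k in range(n) if k != i and y[k] == y[i]]
        misses = [X[k] for k in range(n) if y[k] != y[i]]
        hit = min(hits, key=lambda r: dist(X[i], r))
        miss = min(misses, key=lambda r: dist(X[i], r))
        for j in range(d):
            weights[j] += (diff(j, X[i], miss) - diff(j, X[i], hit)) / n
    return weights


# Hypothetical amplitudes of two harmonic components per measurement.
X = [
    [0.10, 0.50], [0.20, 0.90], [0.15, 0.10],  # normal condition
    [0.90, 0.40], [0.80, 0.95], [0.85, 0.20],  # open-circuit diode fault
]
y = ["normal"] * 3 + ["fault"] * 3

w = relief(X, y)
ranking = sorted(range(len(w)), key=lambda j: -w[j])
print(w, ranking)
```

In this toy input the first component separates the two conditions and receives a positive weight, while the second behaves like noise and receives a negative one; the top-ranked components would then be passed to the classifier.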


2018 ◽  
Vol 7 (2.15) ◽  
pp. 146
Author(s):  
Abdullah Yousef Al-Qammaz ◽  
Farzana Kabir Ahmad ◽  
Yuhanis Yusof

Due to some limitations of current heuristic and evolutionary algorithms, this paper proposes a new swarm-based feature selection method called Social Spider Optimization (SSO-FS). In this research, SSO-FS, which mimics the cooperative behaviour and mechanisms of social spiders in nature, is used in an EEG-based emotion recognition model as a search method to find the optimal feature set that maximizes classification performance. The proposed feature selection method was tested on the DEAP EEG dataset with six subjects and compared with the most popular heuristic algorithms, such as GA, PSO and ABC. The results show that SSO-FS provides performance comparable to that of the existing methods. The maximum accuracy obtained is 66.66% and 70.83%, and the mean accuracy is 55.51 ± 7.17 and 60.97 ± 8.38, for 3-level valence and 3-level arousal emotion classification, respectively.

