CS-BPSO: Hybrid feature selection based on chi-square and binary PSO algorithm for Arabic email authorship analysis

2021 ◽  
pp. 107224
Author(s):  
Wojdan BinSaeedan ◽  
Salwa Alramlawi
2021 ◽  
Author(s):  
◽  
Hoai Nguyen

<p>Classification aims to identify the class label of an instance based on its characteristics or features. Unfortunately, many classification problems have a large feature set containing irrelevant and redundant features, which reduce classification performance. To address this problem, feature selection is used to select a small subset of relevant features. There are three main types of feature selection methods: wrapper, embedded, and filter approaches. Wrappers use a classification algorithm to evaluate candidate feature subsets. In embedded approaches, the selection process is embedded in the training process of a classification algorithm. Unlike the other two approaches, filters do not involve any classification algorithm during the selection process. Feature selection is an important process, but it is not an easy task due to its large search space and complex feature interactions. Because of its potential global search ability, Evolutionary Computation (EC), especially Particle Swarm Optimization (PSO), has been widely and successfully applied to feature selection. However, there is potential to improve the effectiveness and efficiency of EC-based feature selection.  The overall goal of this thesis is to investigate and improve the capability of EC for feature selection, selecting small feature subsets while maintaining or even improving classification performance compared to using all features. Different aspects of feature selection are considered in this thesis, such as the number of objectives (single-objective/multi-objective), the fitness function (filter/wrapper), and the search mechanism.  This thesis introduces a new fitness function based on mutual information, which is calculated by an estimation approach instead of the traditional counting approach. Results show that the estimation approach works well on both continuous and discrete data.
More importantly, mutual information calculated by the estimation approach can capture feature interactions better than the traditional counting approach.  This thesis develops a novel binary PSO algorithm, the first work to redefine core concepts of PSO such as velocity and momentum to suit the characteristics of binary search spaces. Experimental results show that the proposed binary PSO algorithm evolves better solutions than other binary EC algorithms when the search spaces are large and complex. Specifically, on feature selection, the proposed binary PSO algorithm can select smaller feature subsets with similar or better classification accuracy, especially when there are a large number of features.  This thesis also proposes surrogate models for wrapper-based feature selection. The surrogate models use surrogate training sets, which are subsets of informative instances selected from the training set. Experimental results show that the proposed surrogate models assist PSO in reducing the computational cost while maintaining or even improving classification performance compared to using only the original training set.  The thesis develops the first wrapper-based multi-objective feature selection algorithm using MOEA/D. A new decomposition strategy using multiple reference points is designed for MOEA/D, which can deal with characteristics of multi-objective feature selection such as highly discontinuous Pareto fronts and complex relationships between objectives. The experimental results show that the proposed algorithm can evolve more diverse non-dominated sets than other multi-objective algorithms.  Finally, this thesis introduces the first PSO-based feature selection algorithm for transfer learning. In the proposed algorithm, the fitness function uses classification performance to reduce the differences between domains while maintaining the discriminative ability on the target domain.
The experimental results show that the proposed algorithm can select feature subsets that achieve better classification performance than four state-of-the-art feature-based transfer learning algorithms.</p>
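The sigmoid-transfer binary PSO that this line of work builds on can be sketched compactly. The sketch below is a minimal, conventional binary PSO wrapper, not the redefined-velocity variant the thesis proposes; the toy fitness function (five "relevant" features plus a size penalty, standing in for classifier accuracy) and all parameter values are illustrative assumptions:

```python
import math
import random

def binary_pso(fitness, n_features, n_particles=20, iters=50,
               w=0.7, c1=1.5, c2=1.5, seed=0):
    rng = random.Random(seed)
    # particles are bit masks over features; velocities are real-valued
    X = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(n_particles)]
    V = [[0.0] * n_features for _ in range(n_particles)]
    pbest = [x[:] for x in X]
    pfit = [fitness(x) for x in X]
    g = max(range(n_particles), key=lambda i: pfit[i])
    gbest, gfit = pbest[g][:], pfit[g]
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(n_features):
                r1, r2 = rng.random(), rng.random()
                V[i][d] = (w * V[i][d]
                           + c1 * r1 * (pbest[i][d] - X[i][d])
                           + c2 * r2 * (gbest[d] - X[i][d]))
                # sigmoid transfer: velocity becomes the probability of a 1-bit
                X[i][d] = 1 if rng.random() < 1 / (1 + math.exp(-V[i][d])) else 0
            f = fitness(X[i])
            if f > pfit[i]:
                pbest[i], pfit[i] = X[i][:], f
                if f > gfit:
                    gbest, gfit = X[i][:], f
    return gbest, gfit

# toy wrapper fitness: features 0-4 are "relevant"; the size penalty
# rewards small subsets (a real wrapper would use classification accuracy)
RELEVANT = {0, 1, 2, 3, 4}

def toy_fitness(mask):
    return sum(mask[d] for d in RELEVANT) - 0.1 * sum(mask)

best_mask, best_fit = binary_pso(toy_fitness, n_features=20)
```

The sigmoid transfer is exactly the point the thesis criticizes: velocity no longer has a clear meaning in a binary space, which motivates redefining it rather than reinterpreting it as a flip probability.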


Author(s):  
Afdhalul Ihsan ◽  
Ednawati Rainarli

Text classification is the process of grouping documents into categories based on similarity. Two obstacles in text classification are the large number of words that appear in the text and words that occur only rarely (sparse words). One way to address this problem is to perform feature selection. Several filter-based feature selection methods exist, including Chi-Square, Information Gain, the Genetic Algorithm, and Particle Swarm Optimization (PSO). Aghdam's research shows that PSO is the best among these methods. This study examined PSO to optimize the performance of the k-Nearest Neighbour (k-NN) algorithm in categorizing news articles. k-NN is a simple algorithm that is easy to implement; given appropriate features, it is also reliable. The PSO algorithm is used to select keywords (term features), and the documents are then classified using k-NN. The testing process consists of three stages: tuning the k-NN parameter, tuning the PSO parameters, and measuring testing performance. The parameter tuning process determines the number of neighbours used in k-NN and the number of PSO particles, while the performance testing compares k-NN with and without PSO. The optimal number of neighbours is 9, with 50 particles. The tests showed that k-NN with PSO, using a 50% reduction in terms, achieved 20 per cent better accuracy than k-NN without PSO. Although PSO did not always find the optimal conditions, the k-NN method produced better accuracy. In this way, k-NN can work better in grouping news articles, especially Indonesian-language news articles.
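In a wrapper setup like this, PSO scores each candidate term subset by running k-NN restricted to the selected features. A minimal sketch of that evaluation step is below; the toy three-feature "term-frequency" documents and class names are invented for illustration (a real run would use vectors from the Indonesian news corpus):

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k=9, features=None):
    # distance is computed only over the selected term features
    feats = features if features is not None else range(len(x))
    dists = sorted(
        (math.sqrt(sum((xi[f] - x[f]) ** 2 for f in feats)), yi)
        for xi, yi in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# toy term-frequency vectors: features 0-1 separate the classes,
# feature 2 is a noisy term that dominates and misleads the distance
train_X = [[0, 0, 9], [0, 1, 9], [1, 0, 9],
           [5, 5, 0], [5, 4, 0], [4, 5, 0]]
train_y = ['sport', 'sport', 'sport',
           'politics', 'politics', 'politics']
query = [0.5, 0.5, 0]

with_selection = knn_predict(train_X, train_y, query, k=3, features=[0, 1])
without_selection = knn_predict(train_X, train_y, query, k=3)
```

On this toy data, dropping the noisy term flips the prediction to the correct class, which is the mechanism by which PSO's term selection improves k-NN accuracy.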




2018 ◽  
Vol 7 (1) ◽  
pp. 57-72
Author(s):  
H.P. Vinutha ◽  
Poornima Basavaraju

Day by day, network security is becoming a more challenging task. Intrusion detection systems (IDSs) are one of the methods used to monitor network activities. Data mining algorithms play a major role in the field of IDS. The NSL-KDD'99 dataset is used to study network traffic patterns, which helps identify possible attacks taking place on the network. The dataset contains 41 attributes and one class attribute categorized as normal, DoS, Probe, R2L, and U2R. In the proposed methodology, the false positive rate is reduced and the detection rate improved by reducing the dimensionality of the dataset, since using all 41 attributes in detection is not good practice. Four feature selection methods, Chi-Square, Symmetric Uncertainty (SU), Gain Ratio, and Information Gain, are used to evaluate the attributes, and unimportant features are removed to reduce the dimension of the data. Ensemble classification techniques such as Boosting, Bagging, Stacking, and Voting are used to observe the detection rate separately with three base algorithms: Decision Stump, J48, and Random Forest.
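Of the four filters mentioned, chi-square is the simplest to sketch: it scores each attribute by how far the observed attribute/class contingency counts deviate from what independence would predict. A minimal stand-alone version, assuming categorical attribute values, is:

```python
from collections import Counter

def chi_square_score(feature_values, labels):
    # chi-square statistic of the attribute/class contingency table:
    # sum over cells of (observed - expected)^2 / expected
    obs = Counter(zip(feature_values, labels))
    f_tot = Counter(feature_values)
    l_tot = Counter(labels)
    n = len(labels)
    score = 0.0
    for f in f_tot:
        for l in l_tot:
            expected = f_tot[f] * l_tot[l] / n
            score += (obs.get((f, l), 0) - expected) ** 2 / expected
    return score

# an attribute perfectly aligned with the class scores high...
high = chi_square_score([0, 0, 1, 1], [0, 0, 1, 1])
# ...while an attribute independent of the class scores zero
low = chi_square_score([0, 1, 0, 1], [0, 0, 1, 1])
```

Attributes are then ranked by this score and the lowest-ranked ones dropped before training the ensemble classifiers.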


2010 ◽  
Vol 9 ◽  
pp. CIN.S3794 ◽  
Author(s):  
Xiaosheng Wang ◽  
Osamu Gotoh

Gene selection is of vital importance in the molecular classification of cancer using high-dimensional gene expression data. Because of the distinct characteristics inherent to specific cancerous gene expression profiles, developing flexible and robust feature selection methods is crucial. We investigated the properties of a feature selection approach proposed in our previous work, which is a generalization of the feature selection method based on the depended degree of attribute in rough sets. We compared this method with established methods: the depended degree, chi-square, information gain, Relief-F, and symmetric uncertainty, and analyzed its properties through a series of classification experiments. The results revealed that our method was superior to the canonical depended-degree-based method in robustness and applicability, and comparable to the other four commonly used methods. More importantly, the method can reveal the inherent classification difficulty of different gene expression datasets, reflecting the underlying biology of specific cancers.
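The depended degree in rough set theory measures the fraction of instances whose equivalence class under the chosen attributes is consistent in the decision label. The sketch below implements that basic quantity, not the generalized variant the paper proposes, on invented toy data:

```python
from collections import defaultdict

def depended_degree(rows, attr_idx, labels):
    # group instances into equivalence classes by their values
    # on the chosen attributes
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        groups[tuple(row[j] for j in attr_idx)].append(i)
    # positive region: equivalence classes carrying a single decision label
    pos = sum(len(g) for g in groups.values()
              if len({labels[i] for i in g}) == 1)
    return pos / len(rows)

# attribute 0 determines the label completely -> degree 1.0
full = depended_degree([[1, 0], [1, 0], [0, 1], [0, 1]],
                       [0], ['a', 'a', 'b', 'b'])
# one inconsistent equivalence class -> only 1 of 3 instances
# lie in the positive region
partial = depended_degree([[1], [1], [0]], [0], ['a', 'b', 'a'])
```

A filter built on this quantity ranks genes by how much of the dataset they classify consistently on their own.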


2014 ◽  
Vol 2014 ◽  
pp. 1-9 ◽  
Author(s):  
Hongyan Zhang ◽  
Lanzhi Li ◽  
Chao Luo ◽  
Congwei Sun ◽  
Yuan Chen ◽  
...  

In efforts to discover disease mechanisms and improve the clinical diagnosis of tumors, it is useful to mine profiles for informative genes with definite biological meanings and to build robust classifiers with high precision. In this study, we developed a new method for tumor-gene selection, the Chi-square test-based Integrated Rank Gene and Direct Classifier (χ²-IRG-DC). First, we obtained the weighted integrated rank of gene importance from chi-square tests of single and pairwise gene interactions. Then, we sequentially introduced the ranked genes and removed redundant genes by using leave-one-out cross-validation of the Chi-square test-based Direct Classifier (χ²-DC) within the training set to obtain informative genes. Finally, we determined the accuracy on independent test data by utilizing the genes obtained above with χ²-DC. Furthermore, we analyzed the robustness of χ²-IRG-DC by comparing the generalization performance of different models, the efficiency of different feature-selection methods, and the accuracy of different classifiers. An independent test on ten multiclass tumor gene-expression datasets showed that χ²-IRG-DC could efficiently control overfitting and had higher generalization performance. The informative genes selected by χ²-IRG-DC could dramatically improve the independent test precision of other classifiers; meanwhile, the informative genes selected by other feature selection methods also performed well in χ²-DC.
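The sequential introduce-and-prune step with leave-one-out cross-validation can be sketched as follows. A simple nearest-centroid classifier stands in for χ²-DC, whose details are not reproduced here, and the toy two-feature data are invented for illustration:

```python
from collections import defaultdict

def nearest_centroid(train_X, train_y, x):
    # per-class mean vector; predict the closest centroid's label
    # (a simple stand-in for the chi-square test-based direct classifier)
    by_class = defaultdict(list)
    for xi, yi in zip(train_X, train_y):
        by_class[yi].append(xi)
    def sq_dist(pts):
        centroid = [sum(col) / len(pts) for col in zip(*pts)]
        return sum((a - b) ** 2 for a, b in zip(centroid, x))
    return min(by_class, key=lambda c: sq_dist(by_class[c]))

def loocv(X, y, clf):
    # leave-one-out cross-validated accuracy on the training set
    hits = sum(clf(X[:i] + X[i + 1:], y[:i] + y[i + 1:], X[i]) == y[i]
               for i in range(len(X)))
    return hits / len(X)

def forward_select(X, y, ranked, clf):
    # introduce ranked genes one by one; keep a gene only if it
    # improves LOOCV accuracy, otherwise treat it as redundant
    chosen, best = [], 0.0
    for f in ranked:
        trial = chosen + [f]
        acc = loocv([[row[j] for j in trial] for row in X], y, clf)
        if acc > best:
            chosen, best = trial, acc
    return chosen, best

# toy data: feature 0 separates the classes, feature 1 is noise
X = [[0.0, 7], [0.1, 1], [0.2, 4], [1.0, 4], [0.9, 1], [1.1, 7]]
y = ['a', 'a', 'a', 'b', 'b', 'b']
chosen, acc = forward_select(X, y, ranked=[0, 1], clf=nearest_centroid)
```

On this toy data the informative first-ranked feature is kept and the noisy one is discarded, mirroring how the redundancy pruning controls overfitting within the training set.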

