A Novel Community Detection Based Genetic Algorithm for Feature Selection

2020
Author(s):  
Mehrdad Rostami ◽  
Kamal Berahmand ◽  
Saman Forouzandeh

Abstract The selection of features is an essential data preprocessing stage in data mining. The core principle of feature selection is to pick a subset of candidate features by excluding features with almost no predictive information as well as highly correlated, redundant features. In the past several years, a variety of meta-heuristic methods have been introduced to eliminate redundant and irrelevant features as much as possible from high-dimensional datasets. One of the main disadvantages of existing meta-heuristic-based approaches is that they often neglect the correlation between the selected features. In this article, for the purpose of feature selection, the authors propose a genetic algorithm based on community detection, which functions in three steps. In the first step, the feature similarities are calculated. In the second step, the features are grouped into clusters by a community detection algorithm. In the third step, features are picked by a genetic algorithm with a new community-based repair operation. The performance of the presented approach was analyzed on nine benchmark classification problems. The authors also compared the efficiency of the proposed approach with that of four available feature selection algorithms. The findings indicate that the new approach consistently yields improved classification accuracy.

2021
Vol 8 (1)
Author(s):  
Mehrdad Rostami ◽  
Kamal Berahmand ◽  
Saman Forouzandeh

Abstract The selection of features is an essential data preprocessing stage in data mining. The core principle of feature selection is to pick a subset of candidate features by excluding features with almost no predictive information as well as highly correlated, redundant features. In the past several years, a variety of meta-heuristic methods have been introduced to eliminate redundant and irrelevant features as much as possible from high-dimensional datasets. One of the main disadvantages of existing meta-heuristic-based approaches is that they often neglect the correlation between the selected features. In this article, for the purpose of feature selection, the authors propose a genetic algorithm based on community detection, which functions in three steps. In the first step, the feature similarities are calculated. In the second step, the features are grouped into clusters by a community detection algorithm. In the third step, features are picked by a genetic algorithm with a new community-based repair operation. The performance of the presented approach was analyzed on nine benchmark classification problems, and its efficiency was compared with the findings of four available feature selection algorithms. Comparing the proposed method with three recent feature selection methods based on the PSO, ACO, and ABC algorithms on three classifiers showed that its accuracy is on average 0.52% higher than PSO, 1.20% higher than ACO, and 1.57% higher than ABC.
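The three-step pipeline described above can be sketched as follows. This is a minimal illustration, not the paper's exact operators: the threshold-based grouping stands in for a real community detection algorithm, and the one-feature-per-community cap is a simplifying assumption about the repair operation.

```python
def feature_communities(similarity, threshold=0.7):
    """Greedy stand-in for community detection: group features whose
    pairwise similarity exceeds the threshold (hypothetical simplification)."""
    n = len(similarity)
    community = [-1] * n
    next_id = 0
    for i in range(n):
        if community[i] == -1:
            community[i] = next_id
            for j in range(i + 1, n):
                if community[j] == -1 and similarity[i][j] >= threshold:
                    community[j] = next_id
            next_id += 1
    return community

def repair(chromosome, community, max_per_community=1):
    """Community-based repair: keep at most `max_per_community` selected
    features from each community, dropping the rest to curb redundancy."""
    kept = {}
    repaired = list(chromosome)
    for idx, (gene, com) in enumerate(zip(chromosome, community)):
        if gene:
            kept.setdefault(com, []).append(idx)
    for com, members in kept.items():
        for idx in members[max_per_community:]:
            repaired[idx] = 0
    return repaired

# Toy similarity matrix for 4 features: features 0 and 1 are highly correlated.
sim = [[1.0, 0.9, 0.1, 0.2],
       [0.9, 1.0, 0.2, 0.1],
       [0.1, 0.2, 1.0, 0.3],
       [0.2, 0.1, 0.3, 1.0]]
coms = feature_communities(sim)
print(coms)                        # -> [0, 0, 1, 2]
print(repair([1, 1, 1, 0], coms))  # drops one of the redundant pair: [1, 0, 1, 0]
```

A GA would apply such a repair after crossover and mutation, so every chromosome respects the community structure before fitness evaluation.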


2011
Vol 2011
pp. 1-19
Author(s):  
Armelle Brun ◽  
Sylvain Castagnos ◽  
Anne Boyer

The number of items that users can now access when navigating the Web is so huge that they might feel lost. Recommender systems are a way to cope with this profusion of data by suggesting items that fit the users' needs. One of the most popular techniques for recommender systems is the collaborative filtering approach, which relies on the preferences for items expressed by users, usually in the form of ratings. In the absence of ratings, classical collaborative filtering techniques cannot be applied. Fortunately, the behavior of users, such as their consultations, can be collected. In this paper, we present a new approach to perform collaborative filtering when no ratings are available but user consultations are known. We propose to take inspiration from local community detection algorithms to form communities of users and deduce the set of mentors of a given user. We adapt one state-of-the-art algorithm to fit the characteristics of collaborative filtering. Experiments show that the precision achieved is higher than the baseline that does not perform any mentor selection. In addition, our model almost offsets the absence of ratings by exploiting a reduced set of mentors.
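The idea of recommending from consultation logs via mentors can be sketched as below. The Jaccard similarity and top-k neighbor selection are simplifying assumptions standing in for the paper's adapted local community detection; the data is invented for illustration.

```python
def jaccard(a, b):
    """Similarity between two users' sets of consulted items."""
    return len(a & b) / len(a | b) if a | b else 0.0

def mentors(user, consultations, k=2):
    """Pick the k most similar users as mentors (a simple stand-in for
    local-community-based mentor selection)."""
    scores = [(jaccard(consultations[user], items), u)
              for u, items in consultations.items() if u != user]
    return [u for _, u in sorted(scores, reverse=True)[:k]]

def recommend(user, consultations, k=2):
    """Suggest items the mentors consulted that the user has not,
    ranked by how many mentors consulted them."""
    seen = consultations[user]
    counts = {}
    for m in mentors(user, consultations, k):
        for item in consultations[m] - seen:
            counts[item] = counts.get(item, 0) + 1
    return sorted(counts, key=lambda i: (-counts[i], i))

logs = {"alice": {"a", "b", "c"},
        "bob":   {"a", "b", "d"},
        "carol": {"a", "c", "e"},
        "dave":  {"f"}}
print(recommend("alice", logs))  # -> ['d', 'e']
```

No ratings appear anywhere: only binary consultation events drive both mentor selection and ranking, which is the situation the paper targets.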


2019
Vol 11 (1)
Author(s):  
Qibo Yang ◽  
Jaskaran Singh ◽  
Jay Lee

For high-dimensional datasets, uninformative features and complex interactions between features can cause high computational costs and make outlier detection algorithms inefficient. Most feature selection methods are designed for supervised classification and regression, and few works specifically address unsupervised outlier detection. This paper proposes a novel isolation-based feature selection (IBFS) method for unsupervised outlier detection, based on the training process of isolation forest. When a point of a feature is used to split the data, the imbalance of the resulting split is measured and used to quantify how strongly this feature can detect outliers. We also compare the proposed method with variance, Laplacian score, and kurtosis. These methods are benchmarked on simulated data to show their characteristics. We then evaluate performance using one-class support vector machine, isolation forest, and local outlier factor on several real-world datasets. The evaluation results show that the proposed method can improve the performance of isolation forest, and that its results are similar to, and sometimes better than, those of another useful outlier indicator, kurtosis, which demonstrates the effectiveness of the proposed method. We also notice that variance and Laplacian score sometimes have similar performance on the datasets.
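The split-imbalance idea can be sketched per feature as follows. This is a loose approximation of the isolation-forest-based scoring, not the paper's exact IBFS statistic: random split points are drawn within the feature's range, and a feature that pushes almost all points to one side of most splits (as a feature with outliers does) scores high.

```python
import random

def ibfs_score(values, n_splits=200, seed=0):
    """Isolation-inspired score (sketch): repeatedly pick a random split
    point in the feature's range and average how unbalanced the split is
    (0 = perfectly balanced, 1 = everything on one side)."""
    rng = random.Random(seed)
    lo, hi = min(values), max(values)
    if lo == hi:
        return 0.0  # constant feature carries no isolating power
    total = 0.0
    for _ in range(n_splits):
        s = rng.uniform(lo, hi)
        left = sum(v < s for v in values)
        total += abs(2 * left - len(values)) / len(values)
    return total / n_splits

# A feature with one extreme outlier scores higher than a uniform one.
outlier_feature = [0.0, 0.1, 0.2, 0.3, 100.0]
uniform_feature = [0.0, 1.0, 2.0, 3.0, 4.0]
print(ibfs_score(outlier_feature) > ibfs_score(uniform_feature))  # True
```

Ranking features by such a score and keeping the top ones mirrors how IBFS filters features before running the downstream outlier detector.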


Feature selection in high-dimensional datasets is a combinatorial problem, as it selects an optimal subset from N-dimensional data with 2^N possible subsets. Genetic algorithms are generally a good choice for feature selection in large datasets, though for some high-dimensional problems they may take a widely varying amount of time: a few seconds, a few hours, or even a few days. It is therefore important to use genetic algorithms that can give quality results within a reasonably acceptable time limit, which makes an efficient implementation necessary. In this paper, a master-slave parallel genetic algorithm is implemented as a feature selection procedure to reduce the time complexity of the sequential genetic algorithm. The paper describes the speed gains of the parallel master-slave genetic algorithm and also presents a theoretical analysis of the optimal number of slaves required for an efficient master-slave implementation. The experiments are performed on three high-dimensional gene expression datasets. Since the genetic algorithm is a wrapper technique and is costly when assessing the importance of each feature, an information gain filter is first applied as a pre-processing step to remove irrelevant features.
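The master-slave scheme parallelizes exactly one thing: fitness evaluation, which dominates the runtime of a wrapper GA. A minimal sketch, assuming a placeholder fitness in place of the real classifier-training step:

```python
from concurrent.futures import ThreadPoolExecutor

def fitness(chromosome):
    """Placeholder fitness (hypothetical): reward selected informative bits
    and penalise subset size. In the paper this step would train and score
    a classifier on the selected features, which is the expensive part."""
    return sum(chromosome) * 10 - len(chromosome)

def evaluate_generation(population, n_slaves=4):
    """Master-slave step: the master farms the costly fitness evaluations
    out to a pool of workers and collects the results in order."""
    with ThreadPoolExecutor(max_workers=n_slaves) as pool:
        return list(pool.map(fitness, population))

population = [[1, 0, 1], [0, 0, 1], [1, 1, 1]]
print(evaluate_generation(population))  # -> [17, 7, 27]
```

Selection, crossover, and mutation stay on the master; only the embarrassingly parallel evaluations are distributed, which is why the optimal number of slaves is a trade-off between evaluation cost and communication overhead.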


Author(s):  
Cheng-San Yang ◽  
Li-Yeh Chuang ◽  
Chao-Hsuan Ke ◽  
Cheng-Hong Yang ◽  
...  

Microarray data referencing gene expression profiles provides valuable answers to a variety of problems and contributes to advances in clinical medicine. The application of microarray data to the classification of cancer types has recently assumed increasing importance. The classification of microarray data samples involves feature selection, whose goal is to identify subsets of differentially expressed genes potentially relevant for distinguishing sample classes, and classifier design. We propose an efficient evolutionary approach for selecting gene subsets from gene expression data that achieves higher accuracy for classification problems. Our proposal combines a shuffled frog-leaping algorithm (SFLA) and a genetic algorithm (GA), and chooses genes (features) relevant to classification. The K-nearest neighbor (KNN) classifier with leave-one-out cross-validation (LOOCV) is used to evaluate classification accuracy. We apply this hybrid SFLA-GA and KNN approach to 11 classification problems from the literature. Experimental results show that the classification accuracy obtained using the selected features was higher than the accuracy on the datasets without feature selection.
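The KNN-with-LOOCV fitness used to score a candidate gene subset can be sketched in a few lines. The toy 2-D data and class labels are invented for illustration; in the paper each candidate subset produced by SFLA-GA would be scored this way.

```python
def knn_loocv_accuracy(X, y, k=1):
    """Leave-one-out KNN accuracy: each sample is classified by its
    nearest neighbours among all the remaining samples."""
    correct = 0
    for i in range(len(X)):
        # Squared Euclidean distance to every other sample, with its label.
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(X[i], X[j])), y[j])
            for j in range(len(X)) if j != i)
        votes = [label for _, label in dists[:k]]
        pred = max(set(votes), key=votes.count)
        correct += pred == y[i]
    return correct / len(X)

# Two well-separated classes on a toy 2-D "gene subset".
X = [(0, 0), (0, 1), (5, 5), (5, 6)]
y = ["healthy", "healthy", "tumour", "tumour"]
print(knn_loocv_accuracy(X, y))  # -> 1.0
```

LOOCV is a natural choice for microarray data because sample counts are tiny relative to the number of genes, so every sample must serve as both training and test data.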


2014
Vol 568-570
pp. 852-857
Author(s):  
Lu Wang ◽  
Yong Quan Liang ◽  
Qi Jia Tian ◽  
Jie Yang ◽  
Chao Song ◽  
...  

Detecting community structure in complex networks has triggered considerable attention in several application domains. This paper proposes a new community detection method based on an improved genetic algorithm (named CDIGA), which tries to find the best community structure by maximizing the network modularity. String encoding is used for the genetic representation. When the initial population is created, some nodes assign their community identifiers to all of their neighbors, which ensures the convergence of the algorithm and eliminates unnecessary iterations. The crossover and mutation operators are improved as well: a one-way crossover strategy is introduced to the crossover process, and the connectivity of the mutated node is preserved in the mutation process. We compared CDIGA with three other algorithms on computer-generated and real-world networks; experimental results show that the improved algorithm is highly effective at discovering community structure.
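The objective CDIGA maximizes is Newman's modularity Q: the fraction of edges inside communities minus the fraction expected under a random degree-preserving rewiring. A minimal sketch on an undirected toy graph (the graph and partition are invented for illustration):

```python
def modularity(edges, community):
    """Newman modularity: Q = sum over communities of
    (intra-community edge fraction) - (degree fraction / 2)^2."""
    m = len(edges)
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    q = 0.0
    for c in set(community.values()):
        intra = sum(community[u] == c and community[v] == c for u, v in edges)
        dc = sum(d for node, d in degree.items() if community[node] == c)
        q += intra / m - (dc / (2 * m)) ** 2
    return q

# Two triangles joined by a single bridge edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
split = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
print(round(modularity(edges, split), 3))  # -> 0.357
```

A GA chromosome in string encoding is exactly such a node-to-community map, so the function above serves directly as the fitness to maximize.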


Author(s):  
Iwan Syarif

Classification problems, especially for high-dimensional datasets, have attracted many researchers seeking efficient approaches to address them. However, classification becomes very complicated when the number of possible combinations of variables is high. In this research, we evaluate the performance of the Genetic Algorithm (GA) and Particle Swarm Optimization (PSO) as feature selection algorithms when applied to high-dimensional datasets. Our experiments show that in terms of dimensionality reduction, PSO is much better than GA: PSO reduced the number of attributes across 8 datasets to 13.47% of the original on average, while GA only reached 31.36% on average. In terms of classification performance, GA is slightly better than PSO: GA-reduced datasets outperform their original versions on 5 of 8 datasets, while PSO-reduced datasets do so on only 3 of 8.

Keywords: feature selection, dimensionality reduction, Genetic Algorithm (GA), Particle Swarm Optimization (PSO).
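For feature selection, PSO is typically run in its binary variant, where each particle is a 0/1 feature mask. One update step can be sketched as follows; the inertia and acceleration constants and the toy masks are illustrative assumptions, not values from the paper.

```python
import math, random

def bpso_step(positions, velocities, pbest, gbest,
              w=0.7, c1=1.5, c2=1.5, rng=None):
    """One binary-PSO update: velocities drift toward the personal and
    global best masks, and a sigmoid of each velocity component gives the
    probability that the corresponding feature bit is set to 1."""
    rng = rng or random.Random(0)
    for i, (x, v) in enumerate(zip(positions, velocities)):
        for d in range(len(x)):
            v[d] = (w * v[d]
                    + c1 * rng.random() * (pbest[i][d] - x[d])
                    + c2 * rng.random() * (gbest[d] - x[d]))
            x[d] = 1 if rng.random() < 1 / (1 + math.exp(-v[d])) else 0

positions = [[1, 0, 1, 1], [0, 1, 0, 0]]       # current feature masks
velocities = [[0.0] * 4 for _ in positions]
pbest = [list(p) for p in positions]           # per-particle best so far
gbest = [1, 0, 0, 1]                           # swarm best mask (hypothetical)
bpso_step(positions, velocities, pbest, gbest)
print(positions)                               # masks drift toward gbest
```

The strong pull toward sparse best-so-far masks is one intuition for why PSO tends to shrink feature subsets aggressively, as the experiments above report.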


2012
Vol 57 (3)
pp. 829-835
Author(s):  
Z. Głowacz ◽  
J. Kozik

The paper describes a procedure for the automatic selection of symptoms accompanying a break in the synchronous motor armature winding coils. This procedure, called feature selection, chooses from the full set of features describing the problem a subset that best distinguishes between healthy and damaged states. The amplitudes of the spectral components of the motor current signals were used as features. The full spectra of the current signals are treated as multidimensional feature spaces, and their subspaces are tested. Particular subspaces are chosen with the aid of a genetic algorithm, and their quality is assessed using the Mahalanobis distance measure. The algorithm searches for the subspaces for which this distance is greatest. The algorithm is very efficient and, as confirmed by the research, leads to good results. The proposed technique has been successfully applied in many other fields of science and technology, including medical diagnostics.
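The distance-based fitness can be sketched as below. For brevity this simplifies the Mahalanobis distance to a diagonal (per-feature variance) covariance, and the spectral amplitudes and labels are invented; the paper's actual measure uses the full covariance structure.

```python
def class_separation(X, y, subset):
    """Fitness sketch: squared Mahalanobis-style distance between the
    two class means over the chosen feature subset, simplified to a
    diagonal covariance (one variance per feature)."""
    def col(rows, d):
        return [r[d] for r in rows]
    a = [x for x, label in zip(X, y) if label == 0]   # healthy
    b = [x for x, label in zip(X, y) if label == 1]   # damaged
    dist = 0.0
    for d in subset:
        mean_a = sum(col(a, d)) / len(a)
        mean_b = sum(col(b, d)) / len(b)
        pooled = col(a, d) + col(b, d)
        mu = sum(pooled) / len(pooled)
        var = sum((v - mu) ** 2 for v in pooled) / len(pooled) or 1e-9
        dist += (mean_a - mean_b) ** 2 / var
    return dist

# Spectral amplitudes for healthy (0) vs damaged (1) motors:
# feature 0 separates the classes, feature 1 does not.
X = [(1.0, 5.0), (1.1, 5.2), (4.0, 5.1), (4.2, 4.9)]
y = [0, 0, 1, 1]
print(class_separation(X, y, [0]) > class_separation(X, y, [1]))  # True
```

A GA maximizing this fitness over subsets of spectral components would therefore converge on features like feature 0, whose amplitudes shift between the healthy and damaged states.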

