A novel community detection based genetic algorithm for feature selection

AbstractThe feature selection is an essential data preprocessing stage in data mining. The core principle of feature selection seems to be to pick a subset of possible features by excluding features with almost no predictive information as well as highly associated redundant features. In the past several years, a variety of meta-heuristic methods were introduced to eliminate redundant and irrelevant features as much as possible from high-dimensional datasets. Among the main disadvantages of present meta-heuristic based approaches is that they are often neglecting the correlation between a set of selected features. In this article, for the purpose of feature selection, the authors propose a genetic algorithm based on community detection, which functions in three steps. The feature similarities are calculated in the first step. The features are classified by community detection algorithms into clusters throughout the second step. In the third step, features are picked by a genetic algorithm with a new community-based repair operation. Nine benchmark classification problems were analyzed in terms of the performance of the presented approach. Also, the authors have compared the efficiency of the proposed approach with the findings from four available algorithms for feature selection. Comparing the performance of the proposed method with three new feature selection methods based on PSO, ACO, and ABC algorithms on three classifiers showed that the accuracy of the proposed method is on average 0.52% higher than the PSO, 1.20% higher than ACO, and 1.57 higher than the ABC algorithm.

Download Full-text

A Novel Community Detection Based Genetic Algorithm for Feature Selection

10.21203/rs.3.rs-75531/v1 ◽

2020 ◽

Author(s):

Mehrdad Rostami ◽

Kamal Berahmand ◽

Saman Forouzandeh

Keyword(s):

Genetic Algorithm ◽

Feature Selection ◽

Community Detection ◽

Classification Problems ◽

Community Based ◽

New Approach ◽

Detection Algorithms ◽

Repair Operation ◽

New Community ◽

High Dimensional Datasets

Abstract The selection of features is an essential data preprocessing stage in data mining. The core principle of feature selection seems to be to pick a subset of possible features by excluding features with almost no predictive information as well as highly associated redundant features. In the past several years, a variety of meta-heuristic methods were introduced to eliminate redundant and irrelevant features as much as possible from high-dimensional datasets. Among the main disadvantages of present meta-heuristic based approaches is that they are often neglecting the correlation between a set of selected features. In this article, for the purpose of feature selection, the authors propose a genetic algorithm based on community detection, which functions in three steps. The feature similarities are calculated in the first step. The features are classified by community detection algorithms into clusters throughout the second step. In the third step, features are picked by a genetic algorithm with a new community-based repair operation. Nine benchmark classification problems were analyzed in terms of the performance of the presented approach. Also, the authors have compared the efficiency of the proposed approach with the findings from four available algorithms for feature selection. The findings indicate that the new approach continuously yields improved classification accuracy.

Download Full-text

A lazy feature selection method for multi-label classification

Intelligent Data Analysis ◽

10.3233/ida-194878 ◽

2021 ◽

Vol 25 (1) ◽

pp. 21-34

Author(s):

Rafael B. Pereira ◽

Alexandre Plastino ◽

Bianca Zadrozny ◽

Luiz H.C. Merschmann

Keyword(s):

Feature Selection ◽

Text Categorization ◽

Feature Selection Method ◽

Selection Method ◽

Video Classification ◽

Classification Problems ◽

Class Label ◽

New Feature ◽

Feature Selection Techniques ◽

Biomolecular Analysis

In many important application domains, such as text categorization, biomolecular analysis, scene or video classification and medical diagnosis, instances are naturally associated with more than one class label, giving rise to multi-label classification problems. This has led, in recent years, to a substantial amount of research in multi-label classification. More specifically, feature selection methods have been developed to allow the identification of relevant and informative features for multi-label classification. This work presents a new feature selection method based on the lazy feature selection paradigm and specific for the multi-label context. Experimental results show that the proposed technique is competitive when compared to multi-label feature selection techniques currently used in the literature, and is clearly more scalable, in a scenario where there is an increasing amount of data.

Download Full-text

A new feature selection algorithm for two-class classification problems and application to endometrial cancer

2012 IEEE 51st IEEE Conference on Decision and Control (CDC) ◽

10.1109/cdc.2012.6426819 ◽

2012 ◽

Cited By ~ 10

Author(s):

M. Eren Ahsen ◽

Nitin K. Singh ◽

Todd Boren ◽

M. Vidyasagar ◽

Michael A. White

Keyword(s):

Feature Selection ◽

Endometrial Cancer ◽

Selection Algorithm ◽

Feature Selection Algorithm ◽

Classification Problems ◽

New Feature

Download Full-text

A New Feature Selection Technique Applied to Credit Scoring Data Using a Rank Aggregation Approach Based on Optimization, Genetic Algorithm, and Similarity

Knowledge Discovery Process and Methods to Enhance Organizational Performance ◽

10.1201/b18231-21 ◽

2015 ◽

pp. 366-395

Keyword(s):

Genetic Algorithm ◽

Feature Selection ◽

Credit Scoring ◽

Rank Aggregation ◽

Feature Selection Technique ◽

Selection Technique ◽

New Feature

Download Full-text

Isolation-based feature Selection for Unsupervised Outlier Detection

Annual Conference of the PHM Society ◽

10.36001/phmconf.2019.v11i1.824 ◽

2019 ◽

Vol 11 (1) ◽

Author(s):

Qibo Yang ◽

Jaskaran Singh ◽

Jay Lee

Keyword(s):

Feature Selection ◽

Outlier Detection ◽

Simulated Data ◽

Support Vector ◽

Detection Algorithms ◽

Complex Interactions ◽

Laplacian Score ◽

High Dimensional Datasets ◽

Isolation Forest ◽

Unsupervised Outlier Detection

For high-dimensional datasets, bad features and complex interactions between features can cause high computational costs and make outlier detection algorithms inefficient. Most feature selection methods are designed for supervised classification and regression, and limited works are specifically for unsupervised outlier detection. This paper proposes a novel isolation-based feature selection (IBFS) method for unsupervised outlier detection. It is based on the training process of isolation forest. When a point of a feature is used to split the data, the imbalanced distribution of split data is measured and used to quantify how strong this feature can detect outliers. We also compare the proposed method with variance, Laplacian score and kurtosis. These methods are benchmarked on simulated data to show their characteristics. Then we evaluate the performance using one-class support vector machine, isolation forest and local outlier factor on several real-word datasets. The evaluation results show that the proposed method can improve the performance of isolation forest, and its results are similar to and sometimes better than another useful outlier indicator: kurtosis, which demonstrate the effectiveness of the proposed method. We also notice that sometimes variance and Laplacian score has similar performance on the datasets.

Download Full-text

A Master Slave Parallel Genetic Algorithm for Feature Selection in High Dimensional Datasets

International Journal of Recent Technology and Engineering - 2 ◽

10.35940/ijrte.c4184.098319 ◽

2019 ◽

Vol 8 (3) ◽

pp. 379-384

Keyword(s):

Genetic Algorithm ◽

Genetic Algorithms ◽

Feature Selection ◽

Information Gain ◽

Optimal Number ◽

Good Choice ◽

High Dimensional ◽

Parallel Genetic Algorithm ◽

Efficient Manner ◽

High Dimensional Datasets

Feature Selection in High Dimensional Datasets is a combinatorial problem as it selects the optimal subsets from N dimensional data having 2N possible subsets. Genetic Algorithms are generally a good choice for feature selection in large datasets, though for some high dimensional problems it may take varied amount of time - few seconds, few hours or even few days. Therefore, it is important to use Genetic Algorithms that can give quality results in reasonably acceptable time limit. For this purpose, it is becoming necessary to implement Genetic Algorithms in an efficient manner. In this paper, a Master Slave Parallel Genetic Algorithm is implemented as a Feature Selection procedure to diminish the time intricacies of sequential genetic algorithm. This paper describes the speed gains in parallel Master-Slave Genetic Algorithm and also discusses the theoretical analysis of optimal number of slaves required for an efficient master slave implementation. The experiments are performed on three high-dimensional gene expression data. As Genetic Algorithm is a wrapper technique and takes more time to find the importance of any feature, Information Gain technique is used first as pre-processing task to remove the irrelevant features.

Download Full-text

A redundancy-removing feature selection algorithm for nominal data

PeerJ Computer Science ◽

10.7717/peerj-cs.24 ◽

2015 ◽

Vol 1 ◽

pp. e24 ◽

Cited By ~ 1

Author(s):

Zhihua Li ◽

Wenqu Gu

Keyword(s):

Feature Selection ◽

Mutual Information ◽

Feature Selection Method ◽

Selection Method ◽

Selection Algorithm ◽

Nominal Data ◽

New Information ◽

New Feature ◽

High Dimensional Datasets ◽

Experimental Comparisons

No order correlation or similarity metric exists in nominal data, and there will always be more redundancy in a nominal dataset, which means that an efficient mutual information-based nominal-data feature selection method is relatively difficult to find. In this paper, a nominal-data feature selection method based on mutual information without data transformation, called the redundancy-removing more relevance less redundancy algorithm, is proposed. By forming several new information-related definitions and the corresponding computational methods, the proposed method can compute the information-related amount of nominal data directly. Furthermore, by creating a new evaluation function that considers both the relevance and the redundancy globally, the new feature selection method can evaluate the importance of each nominal-data feature. Although the presented feature selection method takes commonly used MIFS-like forms, it is capable of handling high-dimensional datasets without expensive computations. We perform extensive experimental comparisons of the proposed algorithm and other methods using three benchmarking nominal datasets with two different classifiers. The experimental results demonstrate the average advantage of the presented algorithm over the well-known NMIFS algorithm in terms of the feature selection and classification accuracy, which indicates that the proposed method has a promising performance.

Download Full-text

A redundancy-removing feature selection algorithm for nominal data

10.7287/peerj.preprints.1184v1 ◽

2015 ◽

Cited By ~ 1

Author(s):

Zhihua Li

Keyword(s):

Feature Selection ◽

Mutual Information ◽

Feature Selection Method ◽

Selection Method ◽

Selection Algorithm ◽

Nominal Data ◽

New Information ◽

New Feature ◽

High Dimensional Datasets ◽

Experimental Comparisons

Download Full-text

Improving Effectiveness of Intrusion Detection by Correlation Feature Selection

Contemporary Challenges and Solutions for Mobile and Multimedia Technologies ◽

10.4018/978-1-4666-2163-3.ch002 ◽

2012 ◽

pp. 22-35

Author(s):

Hai Thanh Nguyen ◽

Katrin Franke ◽

Slobodan Petrovic

Keyword(s):

Genetic Algorithm ◽

Feature Selection ◽

Linear Programming ◽

Intrusion Detection ◽

Programming Problem ◽

Linear Programming Problem ◽

Selection Procedure ◽

Filter Method ◽

New Feature ◽

Kdd Cup 99

In this paper, the authors propose a new feature selection procedure for intrusion detection, which is based on filter method used in machine learning. They focus on Correlation Feature Selection (CFS) and transform the problem of feature selection by means of CFS measure into a mixed 0-1 linear programming problem with a number of constraints and variables that is linear in the number of full set features. The mixed 0-1 linear programming problem can then be solved by using branch-and-bound algorithm. This feature selection algorithm was compared experimentally with the best-first-CFS and the genetic-algorithm-CFS methods regarding the feature selection capabilities. Classification accuracies obtained after the feature selection by means of the C4.5 and the BayesNet over the KDD CUP’99 dataset were also tested. Experiments show that the authors’ method outperforms the best-first-CFS and the genetic-algorithm-CFS methods by removing much more redundant features while keeping the classification accuracies or getting better performances.

Download Full-text

A Combination of Shuffled Frog-Leaping Algorithm and Genetic Algorithm for Gene Selection

Journal of Advanced Computational Intelligence and Intelligent Informatics ◽

10.20965/jaciii.2008.p0218 ◽

2008 ◽

Vol 12 (3) ◽

pp. 218-226 ◽

Cited By ~ 5

Author(s):

Cheng-San Yang ◽

◽

Li-Yeh Chuang ◽

Chao-Hsuan Ke ◽

Cheng-Hong Yang ◽

...

Keyword(s):

Gene Expression ◽

Genetic Algorithm ◽

Feature Selection ◽

Microarray Data ◽

Classification Accuracy ◽

Expression Profiles ◽

Classification Problems ◽

Shuffled Frog Leaping Algorithm ◽

Shuffled Frog Leaping

Microarray data referencing to gene expression profiles provides valuable answers to a variety of problems, and contributes to advances in clinical medicine. The application of microarray data to the classification of cancer types has recently assumed increasing importance. The classification of microarray data samples involves feature selection, whose goal is to identify subsets of differentially expressed gene potentially relevant for distinguishing sample classes and classifier design. We propose an efficient evolutionary approach for selecting gene subsets from gene expression data that effectively achieves higher accuracy for classification problems. Our proposal combines a shuffled frog-leaping algorithm (SFLA) and a genetic algorithm (GA), and chooses genes (features) related to classification. The K-nearest neighbor (KNN) with leave-one-out cross validation (LOOCV) is used to evaluate classification accuracy. We apply a novel hybrid approach based on SFLA-GA and KNN classification and compare 11 classification problems from the literature. Experimental results show that classification accuracy obtained using selected features was higher than the accuracy of datasets without feature selection.

Download Full-text