Feature Selection using Genetic Programming

2019 ◽  
Vol 3 (2) ◽  
pp. 11-18
Author(s):  
George Mweshi

Extracting useful and novel information from the large amount of collected data has become a necessity for corporations wishing to maintain a competitive advantage. One of the biggest issues in handling these significantly large datasets is the curse of dimensionality: as the dimension of the data increases, the performance of the data mining algorithms employed to mine the data deteriorates. This deterioration is mainly caused by the large search space created by irrelevant, noisy and redundant features in the data. Feature selection is one of the techniques that can be used to remove these unnecessary features; it reduces the dimension of the data as well as the search space, which in turn increases the efficiency and accuracy of the mining algorithms. In this paper, we investigate the ability of Genetic Programming (GP), an evolutionary search strategy capable of automatically finding solutions in complex and large search spaces, to perform feature selection. We implement a basic GP algorithm and perform feature selection on 5 benchmark classification datasets from the UCI repository. To test the competitiveness and feasibility of the GP approach, we examine the classification performance of four classifiers, namely J48, Naive Bayes, PART, and Random Forests, using the GP-selected features, all the original features, and the features selected by other commonly used feature selection techniques, i.e., principal component analysis, information gain, ReliefF and CFS. The experimental results show that not only does GP select a smaller set of features than the original set, but classifiers using the GP-selected features also achieve better classification performance than with all the original features. Furthermore, compared to the other well-known feature selection techniques, GP achieves very competitive results.
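As a rough illustration of the wrapper-style evolutionary search described above, the sketch below evolves a population of boolean feature masks, scoring each by cross-validated accuracy. It is a minimal GA-style simplification of the idea, not the paper's tree-based GP; the dataset, population size, generations, and mutation rate are arbitrary stand-ins.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
X, y = load_breast_cancer(return_X_y=True)  # stand-in for a UCI benchmark
n_features = X.shape[1]

def fitness(mask):
    # Wrapper fitness: cross-validated accuracy on the selected columns only.
    if not mask.any():
        return 0.0
    clf = DecisionTreeClassifier(random_state=0)
    return cross_val_score(clf, X[:, mask], y, cv=3).mean()

pop = rng.random((20, n_features)) < 0.5  # population of random feature masks
for generation in range(10):
    scores = np.array([fitness(ind) for ind in pop])
    survivors = pop[np.argsort(scores)[-10:]]     # keep the better half
    children = survivors.copy()
    children ^= rng.random(children.shape) < 0.1  # bit-flip mutation
    pop = np.vstack([survivors, children])

best = pop[np.argmax([fitness(ind) for ind in pop])]
print(f"selected {best.sum()} of {n_features} features")
```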

2018 ◽  
Vol 150 ◽  
pp. 06006 ◽  
Author(s):  
Rozlini Mohamed ◽  
Munirah Mohd Yusof ◽  
Noorhaniza Wahidi

Feature selection is the process of selecting the best features from among the huge number of features in a dataset. The core problem is to select a subset that performs well under a given classifier. To produce better classification results, feature selection is applied in many classification tasks as part of the preprocessing step, where only a subset of features is used rather than the full feature set of a particular dataset. This procedure not only removes irrelevant features but in some cases can also increase classification performance due to the finite sample size. In this study, Chi-Square (CH), Information Gain (IG) and the Bat Algorithm (BA) are used to obtain feature subsets on fourteen well-known datasets from various applications. To measure the performance of the selected features, three benchmark classifiers are used: k-Nearest Neighbor (kNN), Naïve Bayes (NB) and Decision Tree (DT). The paper then analyzes the performance of all classifiers with feature selection in terms of accuracy, sensitivity, F-measure and ROC. The objective of the study is to determine which feature selection techniques, conventional or heuristic, perform best across various applications.
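A minimal sketch of the filter-then-classify protocol described above, using scikit-learn: chi2 for CH, mutual_info_classif as a common stand-in for IG, and the three benchmark classifiers. The Bat Algorithm is a metaheuristic with no scikit-learn implementation, so it is omitted here; the dataset and k are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # stand-in for one of the fourteen datasets

selectors = {
    "CH": SelectKBest(chi2, k=2),                 # requires non-negative features
    "IG": SelectKBest(mutual_info_classif, k=2),  # information-gain-style filter
}
classifiers = {
    "kNN": KNeighborsClassifier(),
    "NB": GaussianNB(),
    "DT": DecisionTreeClassifier(random_state=0),
}

for s_name, selector in selectors.items():
    X_sel = selector.fit_transform(X, y)
    for c_name, clf in classifiers.items():
        acc = cross_val_score(clf, X_sel, y, cv=5).mean()
        print(f"{s_name} + {c_name}: accuracy {acc:.3f}")
```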


2021 ◽  
Vol 15 (8) ◽  
pp. 912-926
Author(s):  
Ge Zhang ◽  
Pan Yu ◽  
Jianlin Wang ◽  
Chaokun Yan

Background: Rapid developments in various bioinformatics technologies have led to the accumulation of a large amount of biomedical data. However, these datasets usually involve thousands of features and include much irrelevant or redundant information, which can confuse diagnosis. Feature selection, which consists of finding the optimal feature subset, is a solution; it is known to be an NP-hard problem because of the large search space. Objective: To address this issue, this paper proposes a hybrid feature selection method, called IGICRO, based on an improved chemical reaction optimization algorithm (ICRO) and an information gain (IG) approach. Methods: IG is adopted to obtain important features. A neighborhood search mechanism is combined with ICRO to increase the diversity of the population and improve the local search capacity. Results: Experimental results on eight publicly available datasets demonstrate that our proposed approach outperforms the original CRO and other state-of-the-art approaches.
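For concreteness, the IG filter stage can be computed as IG(Y; X) = H(Y) - H(Y|X) for each discrete feature X against the class labels Y. The sketch below is a plain NumPy version of that computation; the ICRO search itself is not specified in the abstract, so it is not reproduced here.

```python
import numpy as np

def entropy(labels):
    # Shannon entropy of a discrete label vector, in bits.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(feature, labels):
    # IG(Y; X) = H(Y) - H(Y | X) for a discrete feature X.
    h_cond = 0.0
    for v in np.unique(feature):
        subset = labels[feature == v]
        h_cond += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - h_cond

# Toy check: a perfectly informative feature vs. an uninformative one.
y = np.array([0, 0, 1, 1])
print(information_gain(np.array([0, 0, 1, 1]), y))  # 1.0 bit
print(information_gain(np.array([0, 1, 0, 1]), y))  # 0.0 bits
```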


Information ◽  
2019 ◽  
Vol 10 (6) ◽  
pp. 187
Author(s):  
Rattanawadee Panthong ◽  
Anongnart Srivihok

Liver cancer data typically form large, multidimensional datasets. A dataset with a huge number of features and multiple classes may contain much that is irrelevant to pattern classification in machine learning; hence, feature selection improves the performance of the classification model and helps it achieve maximum classification accuracy. The aims of the present study were to find the best feature subset and to evaluate the classification performance of the predictive model. This paper proposes a hybrid feature selection approach that combines information gain and sequential forward selection based on the class-dependent technique (IGSFS-CD) for a liver cancer classification model. Two different classifiers (decision tree and naïve Bayes) were used to evaluate feature subsets. The liver cancer datasets were obtained from the Cancer Hospital Thailand database. Three ensemble methods (ensemble classifiers, bagging, and AdaBoost) were applied to improve the performance of classification. The IGSFS-CD method achieved a good accuracy of 78.36% (sensitivity 0.7841 and specificity 0.9159) on LC_dataset-1. In addition, LC_dataset II delivered the best performance, with an accuracy of 84.82% (sensitivity 0.8481 and specificity 0.9437). The IGSFS-CD method achieved better classification performance than the class-independent method. Furthermore, selecting the best feature subset helps reduce the complexity of the predictive model.
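A hedged sketch of the two-stage IGSFS idea (an information-gain filter followed by sequential forward selection) using scikit-learn. The class-dependent variant and the hospital datasets are not publicly available, so a standard class-independent SFS and a public dataset stand in; k and the subset size are arbitrary.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import (SelectKBest, SequentialFeatureSelector,
                                       mutual_info_classif)
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)  # public stand-in for the hospital data

# Stage 1: information-gain-style filter keeps the 15 highest-scoring features.
X_ig = SelectKBest(mutual_info_classif, k=15).fit_transform(X, y)

# Stage 2: sequential forward selection wrapped around each classifier.
for name, clf in [("DT", DecisionTreeClassifier(random_state=0)), ("NB", GaussianNB())]:
    sfs = SequentialFeatureSelector(clf, n_features_to_select=5,
                                    direction="forward", cv=3)
    X_sfs = sfs.fit_transform(X_ig, y)
    acc = cross_val_score(clf, X_sfs, y, cv=5).mean()
    print(f"{name}: accuracy {acc:.3f} with {X_sfs.shape[1]} features")
```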


2019 ◽  
Vol 16 (2) ◽  
pp. 682-686 ◽  
Author(s):  
Shunmuga N. Karpagam ◽  
Srinivasa Raghavan

Alzheimer's disease (AD) is perhaps the most common form of dementia affecting the elderly all over the world. Feature selection techniques are often used in domains where many features but only a few data samples are available. In this work, brain images are classified as Alzheimer's, MCI or normal. Image features are extracted with SPM12, and the best features are chosen using feature selection techniques such as Information Gain (IG) and Maximal Relevance and Minimal Redundancy (MRMR), along with the proposed Artificial Bee Colony (ABC) approach. ABC is an optimization algorithm inspired by the way honey bees search for the best food sources. The selected features are classified using classifiers such as LogitBoost and a cascaded classifier.
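A simplified sketch of ABC-style subset search: employed bees perturb candidate feature subsets (food sources) and scout bees abandon sources that stop improving; the onlooker phase is omitted for brevity. The dataset, colony size, and abandonment limit are illustrative assumptions, not the paper's settings.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X, y = load_digits(return_X_y=True)  # stand-in for SPM12-extracted brain features
n_feat, n_bees, limit = X.shape[1], 10, 5

def fitness(mask):
    if not mask.any():
        return 0.0
    return cross_val_score(KNeighborsClassifier(), X[:, mask], y, cv=3).mean()

food = rng.random((n_bees, n_feat)) < 0.3  # food sources = candidate subsets
scores = np.array([fitness(m) for m in food])
trials = np.zeros(n_bees, dtype=int)

for _ in range(20):
    for i in range(n_bees):
        # Employed bee: flip two random bits to explore a neighbouring subset.
        cand = food[i].copy()
        cand[rng.integers(0, n_feat, size=2)] ^= True
        s = fitness(cand)
        if s > scores[i]:
            food[i], scores[i], trials[i] = cand, s, 0
        else:
            trials[i] += 1
    # Scout bees: abandon and re-seed sources that stopped improving.
    for i in np.where(trials > limit)[0]:
        food[i] = rng.random(n_feat) < 0.3
        scores[i], trials[i] = fitness(food[i]), 0

best = food[scores.argmax()]
print(f"best subset: {best.sum()} features, CV accuracy {scores.max():.3f}")
```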


2014 ◽  
Vol 988 ◽  
pp. 511-516 ◽  
Author(s):  
Jin Tao Shi ◽  
Hui Liang Liu ◽  
Yuan Xu ◽  
Jun Feng Yan ◽  
Jian Feng Xu

Machine learning is an important approach in research on Chinese text sentiment categorization, and text feature selection is critical to classification performance. Classical feature selection methods work well on the global set of categories but miss many representative feature words of individual categories. This paper presents an improved information gain method that integrates word frequency and the sentiment degree of feature words into the traditional information gain measure. Experiments show that a classifier improved by this method achieves better classification performance.
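The abstract does not give the exact weighting formula, so the sketch below shows one plausible reading: classic document-level information gain for a term, scaled by its average in-document frequency and a sentiment-strength score. The weighted_ig combination is a hypothetical illustration, not the paper's formula.

```python
import numpy as np
from collections import Counter

def entropy(probs):
    probs = np.asarray([p for p in probs if p > 0], dtype=float)
    return -(probs * np.log2(probs)).sum()

def information_gain(docs, labels, term):
    # Document-level IG: presence/absence of `term` vs. the class label.
    present = np.array([term in doc for doc in docs])
    labels = np.asarray(labels)
    def class_dist(mask):
        counts = Counter(labels[mask])
        total = mask.sum()
        return [c / total for c in counts.values()] if total else []
    h_y = entropy(class_dist(np.ones(len(docs), dtype=bool)))
    p_t = present.mean()
    h_cond = p_t * entropy(class_dist(present)) + (1 - p_t) * entropy(class_dist(~present))
    return h_y - h_cond

def weighted_ig(docs, labels, term, sentiment_score):
    # Hypothetical improvement in the spirit of the paper: scale IG by the
    # term's average frequency and a sentiment-strength score in [0, 1].
    tf = np.mean([doc.count(term) for doc in docs])
    return information_gain(docs, labels, term) * (1 + tf) * sentiment_score

docs = [["great", "movie"], ["great", "plot"], ["bad", "acting"], ["bad", "movie"]]
labels = [1, 1, 0, 0]
print(weighted_ig(docs, labels, "great", sentiment_score=0.9))  # 1.35
```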


2021 ◽  
Author(s):  
B Tran ◽  
Bing Xue ◽  
Mengjie Zhang

With a global search mechanism, particle swarm optimization (PSO) has shown promise in feature selection (FS). However, most current PSO-based FS methods use a fixed-length representation, which is inflexible and limits the performance of PSO for FS. When applied to high-dimensional data, these methods not only consume a significant amount of memory but also incur a high computational cost. Overcoming this limitation enables PSO to work on data of much higher dimensionality, which has become increasingly common with the advance of data collection technologies. In this paper, we propose the first variable-length PSO representation for FS, enabling particles to have different and shorter lengths, which defines a smaller search space and therefore improves the performance of PSO. By rearranging features in descending order of their relevance, we make it easier for particles with shorter lengths to achieve better classification performance. Furthermore, using the proposed length-changing mechanism, PSO can jump out of local optima, further narrow the search space, and focus its search on a smaller and more fruitful area. These strategies enable PSO to reach better solutions in a shorter time. Results on ten high-dimensional datasets of varying difficulty show that the proposed variable-length PSO can achieve much smaller feature subsets with significantly higher classification performance in much shorter time than the fixed-length PSO methods. The proposed method also outperformed the compared non-PSO FS methods in most cases.
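The sketch below illustrates only the representation idea, not a full PSO with velocity updates: features are ranked by relevance so that particles with shorter position vectors search prefixes of the ranking. The relevance measure, threshold, and particle lengths are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
X, y = load_breast_cancer(return_X_y=True)  # stand-in for a high-dimensional dataset

# Rank features in descending order of relevance so that short particles
# cover the most promising features first, as the paper describes.
order = np.argsort(mutual_info_classif(X, y, random_state=0))[::-1]
X = X[:, order]

def evaluate(position):
    # A particle's position covers only its own (possibly short) prefix of
    # the ranked features; values above 0.6 mean "select this feature".
    mask = position > 0.6
    if not mask.any():
        return 0.0
    X_prefix = X[:, :len(position)]
    return cross_val_score(KNeighborsClassifier(), X_prefix[:, mask], y, cv=3).mean()

# Particles of different lengths search differently sized subspaces.
for particle in (rng.random(length) for length in (5, 10, 20, X.shape[1])):
    print(f"length {len(particle):2d}: fitness {evaluate(particle):.3f}")
```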


2021 ◽  
Author(s):  
B Tran ◽  
Bing Xue ◽  
Mengjie Zhang

In machine learning, discretization and feature selection (FS) are important techniques for preprocessing data to improve the performance of an algorithm on high-dimensional data. Since many FS methods require discrete data, a common practice is to apply discretization before FS. In addition, for the sake of efficiency, features are usually discretized individually (univariate discretization). This scheme rests on the assumption that each feature influences the task independently, which may not hold when feature interactions exist. Univariate discretization may therefore degrade the performance of the FS stage, since information about feature interactions may be lost during discretization. Initial results of our previously proposed method, evolve particle swarm optimization (EPSO), showed that combining discretization and FS in a single stage using bare-bones particle swarm optimization (BBPSO) can lead to better performance than applying them in two separate stages. In this paper, we propose a new method called potential particle swarm optimization (PPSO), which employs a new representation that reduces the search space of the problem and a new fitness function that better evaluates candidate solutions to guide the search. Results on ten high-dimensional datasets show that PPSO selects fewer than 5% of the features on every dataset. Compared with the two-stage approach that uses BBPSO for FS on the discretized data, PPSO achieves significantly higher accuracy on seven datasets. In addition, PPSO obtains better (or similar) classification performance than EPSO on eight datasets, with a smaller number of selected features on six datasets. Furthermore, PPSO also outperforms the three compared traditional methods and performs similarly to one method on most datasets, in terms of both generalization ability and learning capacity.
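For contrast, below is a sketch of the two-stage baseline that PPSO is compared against: univariate discretization followed by univariate feature selection, expressed as a scikit-learn pipeline. Bin count and k are arbitrary choices; the point is that binning each feature independently can discard feature interactions before the selector ever sees them.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer

X, y = load_breast_cancer(return_X_y=True)

# Two-stage baseline: discretize each feature on its own, then select.
pipe = make_pipeline(
    KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="uniform"),
    SelectKBest(mutual_info_classif, k=10),
    GaussianNB(),
)
print("two-stage accuracy:", cross_val_score(pipe, X, y, cv=5).mean())
```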


2021 ◽  
Author(s):  
Binh Tran ◽  
Bing Xue ◽  
Mengjie Zhang

Classification on high-dimensional data with thousands to tens of thousands of dimensions is a challenging task due to the high dimensionality and the quality of the feature set. The problem can be addressed by using feature selection to choose only informative features, or feature construction to create new high-level features. Genetic programming (GP) using a tree-based representation can be used for both feature construction and implicit feature selection. This work presents a comprehensive study investigating the use of GP for feature construction and selection on high-dimensional classification problems. Different combinations of the constructed and/or selected features are tested and compared on seven high-dimensional gene expression problems, and different classification algorithms are used to evaluate their performance. The results show that the constructed and/or selected feature sets can significantly reduce the dimensionality and maintain or even increase the classification accuracy in most cases. Cases where overfitting occurred are analysed via the distribution of features, and further analysis shows why the constructed features can achieve promising classification performance. This is a post-peer-review, pre-copyedit version of an article published in 'Memetic Computing'; the final authenticated version is available online at https://doi.org/10.1007/s12293-015-0173-y.
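A small sketch of GP-based feature construction using the third-party gplearn library (an assumption for illustration; the paper implements its own GP). Evolved expression trees become constructed features, and the features they reference form an implicitly selected subset; comparing classifier accuracy on original versus constructed features mirrors the evaluation protocol described above.

```python
from gplearn.genetic import SymbolicTransformer  # third-party: pip install gplearn
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)  # stand-in for a gene expression dataset

# Evolve tree-shaped expressions over the original features; each surviving
# tree is one constructed high-level feature.
gp = SymbolicTransformer(generations=10, population_size=500, n_components=5,
                         function_set=("add", "sub", "mul", "div"),
                         random_state=0)
X_new = gp.fit_transform(X, y)

clf = GaussianNB()
print("original   :", cross_val_score(clf, X, y, cv=5).mean())
print("constructed:", cross_val_score(clf, X_new, y, cv=5).mean())
```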

