Feature Selection Algorithm Using Relative Odds for Data Mining Classification

Author(s):  
Donald Douglas Atsa'am

A filter feature selection algorithm is developed and its performance tested. In the initial step, the algorithm dichotomizes the dataset, then separately computes the association between each predictor and the class variable using relative odds (odds ratios). The value of the odds ratio becomes the importance ranking of the corresponding explanatory variable in determining the output. Logistic regression classification is deployed to test the performance of the new algorithm against three existing feature selection algorithms: the Fisher index, Pearson's correlation, and the varImp function. Across a number of experimental datasets, the subsets selected by the new algorithm in most cases produced models with higher classification accuracy than the subsets suggested by the existing feature selection algorithms. The proposed algorithm is therefore a reliable alternative for filter feature selection in binary classification problems.
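To make the ranking step concrete, here is a minimal Python sketch of an odds-ratio filter: each continuous predictor is dichotomized at its median, cross-tabulated against the binary class, and scored by its odds ratio. The median split, the `eps` continuity correction, and all names are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np

def odds_ratio_ranking(X, y, eps=0.5):
    """Rank features by the odds ratio between each median-dichotomized
    predictor and a binary class label (eps is a continuity correction)."""
    scores = []
    for j in range(X.shape[1]):
        hi = X[:, j] > np.median(X[:, j])        # dichotomize the predictor
        a = np.sum(hi & (y == 1)) + eps          # high value, positive class
        b = np.sum(hi & (y == 0)) + eps
        c = np.sum(~hi & (y == 1)) + eps
        d = np.sum(~hi & (y == 0)) + eps
        scores.append((a * d) / (b * c))         # odds ratio; |log(OR)| would
    return np.argsort(scores)[::-1]              # also catch protective features

# toy usage with synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = (X[:, 2] > 0).astype(int)                    # class driven by feature 2
print(odds_ratio_ranking(X, y))                  # feature 2 should rank first
```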

2013
Vol 22 (04)
pp. 1350027
Author(s):
Jaganathan Palanichamy
Kuppuchamy Ramasamy

Feature selection is essential in data mining and pattern recognition, especially for database classification. In past years, several feature selection algorithms have been proposed to measure the relevance of various features to each class. A suitable feature selection algorithm normally maximizes the relevancy and minimizes the redundancy of the selected features. The mutual information measure can successfully estimate the dependency of features on the entire sampling space, but it cannot exactly represent the redundancies among features. In this paper, a novel feature selection algorithm is proposed based on the maximum relevance and minimum redundancy criterion. Mutual information is used to measure the relevancy of each feature with the class variable, and redundancy is calculated by utilizing the relationship between candidate features, selected features, and the class variable. The effectiveness is tested on ten benchmark datasets from the UCI Machine Learning Repository. The experimental results show better performance when compared with some existing algorithms.
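The criterion lends itself to a greedy search: at each step, pick the candidate feature with the best relevance-minus-redundancy score. The sketch below uses scikit-learn's mutual information estimators with a mean-pairwise redundancy term; this is a generic mRMR-style approximation, not the paper's exact formulation.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

def greedy_mrmr(X, y, k):
    """Greedily pick k features maximizing relevance minus mean redundancy."""
    relevance = mutual_info_classif(X, y, random_state=0)  # I(feature; class)
    selected, candidates = [], list(range(X.shape[1]))
    for _ in range(k):
        def score(j):
            red = (np.mean([mutual_info_regression(X[:, [s]], X[:, j],
                                                   random_state=0)[0]
                            for s in selected])
                   if selected else 0.0)          # mean MI with chosen features
            return relevance[j] - red
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 8))
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=150)   # feature 1 duplicates feature 0
y = (X[:, 0] + X[:, 4] > 0).astype(int)
print(greedy_mrmr(X, y, k=3))                     # should avoid picking both 0 and 1
```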


Author(s):  
Qi-Guang Miao
Ying Cao
Jian-Feng Song
Jiachen Liu
Yining Quan

In a learning process, features play a fundamental role. In this paper, we propose a Boosting-based feature selection algorithm called BoostFS. It extends AdaBoost, which is designed for classification problems, to feature selection. BoostFS maintains a distribution over training samples that is initialized from the uniform distribution. In each iteration, a decision stump is trained under the sample distribution, and the sample distribution is then adjusted to be orthogonal to the classification results of all the generated stumps. Because a decision stump can also be regarded as one selected feature, BoostFS is capable of selecting a subset of features that are as mutually irrelevant as possible. Experimental results on synthetic datasets, five UCI datasets, and a real malware detection dataset all show that the features selected by BoostFS help to improve learning algorithms in classification problems, especially when the original feature set contains redundant features.
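A loose approximation of the idea can be built from off-the-shelf AdaBoost: boost decision stumps and collect the feature each stump splits on. Note this uses standard AdaBoost reweighting, not the orthogonal distribution update that distinguishes BoostFS, and all names below are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=20,
                           n_informative=5, random_state=0)
booster = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # decision stumps
    n_estimators=30, random_state=0).fit(X, y)      # ("base_estimator" in scikit-learn < 1.2)

# Each fitted stump splits on one feature; their union is the selected subset.
selected = sorted({s.tree_.feature[0] for s in booster.estimators_
                   if s.tree_.feature[0] >= 0})     # -2 marks an unsplit stump
print(selected)
```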


2017
Vol 16 (05)
pp. 1309-1338
Author(s):
Pin Wang
Yongming Li
Bohan Chen
Xianling Hu
Jin Yan
...  

Feature selection is an important research field for pattern classification, data mining, etc. Population-based optimization algorithms (POA) have high parallelism and are widely used as search algorithms for feature selection. Population-based feature selection algorithms (PFSA) involve a compromise between precision and time cost; to optimize them, the underlying feature selection models need to be improved. Feature selection algorithms broadly fall into two categories: the filter model and the wrapper model. The filter model is fast but less precise, while the wrapper model is more precise but generally computationally more intensive. In this paper, we propose a new mechanism, the proportional hybrid mechanism (PHM), to combine the advantages of the filter and wrapper models. The mechanism can be applied in PFSA to improve their performance. The genetic algorithm (GA) has been applied to many kinds of feature selection problems as a search algorithm because of its high efficiency and implicit parallelism, so GAs are used in this paper. To validate the mechanism, seven datasets from the University of California Irvine (UCI) database and artificial toy datasets are tested. The experiments are carried out for different GAs, classifiers, and evaluation criteria; the results show that, with the introduction of PHM, the GA-based feature selection algorithm improves in both time cost and classification accuracy. Moreover, the comparison of GA-based, PSO-based, and some other feature selection algorithms demonstrates that PHM can be used in other population-based feature selection algorithms and obtain satisfying results.
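One plausible reading of a filter/wrapper hybrid fitness is sketched below: each GA chromosome (a binary feature mask) receives a cheap filter score blended with a costly cross-validation wrapper score. The blending weight `alpha`, the classifier, and all names are assumptions for illustration; the paper's proportioning rule may differ.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def hybrid_fitness(mask, X, y, relevance, alpha=0.5):
    """Blend a cheap filter term with a costly wrapper term for one
    GA chromosome (mask is a boolean feature-inclusion vector)."""
    if not mask.any():
        return 0.0
    filter_term = relevance[mask].mean() / max(relevance.max(), 1e-12)  # fast
    wrapper_term = cross_val_score(KNeighborsClassifier(),              # slow
                                   X[:, mask], y, cv=3).mean()
    return alpha * wrapper_term + (1 - alpha) * filter_term

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 10))
y = (X[:, 0] + X[:, 3] > 0).astype(int)
relevance = mutual_info_classif(X, y, random_state=0)
mask = rng.integers(0, 2, 10).astype(bool)        # one random chromosome
print(hybrid_fitness(mask, X, y, relevance))
```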


2021
Vol 2021
pp. 1-10
Author(s):  
Li Zhang

Feature selection is the key step in the analysis of high-dimensional small-sample data. The core of feature selection is to analyse and quantify the correlation between features and class labels and the redundancy between features. However, most existing feature selection algorithms consider only the classification contribution of individual features and ignore the influence of interfeature redundancy and correlation. Therefore, this paper proposes a nonlinear dynamic conditional relevance feature selection algorithm (NDCRFS), built on a study and analysis of existing feature selection ideas and methods. Firstly, redundancy and relevance between features, and between features and class labels, are discriminated by mutual information, conditional mutual information, and interactive mutual information. Secondly, the selected features and candidate features are dynamically weighted utilizing information gain factors. Finally, to evaluate its performance, NDCRFS was validated against six other feature selection algorithms on three classifiers, using 12 different data sets, comparing variability and classification metrics between the algorithms. The experimental results show that the NDCRFS method can improve the quality of the feature subsets and obtain better classification results.
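The conditional mutual information term that NDCRFS builds on can be estimated directly for discrete variables. Below is a minimal, self-contained estimator of I(X; Y | Z) in nats; the dynamic weighting with information gain factors is specific to the paper and not reproduced here.

```python
import numpy as np
from collections import Counter

def cmi(x, y, z):
    """I(X; Y | Z) = sum p(x,y,z) * log[ p(z) p(x,y,z) / (p(x,z) p(y,z)) ]."""
    n = len(x)
    pxyz, pxz = Counter(zip(x, y, z)), Counter(zip(x, z))
    pyz, pz = Counter(zip(y, z)), Counter(z)
    total = 0.0
    for (xi, yi, zi), c in pxyz.items():
        p = c / n
        total += p * np.log(p * (pz[zi] / n) /
                            ((pxz[(xi, zi)] / n) * (pyz[(yi, zi)] / n)))
    return total

rng = np.random.default_rng(0)
z = rng.integers(0, 2, 5000)
n1 = (rng.random(5000) < 0.25).astype(int)
n2 = (rng.random(5000) < 0.25).astype(int)
x, y = z ^ n1, z ^ n2            # x and y depend on each other only through z
print(round(cmi(x, y, z), 4))    # ~0: independent once z is known
print(round(cmi(x, z, n2), 4))   # >0: x depends on z given the irrelevant n2
```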


2021
Vol 10 (6)
pp. 3501-3506
Author(s):
S. J. Sushma
Tsehay Admassu Assegie
D. C. Vinutha
S. Padmashree

Irrelevant features in a heart disease dataset affect the performance of a binary classification model. Consequently, eliminating irrelevant and redundant features from the training set with a feature selection algorithm significantly improves the performance of the classification model on heart disease detection. Sequential feature selection (SFS) is a successful algorithm for improving the performance of the classification model on heart disease detection while reducing computational time complexity. In this study, the SFS algorithm is implemented to improve classifier performance on heart disease detection by removing irrelevant features and training a model on the optimal features. Furthermore, exhaustive and permutation-based feature selection algorithms are implemented and compared with the SFS algorithm. The implemented and existing feature selection algorithms are evaluated using the real-world Pima Indian heart disease dataset, and the results show that the SFS algorithm outperforms the exhaustive and permutation-based feature selection algorithms. Overall, the results look promising, and a more effective heart disease detection model is developed with an accuracy of 99.3%.
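For reference, scikit-learn ships a generic sequential feature selector; the sketch below runs greedy forward selection with a logistic regression scorer on a stand-in dataset, not the study's heart disease data.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
sfs = SequentialFeatureSelector(
    make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    n_features_to_select=5,   # stop after five features
    direction='forward',      # greedy forward selection
    cv=5)                     # candidate sets scored by cross-validation
sfs.fit(X, y)
print(sfs.get_support(indices=True))  # indices of the selected features
```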


Entropy
2021
Vol 23 (6)
pp. 704
Author(s):
Jiucheng Xu
Kanglin Qu
Meng Yuan
Jie Yang

Feature selection is one of the core contents of rough set theory and its applications. Since the reduction ability and classification performance of many feature selection algorithms based on rough set theory and its extensions are not ideal, this paper proposes a feature selection algorithm that combines the information-theoretic view and the algebraic view in the neighborhood decision system. First, the neighborhood relationship in the neighborhood rough set model is used to retain the classification information of continuous data and to study some uncertainty measures of neighborhood information entropy. Second, to fully reflect the decision ability and classification performance of the neighborhood system, neighborhood credibility and neighborhood coverage are defined and introduced into the neighborhood joint entropy. Third, a feature selection algorithm based on neighborhood joint entropy is designed, remedying the shortcoming that most feature selection algorithms consider only the information-theoretic definition or the algebraic definition. Finally, experiments and statistical analyses on nine data sets show that the algorithm can effectively select the optimal feature subset, and the selection result can maintain or improve the classification performance of the data set.
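The delta-neighborhood construction underlying these entropy measures is easy to sketch: normalize the chosen features, count each sample's neighbors within radius delta, and aggregate the neighborhood sizes into an entropy. The credibility and coverage weighting of the paper's joint entropy is not reproduced; `delta` and all names are illustrative.

```python
import numpy as np
from scipy.spatial.distance import cdist

def neighborhood_entropy(X, features, delta=0.2):
    """-(1/n) * sum_i log(|N_delta(x_i)| / n), with neighborhoods taken
    under the candidate feature subset after min-max normalization."""
    Xs = X[:, features].astype(float)
    Xs = (Xs - Xs.min(axis=0)) / (np.ptp(Xs, axis=0) + 1e-12)
    n = len(Xs)
    sizes = (cdist(Xs, Xs) <= delta).sum(axis=1)  # |N_delta(x_i)|, includes self
    return -np.mean(np.log(sizes / n))

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
print(neighborhood_entropy(X, [0, 1]))      # adding features shrinks
print(neighborhood_entropy(X, [0, 1, 2]))   # neighborhoods, raising entropy
```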


Algorithms
2021
Vol 14 (3)
pp. 100
Author(s):
Werner Mostert
Katherine M. Malan
Andries P. Engelbrecht

This study presents a novel performance metric for feature selection algorithms that is unbiased and can be used for comparative analysis across feature selection problems. The baseline fitness improvement (BFI) measure quantifies the potential value gained by applying feature selection. It can be used to compare the performance of feature selection algorithms across datasets by measuring the change in classifier performance resulting from feature selection, with respect to the baseline where all features are included. Empirical results show that there is performance complementarity for a suite of feature selection algorithms on a variety of real-world datasets. Because BFI is a normalised performance metric, it can be used to correlate problem characteristics with feature selection algorithm performance across multiple datasets. This ability paves the way towards describing the performance space of the per-instance algorithm selection problem for feature selection algorithms.
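In its simplest reading, a BFI-style quantity is the cross-validated performance of a classifier on the selected subset minus its performance on the full feature set. The sketch below computes that difference; the paper's exact normalisation may differ, and the subset here is an arbitrary example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
clf = KNeighborsClassifier()

baseline = cross_val_score(clf, X, y, cv=5).mean()     # all features included
selected = [0, 3, 7, 20, 27]                           # arbitrary example subset
subset = cross_val_score(clf, X[:, selected], y, cv=5).mean()

bfi = subset - baseline   # positive: selection improved on the baseline
print(f"baseline={baseline:.3f}  subset={subset:.3f}  BFI={bfi:+.3f}")
```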


Author(s):  
Yuanyuan Han
Lan Huang
Fengfeng Zhou

Abstract
Motivation: A feature selection algorithm may select the subset of features with the best associations with the class labels. Recursive feature elimination (RFE) is a heuristic feature screening framework and has been widely used to select biological OMIC biomarkers. This study proposes a dynamic recursive feature elimination (dRFE) framework with more flexible feature elimination operations. The proposed dRFE was comprehensively compared with 11 existing feature selection algorithms and five classifiers on eight difficult transcriptome datasets from a previous study, ten newly collected transcriptome datasets, and five methylome datasets.
Results: The experimental data suggest that the regular RFE framework did not perform well, and dRFE outperformed the existing feature selection algorithms in most cases. The dRFE-detected features achieved Acc = 1.0000 for the two methylome datasets GSE53045 and GSE66695. The best prediction accuracies of the dRFE-detected features were 0.9259, 0.9424 and 0.8601 for the other three methylome datasets GSE74845, GSE103186 and GSE80970, respectively. Four transcriptome datasets received Acc = 1.0000 using the dRFE-detected features, and the prediction accuracies for the other six newly collected transcriptome datasets were between 0.6301 and 0.9917.
Availability and implementation: The experiments in this study are implemented and tested using Python version 3.7.6.
Supplementary information: Supplementary data are available at Bioinformatics online.
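For orientation, standard RFE, the framework dRFE generalises, is available in scikit-learn; the sketch below eliminates one feature per round with a linear model, whereas dRFE's more flexible dynamic elimination operations are not part of this API.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=50,
                           n_informative=8, random_state=0)
rfe = RFE(LogisticRegression(max_iter=5000),
          n_features_to_select=8,   # stop once eight features remain
          step=1)                   # eliminate one feature per round
rfe.fit(X, y)
print(rfe.get_support(indices=True))  # indices of the surviving features
```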

