Feature Selection Algorithm Using Relative Odds for Data Mining Classification

Author(s):  
Donald Douglas Atsa'am

A filter feature selection algorithm is developed and its performance tested. In the initial step, the algorithm dichotomizes the dataset, then separately computes the association between each predictor and the class variable using relative odds (odds ratios). The value of the odds ratio becomes the importance ranking of the corresponding explanatory variable in determining the output. Logistic regression classification is deployed to test the performance of the new algorithm against three existing feature selection algorithms: the Fisher index, Pearson's correlation, and the varImp function. Across a number of experimental datasets, the subsets selected by the new algorithm in most cases produced models with higher classification accuracy than the subsets suggested by the existing feature selection algorithms. The proposed algorithm is therefore a reliable alternative for filter feature selection in binary classification problems.
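To make the ranking step concrete, here is a minimal Python sketch of an odds-ratio filter: each continuous predictor is dichotomized at its median, cross-tabulated against the binary class, and scored by its odds ratio. The median split, the `eps` continuity correction, and all names are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np

def odds_ratio_ranking(X, y, eps=0.5):
    """Rank features by the odds ratio between each median-dichotomized
    predictor and a binary class label (eps is a continuity correction)."""
    scores = []
    for j in range(X.shape[1]):
        hi = X[:, j] > np.median(X[:, j])        # dichotomize the predictor
        a = np.sum(hi & (y == 1)) + eps          # high value, positive class
        b = np.sum(hi & (y == 0)) + eps
        c = np.sum(~hi & (y == 1)) + eps
        d = np.sum(~hi & (y == 0)) + eps
        scores.append((a * d) / (b * c))         # odds ratio; |log(OR)| would
    return np.argsort(scores)[::-1]              # also catch protective features

# toy usage with synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = (X[:, 2] > 0).astype(int)                    # class driven by feature 2
print(odds_ratio_ranking(X, y))                  # feature 2 should rank first
```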

2013
Vol 22 (04)
pp. 1350027
Author(s):
Jaganathan Palanichamy
Kuppuchamy Ramasamy

Feature selection is essential in data mining and pattern recognition, especially for database classification. In past years, several feature selection algorithms have been proposed to measure the relevance of various features to each class. A suitable feature selection algorithm normally maximizes the relevancy and minimizes the redundancy of the selected features. The mutual information measure can successfully estimate the dependency of features on the entire sampling space, but it cannot exactly represent the redundancies among features. In this paper, a novel feature selection algorithm is proposed based on the maximum relevance and minimum redundancy criterion. Mutual information is used to measure the relevancy of each feature with the class variable, and redundancy is calculated by utilizing the relationship between candidate features, selected features, and the class variable. The effectiveness is tested on ten benchmark datasets from the UCI Machine Learning Repository. The experimental results show better performance when compared with some existing algorithms.
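The criterion lends itself to a greedy search: at each step, pick the candidate feature with the best relevance-minus-redundancy score. The sketch below uses scikit-learn's mutual information estimators with a mean-pairwise redundancy term; this is a generic mRMR-style approximation, not the paper's exact formulation.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

def greedy_mrmr(X, y, k):
    """Greedily pick k features maximizing relevance minus mean redundancy."""
    relevance = mutual_info_classif(X, y, random_state=0)  # I(feature; class)
    selected, candidates = [], list(range(X.shape[1]))
    for _ in range(k):
        def score(j):
            red = (np.mean([mutual_info_regression(X[:, [s]], X[:, j],
                                                   random_state=0)[0]
                            for s in selected])
                   if selected else 0.0)          # mean MI with chosen features
            return relevance[j] - red
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 8))
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=150)   # feature 1 duplicates feature 0
y = (X[:, 0] + X[:, 4] > 0).astype(int)
print(greedy_mrmr(X, y, k=3))                     # should avoid picking both 0 and 1
```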


Author(s):  
Qi-Guang Miao
Ying Cao
Jian-Feng Song
Jiachen Liu
Yining Quan

In a learning process, features play a fundamental role. In this paper, we propose a Boosting-based feature selection algorithm called BoostFS. It extends AdaBoost, which is designed for classification problems, to feature selection. BoostFS maintains a distribution over training samples that is initialized from the uniform distribution. In each iteration, a decision stump is trained under the sample distribution, and the sample distribution is then adjusted to be orthogonal to the classification results of all the generated stumps. Because a decision stump can also be regarded as one selected feature, BoostFS is capable of selecting a subset of features that are as mutually irrelevant as possible. Experimental results on synthetic datasets, five UCI datasets, and a real malware detection dataset all show that the features selected by BoostFS help to improve learning algorithms in classification problems, especially when the original feature set contains redundant features.
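A loose approximation of the idea can be built from off-the-shelf AdaBoost: boost decision stumps and collect the feature each stump splits on. Note this uses standard AdaBoost reweighting, not the orthogonal distribution update that distinguishes BoostFS, and all names below are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=20,
                           n_informative=5, random_state=0)
booster = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # decision stumps
    n_estimators=30, random_state=0).fit(X, y)      # ("base_estimator" in scikit-learn < 1.2)

# Each fitted stump splits on one feature; their union is the selected subset.
selected = sorted({s.tree_.feature[0] for s in booster.estimators_
                   if s.tree_.feature[0] >= 0})     # -2 marks an unsplit stump
print(selected)
```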


2017
Vol 16 (05)
pp. 1309-1338
Author(s):
Pin Wang
Yongming Li
Bohan Chen
Xianling Hu
Jin Yan
...  

Feature selection is an important research field for pattern classification, data mining, etc. Population-based optimization algorithms (POA) have high parallelism and are widely used as search algorithms for feature selection. Population-based feature selection algorithms (PFSA) involve a compromise between precision and time cost; to optimize them, the underlying feature selection models need to be improved. Feature selection algorithms broadly fall into two categories: the filter model and the wrapper model. The filter model is fast but less precise, while the wrapper model is more precise but generally computationally more intensive. In this paper, we propose a new mechanism, the proportional hybrid mechanism (PHM), to combine the advantages of the filter and wrapper models. The mechanism can be applied in PFSA to improve their performance. The genetic algorithm (GA) has been applied to many kinds of feature selection problems as a search algorithm because of its high efficiency and implicit parallelism, so GAs are used in this paper. To validate the mechanism, seven datasets from the University of California Irvine (UCI) database and artificial toy datasets are tested. The experiments are carried out for different GAs, classifiers, and evaluation criteria; the results show that, with the introduction of PHM, the GA-based feature selection algorithm improves in both time cost and classification accuracy. Moreover, the comparison of GA-based, PSO-based, and some other feature selection algorithms demonstrates that PHM can be used in other population-based feature selection algorithms and obtain satisfying results.
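One plausible reading of a filter/wrapper hybrid fitness is sketched below: each GA chromosome (a binary feature mask) receives a cheap filter score blended with a costly cross-validation wrapper score. The blending weight `alpha`, the classifier, and all names are assumptions for illustration; the paper's proportioning rule may differ.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def hybrid_fitness(mask, X, y, relevance, alpha=0.5):
    """Blend a cheap filter term with a costly wrapper term for one
    GA chromosome (mask is a boolean feature-inclusion vector)."""
    if not mask.any():
        return 0.0
    filter_term = relevance[mask].mean() / max(relevance.max(), 1e-12)  # fast
    wrapper_term = cross_val_score(KNeighborsClassifier(),              # slow
                                   X[:, mask], y, cv=3).mean()
    return alpha * wrapper_term + (1 - alpha) * filter_term

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 10))
y = (X[:, 0] + X[:, 3] > 0).astype(int)
relevance = mutual_info_classif(X, y, random_state=0)
mask = rng.integers(0, 2, 10).astype(bool)        # one random chromosome
print(hybrid_fitness(mask, X, y, relevance))
```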


2021
Vol 2021
pp. 1-10
Author(s):  
Li Zhang

Feature selection is the key step in the analysis of high-dimensional small-sample data. The core of feature selection is to analyse and quantify the correlation between features and class labels and the redundancy between features. However, most existing feature selection algorithms consider only the classification contribution of individual features and ignore the influence of interfeature redundancy and correlation. Therefore, this paper proposes a nonlinear dynamic conditional relevance feature selection algorithm (NDCRFS), built on a study and analysis of existing feature selection ideas and methods. Firstly, redundancy and relevance between features, and between features and class labels, are discriminated by mutual information, conditional mutual information, and interactive mutual information. Secondly, the selected features and candidate features are dynamically weighted utilizing information gain factors. Finally, to evaluate its performance, NDCRFS was validated against six other feature selection algorithms on three classifiers, using 12 different data sets, comparing variability and classification metrics between the algorithms. The experimental results show that the NDCRFS method can improve the quality of the feature subsets and obtain better classification results.
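The conditional mutual information term that NDCRFS builds on can be estimated directly for discrete variables. Below is a minimal, self-contained estimator of I(X; Y | Z) in nats; the dynamic weighting with information gain factors is specific to the paper and not reproduced here.

```python
import numpy as np
from collections import Counter

def cmi(x, y, z):
    """I(X; Y | Z) = sum p(x,y,z) * log[ p(z) p(x,y,z) / (p(x,z) p(y,z)) ]."""
    n = len(x)
    pxyz, pxz = Counter(zip(x, y, z)), Counter(zip(x, z))
    pyz, pz = Counter(zip(y, z)), Counter(z)
    total = 0.0
    for (xi, yi, zi), c in pxyz.items():
        p = c / n
        total += p * np.log(p * (pz[zi] / n) /
                            ((pxz[(xi, zi)] / n) * (pyz[(yi, zi)] / n)))
    return total

rng = np.random.default_rng(0)
z = rng.integers(0, 2, 5000)
n1 = (rng.random(5000) < 0.25).astype(int)
n2 = (rng.random(5000) < 0.25).astype(int)
x, y = z ^ n1, z ^ n2            # x and y depend on each other only through z
print(round(cmi(x, y, z), 4))    # ~0: independent once z is known
print(round(cmi(x, z, n2), 4))   # >0: x depends on z given the irrelevant n2
```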


2021
Vol 10 (6)
pp. 3501-3506
Author(s):
S. J. Sushma
Tsehay Admassu Assegie
D. C. Vinutha
S. Padmashree

Irrelevant features in a heart disease dataset affect the performance of a binary classification model. Consequently, eliminating irrelevant and redundant features from the training set with a feature selection algorithm significantly improves the performance of the classification model on heart disease detection. Sequential feature selection (SFS) is a successful algorithm for improving the performance of the classification model on heart disease detection while reducing computational time complexity. In this study, the SFS algorithm is implemented to improve classifier performance on heart disease detection by removing irrelevant features and training a model on the optimal features. Furthermore, exhaustive and permutation-based feature selection algorithms are implemented and compared with the SFS algorithm. The implemented and existing feature selection algorithms are evaluated using the real-world Pima Indian heart disease dataset, and the results show that the SFS algorithm outperforms the exhaustive and permutation-based feature selection algorithms. Overall, the results look promising, and a more effective heart disease detection model is developed with an accuracy of 99.3%.
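For reference, scikit-learn ships a generic sequential feature selector; the sketch below runs greedy forward selection with a logistic regression scorer on a stand-in dataset, not the study's heart disease data.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
sfs = SequentialFeatureSelector(
    make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    n_features_to_select=5,   # stop after five features
    direction='forward',      # greedy forward selection
    cv=5)                     # candidate sets scored by cross-validation
sfs.fit(X, y)
print(sfs.get_support(indices=True))  # indices of the selected features
```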


Entropy
2021
Vol 23 (6)
pp. 704
Author(s):
Jiucheng Xu
Kanglin Qu
Meng Yuan
Jie Yang

Feature selection is one of the core contents of rough set theory and its applications. Since the reduction ability and classification performance of many feature selection algorithms based on rough set theory and its extensions are not ideal, this paper proposes a feature selection algorithm that combines the information-theoretic view and the algebraic view in the neighborhood decision system. First, the neighborhood relationship in the neighborhood rough set model is used to retain the classification information of continuous data and to study some uncertainty measures of neighborhood information entropy. Second, to fully reflect the decision ability and classification performance of the neighborhood system, neighborhood credibility and neighborhood coverage are defined and introduced into the neighborhood joint entropy. Third, a feature selection algorithm based on neighborhood joint entropy is designed, remedying the shortcoming that most feature selection algorithms consider only the information-theoretic definition or the algebraic definition. Finally, experiments and statistical analyses on nine data sets show that the algorithm can effectively select the optimal feature subset, and the selection result can maintain or improve the classification performance of the data set.
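The delta-neighborhood construction underlying these entropy measures is easy to sketch: normalize the chosen features, count each sample's neighbors within radius delta, and aggregate the neighborhood sizes into an entropy. The credibility and coverage weighting of the paper's joint entropy is not reproduced; `delta` and all names are illustrative.

```python
import numpy as np
from scipy.spatial.distance import cdist

def neighborhood_entropy(X, features, delta=0.2):
    """-(1/n) * sum_i log(|N_delta(x_i)| / n), with neighborhoods taken
    under the candidate feature subset after min-max normalization."""
    Xs = X[:, features].astype(float)
    Xs = (Xs - Xs.min(axis=0)) / (np.ptp(Xs, axis=0) + 1e-12)
    n = len(Xs)
    sizes = (cdist(Xs, Xs) <= delta).sum(axis=1)  # |N_delta(x_i)|, includes self
    return -np.mean(np.log(sizes / n))

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
print(neighborhood_entropy(X, [0, 1]))      # adding features shrinks
print(neighborhood_entropy(X, [0, 1, 2]))   # neighborhoods, raising entropy
```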


Algorithms
2021
Vol 14 (3)
pp. 100
Author(s):
Werner Mostert
Katherine M. Malan
Andries P. Engelbrecht

This study presents a novel performance metric for feature selection algorithms that is unbiased and can be used for comparative analysis across feature selection problems. The baseline fitness improvement (BFI) measure quantifies the potential value gained by applying feature selection. It can be used to compare the performance of feature selection algorithms across datasets by measuring the change in classifier performance resulting from feature selection, with respect to the baseline where all features are included. Empirical results show that there is performance complementarity for a suite of feature selection algorithms on a variety of real-world datasets. Because BFI is a normalised performance metric, it can be used to correlate problem characteristics with feature selection algorithm performance across multiple datasets. This ability paves the way towards describing the performance space of the per-instance algorithm selection problem for feature selection algorithms.
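In its simplest reading, a BFI-style quantity is the cross-validated performance of a classifier on the selected subset minus its performance on the full feature set. The sketch below computes that difference; the paper's exact normalisation may differ, and the subset here is an arbitrary example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
clf = KNeighborsClassifier()

baseline = cross_val_score(clf, X, y, cv=5).mean()     # all features included
selected = [0, 3, 7, 20, 27]                           # arbitrary example subset
subset = cross_val_score(clf, X[:, selected], y, cv=5).mean()

bfi = subset - baseline   # positive: selection improved on the baseline
print(f"baseline={baseline:.3f}  subset={subset:.3f}  BFI={bfi:+.3f}")
```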


Author(s):  
Yuanyuan Han
Lan Huang
Fengfeng Zhou

Abstract
Motivation: A feature selection algorithm may select the subset of features with the best associations with the class labels. Recursive feature elimination (RFE) is a heuristic feature screening framework and has been widely used to select biological OMIC biomarkers. This study proposes a dynamic recursive feature elimination (dRFE) framework with more flexible feature elimination operations. The proposed dRFE was comprehensively compared with 11 existing feature selection algorithms and five classifiers on eight difficult transcriptome datasets from a previous study, ten newly collected transcriptome datasets, and five methylome datasets.
Results: The experimental data suggest that the regular RFE framework did not perform well, and dRFE outperformed the existing feature selection algorithms in most cases. The dRFE-detected features achieved Acc = 1.0000 for the two methylome datasets GSE53045 and GSE66695. The best prediction accuracies of the dRFE-detected features were 0.9259, 0.9424 and 0.8601 for the other three methylome datasets GSE74845, GSE103186 and GSE80970, respectively. Four transcriptome datasets received Acc = 1.0000 using the dRFE-detected features, and the prediction accuracies for the other six newly collected transcriptome datasets were between 0.6301 and 0.9917.
Availability and implementation: The experiments in this study are implemented and tested using Python version 3.7.6.
Supplementary information: Supplementary data are available at Bioinformatics online.
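For orientation, standard RFE, the framework dRFE generalises, is available in scikit-learn; the sketch below eliminates one feature per round with a linear model, whereas dRFE's more flexible dynamic elimination operations are not part of this API.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=50,
                           n_informative=8, random_state=0)
rfe = RFE(LogisticRegression(max_iter=5000),
          n_features_to_select=8,   # stop once eight features remain
          step=1)                   # eliminate one feature per round
rfe.fit(X, y)
print(rfe.get_support(indices=True))  # indices of the surviving features
```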

