A Hybrid Feature Selection Based on Mutual Information and Genetic Algorithm

Author(s):  
Yuan-Dong Lan

Feature selection aims to choose an optimal subset of features that is necessary and sufficient to improve both the generalization performance and the running efficiency of a learning algorithm. To find this optimal subset, a hybrid feature selection method based on mutual information and a genetic algorithm is proposed in this paper. To exploit the complementary advantages of the filter and wrapper models, the algorithm is divided into two phases: a filter phase and a wrapper phase. In the filter phase, the algorithm ranks the features by mutual information, providing heuristic information that accelerates the search of the subsequent genetic algorithm. In the wrapper phase, the genetic algorithm serves as the search strategy, using classifier performance and subset dimension as the evaluation criteria to find the best feature subset. Experimental results on benchmark datasets show that the proposed algorithm achieves higher classification accuracy with smaller feature dimension, and runs faster than the genetic algorithm alone.
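The filter phase described above can be sketched roughly as follows for discrete features; the function names are illustrative and not taken from the paper:

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """I(X;Y) in nats for two discrete sequences of equal length."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    return sum((c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def rank_features(columns, labels):
    """Filter phase: order feature indices by decreasing MI with the label,
    giving the wrapper-phase GA its heuristic starting point."""
    return sorted(range(len(columns)),
                  key=lambda i: mutual_information(columns[i], labels),
                  reverse=True)
```

The ranking costs one MI evaluation per feature, which is what lets it prune the GA's search space cheaply before the expensive wrapper phase begins.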

Author(s):  
Gang Liu ◽  
Chunlei Yang ◽  
Sen Liu ◽  
Chunbao Xiao ◽  
Bin Song

A feature selection method based on mutual information and the support vector machine (SVM) is proposed in order to eliminate redundant features and improve classification accuracy. First, the local correlation between features and the overall correlation are calculated by mutual information. Because this correlation reflects the information-inclusion relationship between features, the features are evaluated and redundant features are eliminated by analyzing it. Subsequently, the concept of mean impact value (MIV) is defined, and the influence of each input variable on the output of the SVM network is calculated based on MIV. The importance weights of the features, described by MIV, are sorted in descending order. Finally, the SVM classifier performs feature selection according to the classification accuracy of feature combinations, taking the MIV order of the features as a reference. Simulation experiments on three standard UCI datasets show that this method not only effectively reduces the feature dimension while maintaining high classification accuracy, but also ensures good robustness.
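The MIV step can be sketched roughly as below; here `model` stands in for the trained SVM, and the ±10% perturbation size is an assumption of mine, not a value from the paper:

```python
def mean_impact_value(model, samples, feature_idx, delta=0.1):
    """MIV sketch: perturb one input feature by +/-delta (relative),
    run the model on both perturbed copies, and average the absolute
    change in output over the sample set."""
    total = 0.0
    for x in samples:
        up, down = list(x), list(x)
        up[feature_idx] *= 1 + delta
        down[feature_idx] *= 1 - delta
        total += abs(model(up) - model(down))
    return total / len(samples)
```

Sorting features by this score in descending order yields the reference order in which feature combinations are then tested against SVM classification accuracy.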


Author(s):  
Rahul Hans ◽  
Harjot Kaur

The literature acknowledges that high breast-tissue density is a root cause of the rising incidence of breast cancer among women and a prime contributor to cancer deaths among women. Moreover, in an era where computer-aided diagnosis systems have become the right hand of radiologists, researchers still find room for improvement in feature selection techniques. This research proposes hybrid versions of Biogeography-Based Optimization and the Genetic Algorithm for feature selection in breast density classification, aiming to remove redundant and irrelevant features from the dataset and to achieve superior classification accuracy, or to maintain the same accuracy with fewer features. For experimentation, 322 mammogram images from the mini-MIAS database are chosen, and Regions of Interest (ROIs) of seven different sizes are extracted, yielding a set of 45 texture features for each ROI. The proposed algorithms are then used to extract an optimal subset from this large feature set for each ROI. The results show that the proposed algorithms outperform several other nature-inspired metaheuristic algorithms when compared on various parameters.
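Wrapper methods of this kind score each candidate subset by trading classifier accuracy against subset size. A minimal sketch of such a fitness function, assuming a 0/1 bitmask encoding and an illustrative trade-off weight `alpha` (neither is taken from the paper):

```python
def subset_fitness(mask, accuracy, alpha=0.9):
    """Score a candidate feature subset encoded as a 0/1 mask.
    Higher is better: alpha weights classifier accuracy against
    the fraction of features left out."""
    if not mask:
        raise ValueError("empty mask")
    reduction = 1 - sum(mask) / len(mask)
    return alpha * accuracy + (1 - alpha) * reduction
```

With this objective, two subsets of equal accuracy are separated by size, which is exactly the "same accuracy with fewer features" goal stated above.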


BMC Genomics ◽  
2020 ◽  
Vol 21 (1) ◽  
Author(s):  
Zhixun Zhao ◽  
Xiaocai Zhang ◽  
Fang Chen ◽  
Liang Fang ◽  
Jinyan Li

Abstract Background DNA N4-methylcytosine (4mC) is a critical epigenetic modification and has various roles in the restriction-modification system. Due to the high cost of experimental laboratory detection, computational methods using sequence characteristics and machine learning algorithms have been explored to identify 4mC sites from DNA sequences. However, state-of-the-art methods have limited performance because of the lack of effective sequence features and the ad hoc choice of learning algorithms to cope with this problem. This paper proposes a new sequence feature space and a machine learning algorithm with a feature selection scheme to address the problem. Results The feature importance score distributions in datasets of six species are first reported and analyzed. Then the impact of feature selection on model performance is evaluated by independent testing on benchmark datasets, where ACC and MCC measurements after feature selection increase by 2.3% to 9.7% and 0.05 to 0.19, respectively. The proposed method is compared with three state-of-the-art predictors using independent tests and 10-fold cross-validation, and our method outperforms them on all datasets, especially improving ACC by 3.02% to 7.89% and MCC by 0.06 to 0.15 in the independent test. Two detailed case studies confirm the excellent overall performance of the proposed method, which correctly identified 24 of 26 4mC sites in the C. elegans gene and 126 of 137 4mC sites in the D. melanogaster gene. Conclusions The results show that the proposed feature space and learning algorithm with feature selection can improve the performance of DNA 4mC prediction on the benchmark datasets. The two case studies prove the effectiveness of our method in practical situations.
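For reference, the MCC figures reported above are computed from the binary confusion matrix; a minimal sketch:

```python
import math

def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient for a binary confusion matrix.
    Returns 0.0 when any marginal count is zero (the usual convention)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

MCC ranges from -1 to +1 and, unlike ACC, stays informative on the imbalanced positive/negative site counts typical of 4mC benchmarks, which is why both metrics are reported.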


2018 ◽  
Vol 6 (1) ◽  
pp. 58-72
Author(s):  
Omar A. M. Salem ◽  
Liwei Wang

Building classification models from real-world datasets has become a difficult task, especially for datasets with high-dimensional features. Unfortunately, these datasets may include irrelevant or redundant features, which have a negative effect on classification performance. Selecting the significant features and eliminating undesirable ones can improve classification models. Fuzzy mutual information is a widely used feature selection criterion for finding the best feature subset before the classification process; however, it requires considerable computation and storage space. To overcome these limitations, this paper proposes an improved fuzzy mutual information feature selection method based on representative samples. Experiments on benchmark datasets show that the proposed method achieves better results in terms of classification accuracy, selected feature subset size, storage, and stability.


Author(s):  
J Qu ◽  
Z Liu ◽  
M J Zuo ◽  
H-Z Huang

Feature selection is an effective way to improve classification, reduce feature dimension, and speed up computation. This work studies a previously reported support vector machine (SVM) based method of feature selection. Our results reveal discrepancies in both its feature ranking and its feature selection schemes. Modifications are thus made, upon which our own SVM-based feature selection method is proposed. Using a weighting-fusion technique and the one-against-all approach, the binary model is extended to multi-class classification problems. Three benchmark datasets are employed to demonstrate the performance of the proposed method. The multi-class model of the proposed method is also applied to feature selection in classifying planetary gear damage degrees. The results on all datasets show that the proposed method consistently enables effective classification.
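A rough sketch of the weighting-fusion idea for the multi-class case, assuming (as SVM-RFE-style methods do) that a feature's importance in each one-against-all binary classifier is read off from its weight magnitude; the exact fusion rule and names here are my own assumptions:

```python
def fuse_feature_ranking(class_weight_vectors):
    """Combine per-class weight vectors from one-against-all binary SVMs
    into a single feature ranking by summing squared weights per feature."""
    n_features = len(class_weight_vectors[0])
    scores = [sum(w[j] ** 2 for w in class_weight_vectors)
              for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)
```

Squaring before summing makes the fusion sign-insensitive, so a feature that pushes strongly toward one class and away from another still ranks high.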


2012 ◽  
Vol 165 ◽  
pp. 232-236 ◽  
Author(s):  
Mohd Haniff Osman ◽  
Z.M. Nopiah ◽  
S. Abdullah

Having relevant features to represent a dataset enables learning algorithms to build a highly accurate classification system in less time. Unfortunately, one good set of features sometimes does not fit all learning algorithms. To confirm that the choice of learning algorithm does not affect system accuracy, the user has to verify that the given dataset is a feature-oriented dataset. In this study we therefore propose a simple verification procedure based on a multi-objective approach using the elitist Non-dominated Sorting Genetic Algorithm (NSGA-II). NSGA-II operates here much as it would in a feature selection procedure, except in the interpretation of the results, i.e. the set of optimal solutions. Two conflicting minimization objectives, classification error and the number of used features, are taken as the objective functions. A case study of fatigue segment classification was chosen for this study, with simulations repeated using four single classifiers: Naive Bayes, k-nearest neighbours, decision tree, and radial basis function. The proposed procedure demonstrates that only two features are needed to classify a fatigue segment task, without concern for the choice of learning algorithm.
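The interpretation step of such a multi-objective run, reading off the non-dominated trade-offs between classification error and feature count, can be sketched as follows (a minimal sketch of Pareto dominance, not the full NSGA-II sorting):

```python
def dominates(a, b):
    """True if solution a is no worse than b in every objective and
    strictly better in at least one (both objectives minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(solutions):
    """First non-dominated front over (classification_error, n_features) pairs."""
    return [s for s in solutions
            if not any(dominates(t, s) for t in solutions if t != s)]
```

If the front collapses to solutions with the same small feature count across all four classifiers, the dataset is feature-oriented in the sense used above.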

