Feature Selection Based on Minimizing the Area Under the Detection Error Tradeoff Curve

2011 ◽  
Vol 2 (1) ◽  
pp. 18-33 ◽  
Author(s):  
Liau Heng Fui ◽  
Dino Isa

Feature selection is crucial for selecting an “optimized” subset of features from the original feature set according to a given objective function. In general, feature selection removes redundant or irrelevant data while retaining classification accuracy. This paper proposes a feature selection algorithm that aims to minimize the area under the detection error tradeoff (DET) curve. Particle swarm optimization (PSO) is employed to search for the optimal feature subset. The proposed method is implemented in face recognition and iris recognition systems. The results show that the proposed method finds an optimal subset of features that sufficiently describes iris and face images, removing unwanted and redundant features while improving classification accuracy in terms of total error rate (TER).
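A minimal sketch of the idea, not the authors' implementation: a binary PSO whose fitness is the area under the DET curve of a verifier trained on the candidate subset. For a score-based classifier the area under the DET curve (FNR vs. FPR) equals 1 − ROC AUC, which is what the fitness uses here. The dataset, classifier and PSO constants are illustrative placeholders (the paper works on face and iris features).

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X, y = load_breast_cancer(return_X_y=True)            # placeholder data
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
n_features = X.shape[1]

def det_area(particle):
    """Fitness: approximate area under the DET curve for the selected subset."""
    mask = particle.astype(bool)
    if mask.sum() == 0:
        return 1.0                                     # worst fitness for an empty subset
    clf = LogisticRegression(max_iter=5000).fit(X_tr[:, mask], y_tr)
    scores = clf.predict_proba(X_te[:, mask])[:, 1]
    return 1.0 - roc_auc_score(y_te, scores)           # = area under FNR-vs-FPR curve

# Binary PSO with illustrative hyper-parameters.
n_particles, n_iters, w, c1, c2 = 20, 30, 0.7, 1.5, 1.5
pos = (rng.random((n_particles, n_features)) > 0.5).astype(float)
vel = rng.normal(0, 1, (n_particles, n_features))
pbest, pbest_fit = pos.copy(), np.array([det_area(p) for p in pos])
gbest = pbest[pbest_fit.argmin()].copy()

for _ in range(n_iters):
    r1, r2 = rng.random(vel.shape), rng.random(vel.shape)
    vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
    pos = (rng.random(vel.shape) < 1.0 / (1.0 + np.exp(-vel))).astype(float)  # sigmoid transfer
    fit = np.array([det_area(p) for p in pos])
    improved = fit < pbest_fit
    pbest[improved], pbest_fit[improved] = pos[improved], fit[improved]
    gbest = pbest[pbest_fit.argmin()].copy()

print("selected features:", np.flatnonzero(gbest), "DET area:", pbest_fit.min())
```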


2012 ◽  
Vol 532-533 ◽  
pp. 1497-1502
Author(s):  
Hong Mei Li ◽  
Lin Gen Yang ◽  
Li Hua Zou

To obtain a feature subset that yields a higher classification accuracy rate, a feature selection method based on genetic algorithms and support vector machines is proposed. First, the ReliefF algorithm provides a priori information to the GA, and the parameters of the support vector machine are mixed into the genetic encoding; the genetic algorithm then searches for the optimal combination of feature subset and SVM parameters. Experimental results show that the proposed algorithm achieves a higher classification accuracy rate with a smaller feature subset.
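A rough sketch of the kind of mixed chromosome described above, under stated assumptions: feature bits and SVM hyper-parameters share one genome, and a filter prior biases the initial population. ReliefF is replaced by ANOVA F-scores here purely for self-containment, and all GA constants, the dataset and the parameter ranges are illustrative, not the paper's values.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.feature_selection import f_classif
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X, y = load_wine(return_X_y=True)                      # placeholder data
n_feat = X.shape[1]

# Filter prior: the probability that a feature bit starts "on" grows with its score.
scores, _ = f_classif(X, y)
prior = 0.3 + 0.6 * (scores - scores.min()) / (scores.max() - scores.min() + 1e-12)

def decode(chrom):
    mask = chrom[:n_feat] > 0.5
    C = 10 ** (chrom[n_feat] * 4 - 2)                  # C in [1e-2, 1e2]
    gamma = 10 ** (chrom[n_feat + 1] * 4 - 3)          # gamma in [1e-3, 1e1]
    return mask, C, gamma

def fitness(chrom):
    mask, C, gamma = decode(chrom)
    if mask.sum() == 0:
        return 0.0
    return cross_val_score(SVC(C=C, gamma=gamma), X[:, mask], y, cv=3).mean()

pop = np.c_[rng.random((30, n_feat)) < prior, rng.random((30, 2))].astype(float)
for _ in range(25):
    fit = np.array([fitness(c) for c in pop])
    parents = pop[np.argsort(fit)[-15:]]                              # truncation selection
    children = parents[rng.integers(0, 15, 30)].copy()
    cut = rng.integers(1, n_feat + 1)
    children[:15, :cut] = parents[rng.integers(0, 15, 15), :cut]      # crude one-point crossover
    mut = rng.random(children.shape) < 0.05
    children[mut] = rng.random(mut.sum())                             # uniform mutation
    pop = children

best = pop[np.argmax([fitness(c) for c in pop])]
mask, C, gamma = decode(best)
print("features:", np.flatnonzero(mask), "C=%.3g gamma=%.3g" % (C, gamma))
```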


Author(s):  
Ilangovan Sangaiya ◽  
A. Vincent Antony Kumar

In data mining, feature selection is needed to select relevant features and remove unimportant or irrelevant features from an original data set according to some evaluation criteria. Filter and wrapper are the two standard approaches; here the authors propose a hybrid feature selection method that takes advantage of both. The proposed method uses symmetrical uncertainty and genetic algorithms to select the optimal feature subset, improving processing time by reducing the dimension of the data set without compromising classification accuracy. The proposed hybrid algorithm is much faster than most existing algorithms and scales well with the size of the data set, in terms of selected features, classification accuracy and running time.
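A small sketch of the filter stage only, under stated assumptions: symmetrical uncertainty SU(X;Y) = 2·I(X;Y)/(H(X)+H(Y)) computed on quartile-discretized features and used to pre-rank candidates before a GA wrapper (the wrapper stage would mirror the GA sketch shown earlier). Dataset, discretization and the number of features kept are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import mutual_info_score

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def symmetrical_uncertainty(x, y):
    mi = mutual_info_score(x, y) / np.log(2)           # convert nats to bits
    hx, hy = entropy(x), entropy(y)
    return 2 * mi / (hx + hy) if hx + hy > 0 else 0.0

X, y = load_iris(return_X_y=True)                      # placeholder data
su = []
for j in range(X.shape[1]):
    bins = np.quantile(X[:, j], [0.25, 0.5, 0.75])
    xj = np.digitize(X[:, j], bins)                    # 4-level discretization
    su.append(symmetrical_uncertainty(xj, y))

ranking = np.argsort(su)[::-1]
print("SU per feature:", np.round(su, 3))
print("features passed on to the GA wrapper:", ranking[:2])
```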


Genes ◽  
2021 ◽  
Vol 12 (3) ◽  
pp. 354
Author(s):  
Lu Zhang ◽  
Xinyi Qin ◽  
Min Liu ◽  
Ziwei Xu ◽  
Guangzhong Liu

As a prevalent post-transcriptional modification of RNA, N6-methyladenosine (m6A) plays a crucial role in various biological processes. To reveal its regulatory mechanism and provide new insights for drug design, accurate genome-wide identification of m6A sites is vital. Because traditional experimental methods are time-consuming and cost-prohibitive, more efficient computational methods for detecting m6A sites are needed. In this study, we propose a novel cross-species computational method, DNN-m6A, based on a deep neural network (DNN) to identify m6A sites in multiple tissues of human, mouse and rat. Firstly, binary encoding (BE), tri-nucleotide composition (TNC), enhanced nucleic acid composition (ENAC), K-spaced nucleotide pair frequencies (KSNPFs), nucleotide chemical property (NCP), pseudo dinucleotide composition (PseDNC), position-specific nucleotide propensity (PSNP) and position-specific dinucleotide propensity (PSDP) are employed to extract RNA sequence features, which are then fused to construct the initial feature vector set. Secondly, we use the elastic net to eliminate redundant features and build the optimal feature subset. Finally, the hyper-parameters of the DNN are tuned with Bayesian hyper-parameter optimization based on the selected feature subset. Five-fold cross-validation tests on the training datasets show that the proposed DNN-m6A method outperforms the state-of-the-art method for predicting m6A sites, with an accuracy (ACC) of 73.58%–83.38% and an area under the curve (AUC) of 81.39%–91.04%. On the independent datasets, it achieves an ACC of 72.95%–83.04% and an AUC of 80.79%–91.09%, demonstrating the excellent generalization ability of the proposed method.
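A simplified, assumption-laden sketch of two of the named encodings, binary encoding (BE) and nucleotide chemical property (NCP), followed by an elastic-net filter that keeps only features with non-zero coefficients. The NCP table below is one common convention (values vary across papers), and the random 41-nt sequences, labels and regularization settings are toy placeholders, not the DNN-m6A pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

BE = {"A": [1, 0, 0, 0], "C": [0, 1, 0, 0], "G": [0, 0, 1, 0], "U": [0, 0, 0, 1]}
# NCP: (ring structure, hydrogen bond, chemical functionality); one common convention.
NCP = {"A": [1, 1, 1], "C": [0, 1, 0], "G": [1, 0, 0], "U": [0, 0, 1]}

def encode(seq):
    """Per-nucleotide BE + NCP features, flattened into one vector."""
    return np.array([v for ch in seq for v in BE[ch] + NCP[ch]], dtype=float)

# Toy dataset: random 41-nt windows with random labels (real data would be m6A sites).
rng = np.random.default_rng(2)
seqs = ["".join(rng.choice(list("ACGU"), 41)) for _ in range(200)]
y = rng.integers(0, 2, 200)
X = np.vstack([encode(s) for s in seqs])

# Elastic-net feature elimination: keep columns whose coefficient is non-zero.
enet = LogisticRegression(penalty="elasticnet", solver="saga",
                          l1_ratio=0.5, C=0.1, max_iter=5000).fit(X, y)
kept = np.flatnonzero(np.abs(enet.coef_).ravel() > 1e-8)
print("features kept:", kept.size, "of", X.shape[1])
```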


2021 ◽  
pp. 1-18
Author(s):  
Zhang Zixian ◽  
Liu Xuning ◽  
Li Zhixiang ◽  
Hu Hongqiang

The influencing factors of coal and gas outburst are complex, and the accuracy and efficiency of current outburst prediction are not high. To extract effective features from the influencing factors and realize accurate and fast dynamic prediction of coal and gas outburst, this article proposes an outburst prediction model based on the coupling of feature selection and an intelligently optimized classifier. Firstly, in view of the redundancy and irrelevance of the influencing factors of coal and gas outburst, the Boruta feature selection method is used to obtain the optimal feature subset from the influencing factors. Secondly, based on the Apriori association rule mining method, the internal associations among the influencing factors are mined, and the strong association rules among the factors and samples that affect the classification of coal and gas outburst are extracted. Finally, an SVM is used to classify coal and gas outbursts based on the obtained optimal feature subset and sample data, a Bayesian optimization algorithm is used to optimize the kernel parameters of the SVM, and the resulting pattern recognition model is compared with existing coal and gas outburst prediction models in the literature. Compared with feature selection or association rule mining alone, the proposed model achieves the highest prediction accuracy of 93% when the feature dimension is 3, which is higher than Apriori association rules or Boruta feature selection alone, while the feature dimension is significantly reduced. The results show that the proposed model outperforms other prediction models, which further verifies the accuracy and applicability of the coupled prediction model and its high stability and robustness.
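A loose sketch of the first and last stages under stated assumptions: a Boruta-style shadow-feature test (a real feature's importance must beat the best shuffled "shadow" importance in most rounds) followed by an RBF SVM. The Apriori rule-mining stage is omitted, a small random search stands in for the Bayesian optimizer, and the dataset, thresholds and parameter ranges are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X, y = load_breast_cancer(return_X_y=True)             # placeholder data

# Boruta-style selection: keep features that beat the max shadow importance in most rounds.
hits, rounds = np.zeros(X.shape[1]), 20
for _ in range(rounds):
    shadows = rng.permuted(X, axis=0)                   # column-wise shuffled copies
    rf = RandomForestClassifier(n_estimators=100, random_state=0)
    rf.fit(np.hstack([X, shadows]), y)
    real_imp = rf.feature_importances_[: X.shape[1]]
    shadow_max = rf.feature_importances_[X.shape[1]:].max()
    hits += real_imp > shadow_max
selected = np.flatnonzero(hits > 0.6 * rounds)
if selected.size == 0:                                  # fallback for the toy run
    selected = np.argsort(hits)[-5:]
print("selected features:", selected)

# Random search stands in for Bayesian optimization of the SVM kernel parameters.
best = max(
    ((C, g, cross_val_score(SVC(C=C, gamma=g), X[:, selected], y, cv=5).mean())
     for C, g in zip(10 ** rng.uniform(-1, 3, 15), 10 ** rng.uniform(-4, 0, 15))),
    key=lambda t: t[2],
)
print("C=%.3g gamma=%.3g accuracy=%.3f" % best)
```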


Author(s):  
Arunkumar Chinnaswamy ◽  
Ramakrishnan Srinivasan

Feature selection in machine learning involves reducing the number of features (genes) and similar attributes while retaining an acceptable level of classification accuracy. This paper discusses filter-based feature selection methods such as Information Gain and the correlation coefficient. After feature selection, the selected genes are passed to five classifiers: Naïve Bayes, Bagging, Random Forest, J48 and Decision Stump. The same experiment is also performed on the raw data. Experimental results show that the filter-based approaches effectively reduce the number of gene expression levels and thereby produce a reduced feature subset that yields higher classification accuracy than the same experiment performed on the raw data. In addition, correlation-based feature selection uses far fewer genes and produces higher accuracy than the Information Gain based approach.
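A compact sketch, not the paper's code, of the comparison described above: rank features by information gain and by absolute correlation with the class label, keep the top-k of each, and compare Naïve Bayes accuracy against the raw data. The dataset and k are placeholders for the gene-expression data used in the paper.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)             # placeholder data
k = 10

ig_rank = np.argsort(mutual_info_classif(X, y, random_state=0))[::-1][:k]
corr = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
corr_rank = np.argsort(corr)[::-1][:k]

for name, idx in [("information gain", ig_rank), ("correlation", corr_rank),
                  ("raw data", np.arange(X.shape[1]))]:
    acc = cross_val_score(GaussianNB(), X[:, idx], y, cv=5).mean()
    print(f"{name:17s} ({idx.size:2d} features): accuracy = {acc:.3f}")
```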


Author(s):  
Alok Kumar Shukla ◽  
Pradeep Singh ◽  
Manu Vardhan

The explosion of high-dimensional datasets in scientific repositories has been encouraging interdisciplinary research on data mining, pattern recognition and bioinformatics. The fundamental problem for an individual Feature Selection (FS) method is to extract informative features for a classification model and to detect malignant disease at low computational cost. In addition, existing FS approaches overlook the fact that, for a given cardinality, there can be several subsets with similar information. This paper introduces a novel hybrid FS algorithm, called Filter-Wrapper Feature Selection (FWFS), for classification problems and also addresses the limitations of existing methods. In the proposed model, a front-end filter ranking method, Conditional Mutual Information Maximization (CMIM), selects the highly ranked features, while a succeeding Binary Genetic Algorithm (BGA) accelerates the search for significant feature subsets. One merit of the proposed method is that, unlike an exhaustive method, it speeds up the FS procedure without sacrificing classification accuracy on the reduced dataset when a learning model is applied to the selected subsets of features. The efficacy of the proposed FWFS method is examined with a Naive Bayes (NB) classifier, which serves as the fitness function. The effectiveness of the selected feature subsets is evaluated using numerous classifiers on five biological datasets and five UCI datasets of varied dimensionality and numbers of instances. The experimental results emphasize that the proposed method provides a significant reduction of features and outperforms the existing methods. For the microarray datasets, the lowest classification accuracy is 61.24% on the SRBCT dataset and the highest is 99.32% on Diffuse large B-cell lymphoma (DLBCL). For the UCI datasets, the lowest classification accuracy is 40.04% on Lymphography using k-nearest neighbor (k-NN) and the highest is 99.05% on Ionosphere using a support vector machine (SVM).
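A didactic sketch of the CMIM filter stage only, under stated assumptions: greedily pick the feature whose worst-case conditional mutual information with the label, given any already-selected feature, is largest. Features are discretized into quartile bins; the succeeding BGA wrapper would follow the GA pattern sketched earlier, with Naive Bayes accuracy as fitness. Dataset and the number of features kept are placeholders.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.metrics import mutual_info_score

def cond_mi(x, y, z):
    """I(X;Y|Z) for discrete arrays, averaged over the values of Z."""
    return sum((z == v).mean() * mutual_info_score(x[z == v], y[z == v])
               for v in np.unique(z))

X, y = load_wine(return_X_y=True)                      # placeholder data
Xd = np.column_stack([np.digitize(X[:, j], np.quantile(X[:, j], [0.25, 0.5, 0.75]))
                      for j in range(X.shape[1])])

selected, remaining = [], list(range(Xd.shape[1]))
for _ in range(5):                                     # keep 5 features (illustrative)
    def cmim_score(j):
        if not selected:
            return mutual_info_score(Xd[:, j], y)
        return min(cond_mi(Xd[:, j], y, Xd[:, s]) for s in selected)
    best = max(remaining, key=cmim_score)
    selected.append(best)
    remaining.remove(best)

print("CMIM-selected features:", selected)
```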


Author(s):  
ZENGLIN XU ◽  
IRWIN KING ◽  
MICHAEL R. LYU

Feature selection is an important task in pattern recognition. The Support Vector Machine (SVM) and the Minimax Probability Machine (MPM) have been successfully used as classification frameworks for feature selection. However, these paradigms cannot automatically control the balance between prediction accuracy and the number of selected features. In addition, the selected feature subsets are not stable across different data partitions. The Minimum Error Minimax Probability Machine (MEMPM) has recently been proposed for classification. In this paper, we formulate feature selection based on MEMPM to select the optimal feature subset with good stability and an automatic balance between prediction accuracy and feature subset size. Experiments against feature selection with SVM and MPM show the advantages of the proposed MEMPM formulation in stability and in automatically balancing feature subset size and prediction accuracy.
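A minimal sketch, not the paper's method, of the stability criterion the abstract emphasizes: run a feature selector on several random data partitions and report the average pairwise Jaccard similarity of the selected subsets. An L1-regularized linear model stands in for the MEMPM-based selector purely for self-containment; dataset and regularization strength are illustrative.

```python
import numpy as np
from itertools import combinations
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit

X, y = load_breast_cancer(return_X_y=True)             # placeholder data
subsets = []
for train_idx, _ in ShuffleSplit(n_splits=5, train_size=0.7, random_state=0).split(X):
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.05)
    clf.fit(X[train_idx], y[train_idx])
    subsets.append(set(np.flatnonzero(np.abs(clf.coef_).ravel() > 1e-8)))

# Pairwise Jaccard similarity of the selected subsets across partitions.
jaccards = [len(a & b) / len(a | b) for a, b in combinations(subsets, 2) if a | b]
print("selected subset sizes:", [len(s) for s in subsets])
print("mean pairwise Jaccard stability: %.3f" % np.mean(jaccards))
```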


Author(s):  
Hui Wang ◽  
Li Li Guo ◽  
Yun Lin

Automatic modulation recognition is very important for receiver design in broadband multimedia communication systems, and reasonable signal feature extraction and selection algorithms are key technologies for digital multimedia signal recognition. In this paper, information entropy is used to extract single features: power spectrum entropy, wavelet energy spectrum entropy, singular spectrum entropy and Renyi entropy. A distance-measure-based feature selection algorithm and Sequential Feature Selection (SFS) are then used to select the optimal feature subset. Finally, a BP neural network is used to classify the signal modulation. The simulation results show that the four different information entropies can be used to classify different signal modulations, and that the feature selection algorithm successfully chooses the optimal feature subset and achieves the best performance.
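A small sketch, under illustrative assumptions, of three of the entropy features used as classifier inputs: power spectrum entropy, singular spectrum entropy and Renyi entropy of a signal segment. The distance-measure/SFS selection and the BP neural network are not shown, and a toy BPSK-like waveform replaces real captures; the embedding dimension, histogram bins and Renyi order are placeholders.

```python
import numpy as np

def shannon_entropy(p):
    p = p[p > 0] / p[p > 0].sum()
    return -(p * np.log2(p)).sum()

def power_spectrum_entropy(x):
    psd = np.abs(np.fft.rfft(x)) ** 2                  # power spectral density estimate
    return shannon_entropy(psd)

def singular_spectrum_entropy(x, dim=10):
    # Embed the signal into a trajectory matrix and use its singular values.
    traj = np.lib.stride_tricks.sliding_window_view(x, dim)
    s = np.linalg.svd(traj, compute_uv=False)
    return shannon_entropy(s)

def renyi_entropy(x, alpha=2, bins=32):
    p, _ = np.histogram(x, bins=bins)
    p = p[p > 0] / p.sum()
    return np.log2((p ** alpha).sum()) / (1 - alpha)

rng = np.random.default_rng(4)
t = np.arange(1024)
bpsk = np.cos(2 * np.pi * 0.1 * t + np.pi * rng.integers(0, 2, 1024))   # toy BPSK-like signal
noisy = bpsk + 0.3 * rng.normal(size=1024)
print("PSE  :", round(power_spectrum_entropy(noisy), 3))
print("SSE  :", round(singular_spectrum_entropy(noisy), 3))
print("Renyi:", round(renyi_entropy(noisy), 3))
```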


2016 ◽  
Vol 2016 ◽  
pp. 1-10 ◽  
Author(s):  
Zhi Chen ◽  
Tao Lin ◽  
Ningjiu Tang ◽  
Xin Xia

The extensive applications of support vector machines (SVMs) require an efficient method of constructing an SVM classifier with high classification ability. The performance of an SVM crucially depends on whether the optimal feature subset and SVM parameters can be obtained efficiently. In this paper, a coarse-grained parallel genetic algorithm (CGPGA) is used to simultaneously optimize the feature subset and parameters for the SVM. The distributed topology and migration policy of the CGPGA help find the optimal feature subset and parameters in significantly shorter time and increase the quality of the solutions found. In addition, a new fitness function, which combines the classification accuracy obtained from the bootstrap method, the number of chosen features and the number of support vectors, is proposed to guide the CGPGA search toward optimal generalization error. Experimental results on 12 benchmark datasets show that the proposed approach outperforms the genetic algorithm (GA) based method and the grid search method in terms of classification accuracy, number of chosen features, number of support vectors and running time.
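A hedged sketch of the kind of composite fitness the abstract describes: bootstrap (out-of-bag) classification accuracy rewarded, and the fractions of selected features and of support vectors penalized. The weights, dataset and bootstrap count are placeholders, and the coarse-grained parallel GA (islands plus migration) that would maximize this fitness is not reproduced here.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.svm import SVC
from sklearn.utils import resample

X, y = load_breast_cancer(return_X_y=True)             # placeholder data
rng = np.random.default_rng(5)

def fitness(mask, C=1.0, gamma="scale", w_acc=0.8, w_feat=0.1, w_sv=0.1, n_boot=5):
    """Composite fitness: bootstrap accuracy minus feature and support-vector penalties."""
    if mask.sum() == 0:
        return 0.0
    accs, sv_fracs = [], []
    for b in range(n_boot):
        idx = resample(np.arange(len(y)), random_state=b)      # bootstrap sample
        oob = np.setdiff1d(np.arange(len(y)), idx)             # out-of-bag rows
        clf = SVC(C=C, gamma=gamma).fit(X[idx][:, mask], y[idx])
        accs.append(clf.score(X[oob][:, mask], y[oob]))
        sv_fracs.append(clf.n_support_.sum() / len(idx))
    feat_frac = mask.sum() / mask.size
    return w_acc * np.mean(accs) - w_feat * feat_frac - w_sv * np.mean(sv_fracs)

mask = rng.random(X.shape[1]) > 0.5
print("composite fitness of a random mask: %.3f" % fitness(mask))
```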

