scholarly journals A Sparse-Modeling Based Approach for Class Specific Feature Selection

2019 ◽  
Vol 5 ◽  
pp. e237 ◽  
Author(s):  
Davide Nardone ◽  
Angelo Ciaramella ◽  
Antonino Staiano

In this work, we propose a novel Feature Selection framework called Sparse-Modeling Based Approach for Class Specific Feature Selection (SMBA-CSFS), that simultaneously exploits the idea of Sparse Modeling and Class-Specific Feature Selection. Feature selection plays a key role in several fields (e.g., computational biology), making it possible to treat models with fewer variables which, in turn, are easier to explain, by providing valuable insights on the importance of their role, and likely speeding up the experimental validation. Unfortunately, also corroborated by the no free lunch theorems, none of the approaches in literature is the most apt to detect the optimal feature subset for building a final model, thus it still represents a challenge. The proposed feature selection procedure conceives a two-step approach: (a) a sparse modeling-based learning technique is first used to find the best subset of features, for each class of a training set; (b) the discovered feature subsets are then fed to a class-specific feature selection scheme, in order to assess the effectiveness of the selected features in classification tasks. To this end, an ensemble of classifiers is built, where each classifier is trained on its own feature subset discovered in the previous phase, and a proper decision rule is adopted to compute the ensemble responses. In order to evaluate the performance of the proposed method, extensive experiments have been performed on publicly available datasets, in particular belonging to the computational biology field where feature selection is indispensable: the acute lymphoblastic leukemia and acute myeloid leukemia, the human carcinomas, the human lung carcinomas, the diffuse large B-cell lymphoma, and the malignant glioma. SMBA-CSFS is able to identify/retrieve the most representative features that maximize the classification accuracy. With top 20 and 80 features, SMBA-CSFS exhibits a promising performance when compared to its competitors from literature, on all considered datasets, especially those with a higher number of features. Experiments show that the proposed approach may outperform the state-of-the-art methods when the number of features is high. For this reason, the introduced approach proposes itself for selection and classification of data with a large number of features and classes.

2019 ◽  
Author(s):  
Davide Nardone ◽  
Angelo Ciaramella ◽  
Antonino Staiano

In this work, we propose a novel Feature Selection framework, called Sparse-Modeling Based Approach for Class Specific Feature Selection (SMBA-CSFS), that simultaneously exploits the idea of Sparse Modeling and Class-Specific Feature Selection. Feature selection plays a key role in several fields (e.g., computational biology), making it possible to treat models with fewer variables which, in turn, are easier to explain, by providing valuable insights on the importance of their role, and might speed the experimental validation up. Unfortunately, also corroborated by the no free lunch theorems, none of the approaches in literature is the most apt to detect the optimal feature subset for building a final model, thus it still represents a challenge. The proposed feature selection procedure conceives a two steps approach: (a) a sparse modeling-based learning technique is first used to find the best subset of features, for each class of a training set; (b) the discovered feature subsets are then fed to a class-specific feature selection scheme, in order to assess the effectiveness of the selected features in classification tasks. To this end, an ensemble of classifiers is built, where each classifier is trained on its own feature subset discovered in the previous phase, and a proper decision rule is adopted to compute the ensemble responses. In order to evaluate the performance of the proposed method, extensive experiments have been performed on publicly available datasets, in particular belonging to the computational biology field where feature selection is indispensable: the acute lymphoblastic leukemia and acute myeloid leukemia, the human carcinomas, the human lung carcinomas, the diffuse large B-cell lymphoma, and the malignant glioma. SMBA-CSFS is able to identify/retrieve the most representative features that maximize the classification accuracy. With top 20 and 80 features, SMBA-CSFS exhibits a promising performance when compared to its competitors from literature, on all considered datasets, especially those with a higher number of features. Experiments show that the proposed approach might outperform the state-of-the-art methods when the number of features is high. For this reason, the introduced approach proposes itself for selection and classification of data with a large number of features and classes.


2019 ◽  
Author(s):  
Davide Nardone ◽  
Angelo Ciaramella ◽  
Antonino Staiano

In this work, we propose a novel Feature Selection framework, called Sparse-Modeling Based Approach for Class Specific Feature Selection (SMBA-CSFS), that simultaneously exploits the idea of Sparse Modeling and Class-Specific Feature Selection. Feature selection plays a key role in several fields (e.g., computational biology), making it possible to treat models with fewer variables which, in turn, are easier to explain, by providing valuable insights on the importance of their role, and might speed the experimental validation up. Unfortunately, also corroborated by the no free lunch theorems, none of the approaches in literature is the most apt to detect the optimal feature subset for building a final model, thus it still represents a challenge. The proposed feature selection procedure conceives a two steps approach: (a) a sparse modeling-based learning technique is first used to find the best subset of features, for each class of a training set; (b) the discovered feature subsets are then fed to a class-specific feature selection scheme, in order to assess the effectiveness of the selected features in classification tasks. To this end, an ensemble of classifiers is built, where each classifier is trained on its own feature subset discovered in the previous phase, and a proper decision rule is adopted to compute the ensemble responses. In order to evaluate the performance of the proposed method, extensive experiments have been performed on publicly available datasets, in particular belonging to the computational biology field where feature selection is indispensable: the acute lymphoblastic leukemia and acute myeloid leukemia, the human carcinomas, the human lung carcinomas, the diffuse large B-cell lymphoma, and the malignant glioma. SMBA-CSFS is able to identify/retrieve the most representative features that maximize the classification accuracy. With top 20 and 80 features, SMBA-CSFS exhibits a promising performance when compared to its competitors from literature, on all considered datasets, especially those with a higher number of features. Experiments show that the proposed approach might outperform the state-of-the-art methods when the number of features is high. For this reason, the introduced approach proposes itself for selection and classification of data with a large number of features and classes.


2021 ◽  
pp. 1-18
Author(s):  
Zhang Zixian ◽  
Liu Xuning ◽  
Li Zhixiang ◽  
Hu Hongqiang

The influencing factors of coal and gas outburst are complex, now the accuracy and efficiency of outburst prediction and are not high, in order to obtain the effective features from influencing factors and realize the accurate and fast dynamic prediction of coal and gas outburst, this article proposes an outburst prediction model based on the coupling of feature selection and intelligent optimization classifier. Firstly, in view of the redundancy and irrelevance of the influencing factors of coal and gas outburst, we use Boruta feature selection method obtain the optimal feature subset from influencing factors of coal and gas outburst. Secondly, based on Apriori association rules mining method, the internal association relationship between coal and gas outburst influencing factors is mined, and the strong association rules existing in the influencing factors and samples that affect the classification of coal and gas outburst are extracted. Finally, svm is used to classify coal and gas outbursts based on the above obtained optimal feature subset and sample data, and Bayesian optimization algorithm is used to optimize the kernel parameters of svm, and the coal and gas outburst pattern recognition prediction model is established, which is compared with the existing coal and gas outbursts prediction model in literatures. Compared with the method of feature selection and association rules mining alone, the proposed model achieves the highest prediction accuracy of 93% when the feature dimension is 3, which is higher than that of Apriori association rules and Boruta feature selection, and the classification accuracy is significantly improved, However, the feature dimension decreased significantly; The results show that the proposed model is better than other prediction models, which further verifies the accuracy and applicability of the coupling prediction model, and has high stability and robustness.


Author(s):  
A. Makedonas ◽  
C. Theoharatos ◽  
V. Tsagaris ◽  
V. Anastasopoulos ◽  
S. Costicoglou

SAR based ship detection and classification are important elements of maritime monitoring applications. Recently, high-resolution SAR data have opened new possibilities to researchers for achieving improved classification results. In this work, a hierarchical vessel classification procedure is presented based on a robust feature extraction and selection scheme that utilizes scale, shape and texture features in a hierarchical way. Initially, different types of feature extraction algorithms are implemented in order to form the utilized feature pool, able to represent the structure, material, orientation and other vessel type characteristics. A two-stage hierarchical feature selection algorithm is utilized next in order to be able to discriminate effectively civilian vessels into three distinct types, in COSMO-SkyMed SAR images: cargos, small ships and tankers. In our analysis, scale and shape features are utilized in order to discriminate smaller types of vessels present in the available SAR data, or shape specific vessels. Then, the most informative texture and intensity features are incorporated in order to be able to better distinguish the civilian types with high accuracy. A feature selection procedure that utilizes heuristic measures based on features’ statistical characteristics, followed by an exhaustive research with feature sets formed by the most qualified features is carried out, in order to discriminate the most appropriate combination of features for the final classification. In our analysis, five COSMO-SkyMed SAR data with 2.2m x 2.2m resolution were used to analyse the detailed characteristics of these types of ships. A total of 111 ships with available AIS data were used in the classification process. The experimental results show that this method has good performance in ship classification, with an overall accuracy reaching 83%. Further investigation of additional features and proper feature selection is currently in progress.


Author(s):  
Alok Kumar Shukla ◽  
Pradeep Singh ◽  
Manu Vardhan

The explosion of the high-dimensional dataset in the scientific repository has been encouraging interdisciplinary research on data mining, pattern recognition and bioinformatics. The fundamental problem of the individual Feature Selection (FS) method is extracting informative features for classification model and to seek for the malignant disease at low computational cost. In addition, existing FS approaches overlook the fact that for a given cardinality, there can be several subsets with similar information. This paper introduces a novel hybrid FS algorithm, called Filter-Wrapper Feature Selection (FWFS) for a classification problem and also addresses the limitations of existing methods. In the proposed model, the front-end filter ranking method as Conditional Mutual Information Maximization (CMIM) selects the high ranked feature subset while the succeeding method as Binary Genetic Algorithm (BGA) accelerates the search in identifying the significant feature subsets. One of the merits of the proposed method is that, unlike an exhaustive method, it speeds up the FS procedure without lancing of classification accuracy on reduced dataset when a learning model is applied to the selected subsets of features. The efficacy of the proposed (FWFS) method is examined by Naive Bayes (NB) classifier which works as a fitness function. The effectiveness of the selected feature subset is evaluated using numerous classifiers on five biological datasets and five UCI datasets of a varied dimensionality and number of instances. The experimental results emphasize that the proposed method provides additional support to the significant reduction of the features and outperforms the existing methods. For microarray data-sets, we found the lowest classification accuracy is 61.24% on SRBCT dataset and highest accuracy is 99.32% on Diffuse large B-cell lymphoma (DLBCL). In UCI datasets, the lowest classification accuracy is 40.04% on the Lymphography using k-nearest neighbor (k-NN) and highest classification accuracy is 99.05% on the ionosphere using support vector machine (SVM).


Author(s):  
ZENGLIN XU ◽  
IRWIN KING ◽  
MICHAEL R. LYU

Feature selection is an important task in pattern recognition. Support Vector Machine (SVM) and Minimax Probability Machine (MPM) have been successfully used as the classification framework for feature selection. However, these paradigms cannot automatically control the balance between prediction accuracy and the number of selected features. In addition, the selected feature subsets are also not stable in different data partitions. Minimum Error Minimax Probability Machine (MEMPM) has been proposed for classification recently. In this paper, we outline MEMPM to select the optimal feature subset with good stability and automatic balance between prediction accuracy and the size of feature subset. The experiments against feature selection with SVM and MPM show the advantages of the proposed MEMPM formulation in stability and automatic balance between the feature subset size and the prediction accuracy.


Author(s):  
Hui Wang ◽  
Li Li Guo ◽  
Yun Lin

Automatic modulation recognition is very important for the receiver design in the broadband multimedia communication system, and the reasonable signal feature extraction and selection algorithm is the key technology of Digital multimedia signal recognition. In this paper, the information entropy is used to extract the single feature, which are power spectrum entropy, wavelet energy spectrum entropy, singular spectrum entropy and Renyi entropy. And then, the feature selection algorithm of distance measurement and Sequential Feature Selection(SFS) are presented to select the optimal feature subset. Finally, the BP neural network is used to classify the signal modulation. The simulation result shows that the four-different information entropy can be used to classify different signal modulation, and the feature selection algorithm is successfully used to choose the optimal feature subset and get the best performance.


Feature selection in multispectral high dimensional information is a hard labour machine learning problem because of the imbalanced classes present in the data. The existing Most of the feature selection schemes in the literature ignore the problem of class imbalance by choosing the features from the classes having more instances and avoiding significant features of the classes having less instances. In this paper, SMOTE concept is exploited to produce the required samples form minority classes. Feature selection model is formulated with the objective of reducing number of features with improved classification performance. This model is based on dimensionality reduction by opt for a subset of relevant spectral, textural and spatial features while eliminating the redundant features for the purpose of improved classification performance. Binary ALO is engaged to solve the feature selection model for optimal selection of features. The proposed ALO-SVM with wrapper concept is applied to each potential solution obtained during optimization step. The working of this methodology is tested on LANDSAT multispectral image.


The optimal feature subset selection over very high dimensional data is a vital issue. Even though the optimal features are selected, the classification of those selected features becomes a key complicated task. In order to handle these problems, a novel, Accelerated Simulated Annealing and Mutation Operator (ASAMO) feature selection algorithm is suggested in this work. For solving the classification problem, the Fuzzy Minimal Consistent Class Subset Coverage (FMCCSC) problem is introduced. In FMCCSC, consistent subset is combined with the K-Nearest Neighbour (KNN) classifier known as FMCCSC-KNN classifier. The two data sets Dorothea and Madelon from UCI machine repository are experimented for optimal feature selection and classification. The experimental results substantiate the efficiency of proposed ASAMO with FMCCSC-KNN classifier compared to Particle Swarm Optimization (PSO) and Accelerated PSO feature selection algorithms.


2020 ◽  
Vol 17 (5) ◽  
pp. 721-730
Author(s):  
Kamal Bashir ◽  
Tianrui Li ◽  
Mahama Yahaya

The most frequently used machine learning feature ranking approaches failed to present optimal feature subset for accurate prediction of defective software modules in out-of-sample data. Machine learning Feature Selection (FS) algorithms such as Chi-Square (CS), Information Gain (IG), Gain Ratio (GR), RelieF (RF) and Symmetric Uncertainty (SU) perform relatively poor at prediction, even after balancing class distribution in the training data. In this study, we propose a novel FS method based on the Maximum Likelihood Logistic Regression (MLLR). We apply this method on six software defect datasets in their sampled and unsampled forms to select useful features for classification in the context of Software Defect Prediction (SDP). The Support Vector Machine (SVM) and Random Forest (RaF) classifiers are applied on the FS subsets that are based on sampled and unsampled datasets. The performance of the models captured using Area Ander Receiver Operating Characteristics Curve (AUC) metrics are compared for all FS methods considered. The Analysis Of Variance (ANOVA) F-test results validate the superiority of the proposed method over all the FS techniques, both in sampled and unsampled data. The results confirm that the MLLR can be useful in selecting optimal feature subset for more accurate prediction of defective modules in software development process


Sign in / Sign up

Export Citation Format

Share Document