A straightforward feature selection method based on mean ratio for classifiers

2021 ◽  
pp. 1-12
Author(s):  
Emmanuel Tavares ◽  
Alisson Marques Silva ◽  
Gray Farias Moita ◽  
Rodrigo Tomas Nogueira Cardoso

Feature Selection (FS) is currently a very important and prominent research area. The focus of FS is to identify and remove irrelevant and redundant features from large data sets in order to reduce processing time and to improve the predictive ability of the algorithms. Thus, this work presents a straightforward and efficient FS method based on the mean ratio of the attributes (features) associated with each class. The proposed filtering method, here called MRFS (Mean Ratio Feature Selection), relies only on equations with low computational cost, using basic mathematical operations such as addition, division, and comparison. Initially, in the MRFS method, the mean of each attribute is computed over the samples associated with each output class. Then, the ratio between the per-class means is calculated for each attribute. Finally, the attributes are ordered by their mean ratio, from the smallest to the largest value; the attributes with the lowest values are the most relevant to the classification algorithms. The proposed method is evaluated and compared with three state-of-the-art methods on classification tasks using four classifiers and ten data sets. Computational experiments and comparisons against other feature selection methods show that MRFS is accurate and a promising alternative for classification tasks.
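
A minimal sketch of the mean-ratio idea for a binary problem; the min/max form of the ratio and the zero-division guard are assumptions based on the abstract, not the authors' exact formulation:

```python
import numpy as np

def mrfs_rank(X, y):
    """Rank features by the ratio of their per-class means (ascending).

    Assumes a two-class problem; smaller ratios (class means far apart)
    are treated as more relevant, following the ordering in the abstract.
    """
    classes = np.unique(y)
    # Per-class mean of every attribute: shape (n_classes, n_features)
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    eps = 1e-12  # guard against division by zero
    # Ratio of the smaller mean to the larger mean, per feature
    ratio = means.min(axis=0) / (means.max(axis=0) + eps)
    return np.argsort(ratio)  # smallest ratio first = most relevant

# Usage: keep the 5 most relevant attributes
X = np.abs(np.random.randn(100, 20))
y = np.random.randint(0, 2, size=100)
X_reduced = X[:, mrfs_rank(X, y)[:5]]
```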

2015 ◽  
Vol 8 (3) ◽  
pp. 1-20 ◽  
Author(s):  
Nabil M. Hewahi ◽  
Eyad A. Alashqar

Object recognition is a research area that aims to associate objects with categories or classes. The recognition of object-specific geospatial features, such as roads, buildings, and rivers, from high-resolution satellite imagery is a time-consuming and expensive step in the maintenance cycle of a Geographic Information System (GIS). Feature selection is the task of selecting a small subset of the original features that can achieve maximum classification accuracy and reduce data dimensionality. This subset brings important benefits: it reduces the computational complexity of learning algorithms, saves time, improves accuracy, and the selected features can be insightful for the people involved in the problem domain. This makes feature selection an indispensable step in classification. In this work, the authors propose a new approach that combines Genetic Algorithms (GA) with a Correlation Ranking Filter (CRF) in a wrapper to eliminate unimportant features and obtain a better feature set that yields better results with various classifiers such as Neural Networks (NN), K-Nearest Neighbor (KNN), and decision trees. The approach uses GA as an optimization algorithm to search the space of all possible subsets of object geospatial features for the purpose of recognition. GA is wrapped with three different classifiers, namely a neural network, k-nearest neighbor, and the J48 decision tree, as the subset-evaluating mechanism. The GA-ANN, GA-KNN, and GA-J48 methods are implemented in the WEKA software on a dataset of 38 features extracted from satellite images using the ENVI software. The proposed wrapper approach incorporates the Correlation Ranking Filter (CRF) for spatial features to remove unimportant ones. Results suggest that GA-based neural classifiers combined with CRF for spatial features are robust and effective in finding optimal subsets of features from large data sets.
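
The pipeline can be approximated with a small GA wrapper; the sketch below uses a KNN fitness and a simple correlation pre-filter in place of WEKA's CRF, with illustrative population size, crossover, and mutation rates:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=200, n_features=38,
                           n_informative=8, random_state=0)

# Correlation-ranking pre-filter: drop features weakly correlated with the class
corr = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
X = X[:, corr > np.median(corr)]       # illustrative threshold
n_feat = X.shape[1]

def fitness(mask):
    """Wrapper fitness: cross-validated KNN accuracy on the selected subset."""
    if mask.sum() == 0:
        return 0.0
    clf = KNeighborsClassifier(n_neighbors=5)
    return cross_val_score(clf, X[:, mask.astype(bool)], y, cv=3).mean()

# Simple GA: bit-string individuals encode feature subsets
pop = rng.integers(0, 2, size=(20, n_feat))
for gen in range(30):
    scores = np.array([fitness(ind) for ind in pop])
    parents = pop[np.argsort(scores)[::-1][:10]]       # truncation selection
    cut = rng.integers(1, n_feat, size=10)
    children = np.array([np.concatenate([parents[i % 10][:c],
                                         parents[(i + 1) % 10][c:]])
                         for i, c in enumerate(cut)])  # one-point crossover
    flip = rng.random(children.shape) < 0.02           # bit-flip mutation
    children[flip] ^= 1
    pop = np.vstack([parents, children])

best = pop[np.argmax([fitness(ind) for ind in pop])]
print("selected features:", np.flatnonzero(best))
```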


2018 ◽  
Vol 13 (3) ◽  
pp. 323-336 ◽  
Author(s):  
Naeimeh Elkhani ◽  
Ravie Chandren Muniyandi ◽  
Gexiang Zhang

Computational cost is a major challenge for almost all intelligent algorithms run on a CPU. Our proposed kernel P system multi-objective binary particle swarm optimization method for feature selection and classification must therefore run in acceptable time, which we aim to achieve by exploiting the potential of membrane computing for parallel processing and nondeterminism. Moreover, GPUs perform best on latency-tolerant, highly parallel, and independent tasks. In this study, to exploit the potential of a membrane-inspired model, particularly its parallelism, and to improve the time cost, the feature selection method is implemented on a GPU. Comparing the time cost of the proposed method on CPU, GPU, and multicore shows a significant improvement from the GPU implementation.
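
The membrane-computing and GPU specifics cannot be reconstructed from the abstract; the sketch below shows only the underlying binary PSO feature selection step on CPU, as a single-objective simplification with an assumed accuracy-based fitness:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X, y = load_breast_cancer(return_X_y=True)
n_particles, n_feat, n_iter = 15, X.shape[1], 20

def fitness(mask):
    """Cross-validated accuracy of a tree on the selected subset."""
    if mask.sum() == 0:
        return 0.0
    return cross_val_score(DecisionTreeClassifier(random_state=0),
                           X[:, mask.astype(bool)], y, cv=3).mean()

pos = rng.integers(0, 2, size=(n_particles, n_feat))   # bit positions
vel = rng.normal(size=(n_particles, n_feat))
pbest = pos.copy()
pbest_fit = np.array([fitness(p) for p in pos])
gbest = pbest[np.argmax(pbest_fit)]

for _ in range(n_iter):
    r1, r2 = rng.random((2, n_particles, n_feat))
    vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
    # Sigmoid transfer function turns velocities into bit probabilities
    pos = (rng.random(pos.shape) < 1 / (1 + np.exp(-vel))).astype(int)
    fit = np.array([fitness(p) for p in pos])
    improved = fit > pbest_fit
    pbest[improved], pbest_fit[improved] = pos[improved], fit[improved]
    gbest = pbest[np.argmax(pbest_fit)]

print("best subset size:", gbest.sum(), "accuracy:", pbest_fit.max())
```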


Mathematics ◽  
2020 ◽  
Vol 8 (11) ◽  
pp. 1971
Author(s):  
Agustin Pérez-Martín ◽  
Agustin Pérez-Torregrosa ◽  
Alejandro Rabasa ◽  
Marta Vaca

Measuring credit risk is essential for financial institutions because there is a high risk level associated with incorrect credit decisions. The Basel II agreement recommended the use of advanced credit scoring methods in order to improve the efficiency of capital allocation, and the latest Basel agreement (Basel III) has increased the requirements for risk-based reserves. Financial institutions currently hold exhaustive datasets about their operations; the resulting volume is a problem that can be addressed by applying a good feature selection method combined with big data techniques for data management. A comparative study of selection techniques is conducted in this work to find the selector that minimizes the mean squared error while requiring the least execution time.
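
A sketch of such a comparison, timing two illustrative selectors (not necessarily those studied in the paper) and scoring them by out-of-sample mean squared error:

```python
import time
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression, mutual_info_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=2000, n_features=100,
                       n_informative=10, noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

selectors = {
    "f_regression": SelectKBest(f_regression, k=10),
    "mutual_info": SelectKBest(mutual_info_regression, k=10),
}
for name, sel in selectors.items():
    t0 = time.perf_counter()
    X_tr_sel = sel.fit_transform(X_tr, y_tr)     # time only the selection step
    elapsed = time.perf_counter() - t0
    model = LinearRegression().fit(X_tr_sel, y_tr)
    mse = mean_squared_error(y_te, model.predict(sel.transform(X_te)))
    print(f"{name}: MSE={mse:.2f}, selection time={elapsed:.3f}s")
```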


2021 ◽  
Author(s):  
Rohit Ravindra Nikam ◽  
Rekha Shahapurkar

Data mining is a technique for extracting useful information from large data sets. Privacy-preserving data mining is about hiding sensitive information, or the identities behind it, from security breaches without losing data usability. Sensitive data contains confidential information about individuals, businesses, and governments, which must not be shared or published without consent. Preserving privacy in data mining has therefore become a critical research area. Various evaluation metrics, such as time efficiency, data utility, and degree of complexity or resistance to data mining techniques, are used to estimate how well a technique preserves privacy. Social media and smartphones produce tons of data every minute, and the voluminous data from these different sources can be processed and analyzed to support decision making; but such data analytics is vulnerable to privacy breaches. One data analytics framework is the recommendation system, commonly used by e-commerce sites such as Amazon and Flipkart to recommend items to customers based on their purchasing habits, which can lead to users being profiled. This paper presents various privacy-preservation techniques used by existing researchers, such as data anonymization, data randomization, generalization, and data permutation. We also analyze the gaps between the various processes and privacy-preservation methods and illustrate how to overcome such issues with new, innovative methods. Finally, our research summarizes the outcomes of the entire surveyed literature.
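
Three of the named techniques are easy to illustrate on a toy table; the columns, bin edges, and noise scale below are hypothetical:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({
    "age": [23, 37, 45, 52, 29],
    "zipcode": ["47677", "47602", "47678", "47905", "47909"],
    "salary": [48000, 62000, 71000, 90000, 55000],
})

# Generalization: coarsen quasi-identifiers into broader ranges
df["age"] = pd.cut(df["age"], bins=[0, 30, 40, 50, 120],
                   labels=["<=30", "31-40", "41-50", ">50"])
df["zipcode"] = df["zipcode"].str[:3] + "**"   # suppress trailing digits

# Randomization: add calibrated noise to a sensitive numeric attribute
df["salary"] = df["salary"] + rng.laplace(scale=2000, size=len(df)).round()

# Permutation: shuffle a sensitive column to break record linkage
df["salary"] = rng.permutation(df["salary"].to_numpy())

print(df)
```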


2016 ◽  
Vol 28 (4) ◽  
pp. 716-742 ◽  
Author(s):  
Saurabh Paul ◽  
Petros Drineas

We introduce single-set spectral sparsification as a deterministic sampling-based feature selection technique for regularized least-squares classification, which is the classification analog of ridge regression. The method is unsupervised and gives worst-case guarantees on the generalization power of the classification function after feature selection, relative to the classification function obtained using all features. We also introduce leverage-score sampling as an unsupervised randomized feature selection method for ridge regression. We provide risk bounds for both single-set spectral sparsification and leverage-score sampling on ridge regression in the fixed-design setting, showing that the risk in the sampled space is comparable to the risk in the full-feature space. We perform experiments on synthetic data sets and on real-world data sets (a subset of the TechTC-300 collection) to support our theory. Experimental results indicate that the proposed methods perform better than existing feature selection methods.
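
A sketch of the leverage-score sampling side under assumed parameters (SVD rank, number of kept features); the rescaling step follows standard sampling practice and is not taken verbatim from the paper, and sampling is done without replacement for simplicity:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n, d, r, k = 200, 500, 20, 60          # samples, features, SVD rank, kept features
X = rng.normal(size=(n, d))
y = X[:, :10] @ rng.normal(size=10) + 0.1 * rng.normal(size=n)

# Column leverage scores from the top-r right singular vectors of X
_, _, Vt = np.linalg.svd(X, full_matrices=False)
lev = (Vt[:r] ** 2).sum(axis=0)        # squared row norms of V_r, one per feature
p = lev / lev.sum()                    # sampling distribution over features

# Sample k features with probability proportional to leverage, rescale columns
idx = rng.choice(d, size=k, replace=False, p=p)
X_s = X[:, idx] / np.sqrt(k * p[idx])  # rescaling keeps the sketch roughly unbiased

risk_full = ((y - Ridge(alpha=1.0).fit(X, y).predict(X)) ** 2).mean()
risk_samp = ((y - Ridge(alpha=1.0).fit(X_s, y).predict(X_s)) ** 2).mean()
print(f"in-sample MSE, full: {risk_full:.4f}  sampled: {risk_samp:.4f}")
```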


Author(s):  
MINGXIA LIU ◽  
DAOQIANG ZHANG

As thousands of features are available in many pattern recognition and machine learning applications, feature selection remains an important task for finding the most compact representation of the original data. Although a number of feature selection methods have been developed in the literature, most of them focus on optimizing specific objective functions. In this paper, we first propose a general graph-preserving feature selection framework, in which the graphs to be preserved vary in their specific definitions, and show that a number of existing filter-type feature selection algorithms can be unified within this framework. Then, based on the proposed framework, a new filter-type feature selection method called sparsity score (SS) is proposed. This method aims to preserve the structure of a pre-defined l1 graph, which is proven robust to data noise. Here, a modified sparse representation based on an l1-norm minimization problem is used to determine the graph adjacency structure and the corresponding affinity weight matrix simultaneously. Furthermore, a variant of SS called supervised SS (SuSS) is also proposed, in which the l1 graph to be preserved is constructed using only data points from the same class. Experimental results on clustering and classification tasks over a series of benchmark data sets show that the proposed methods achieve better performance than conventional filter-type feature selection methods.
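
A sketch of the l1-graph construction via per-sample Lasso, paired with a Laplacian-score-style feature score; the exact SS formula is not given in the abstract, so the scoring step here is an assumed stand-in:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 15))
n = X.shape[0]

# l1 graph: represent each sample as a sparse combination of the others
W = np.zeros((n, n))
for i in range(n):
    others = np.delete(np.arange(n), i)
    # Columns of the design matrix are the other samples
    coef = Lasso(alpha=0.05, max_iter=5000).fit(X[others].T, X[i]).coef_
    W[i, others] = np.abs(coef)        # affinity = magnitude of sparse codes
W = (W + W.T) / 2                      # symmetrize

# Laplacian-score-style feature scoring on the l1 graph (assumed form):
# a low score means the feature varies little across strongly connected samples
D = np.diag(W.sum(axis=1))
L = D - W
scores = np.array([(f @ L @ f) / (f @ D @ f + 1e-12)
                   for f in (X - X.mean(axis=0)).T])
ranking = np.argsort(scores)           # smaller = better preserves the graph
print("top features:", ranking[:5])
```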


2021 ◽  
Author(s):  
Ping Zhang ◽  
Jiyao Sheng ◽  
Wanfu Gao ◽  
Juncheng Hu ◽  
Yonghao Li

Multi-label feature selection attracts considerable attention in multi-label learning. Information-theoretic multi-label feature selection methods aim to select the most informative features and reduce the uncertainty of the labels. Previous methods regard the amount of uncertainty in the labels as constant. In fact, as the classification information of the label set is captured by features, the remaining uncertainty of each label changes dynamically. In this paper, we categorize labels into two groups: one contains labels with little remaining uncertainty, meaning that most of the classification information for those labels has already been obtained by the selected features; the other contains labels with extensive remaining uncertainty, meaning that their classification information has so far been neglected. Feature selection should then favor new features that are highly relevant to the labels in the second group. Existing methods do not distinguish between the two label groups and ignore the dynamically changing amount of label information. To this end, a Relevancy Ratio is designed to capture the dynamically changing information of each label conditioned on the already-selected features. A Weighted Feature Relevancy is then defined to evaluate candidate features. Finally, a new multi-label Feature Selection method based on Weighted Feature Relevancy (WFRFS) is proposed. Experiments on thirteen real-world data sets show encouraging results for WFRFS in comparison with six multi-label feature selection methods.
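
One way to read the weighting idea is sketched below using sklearn's mutual information: each label's weight is its entropy minus the information already captured by the selected features. The greedy loop, the weight definition, and all parameters are interpretations, not the paper's exact WFRFS formulas:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 25))
Y = (rng.random((300, 4)) < 0.3).astype(int)    # 4 binary labels

def label_entropy(col):
    p = np.bincount(col, minlength=2) / len(col)
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

selected, candidates = [], list(range(X.shape[1]))
for _ in range(5):                               # greedily pick 5 features
    # Remaining uncertainty per label: entropy minus MI already captured
    weights = []
    for l in range(Y.shape[1]):
        captured = (mutual_info_classif(X[:, selected], Y[:, l],
                                        random_state=0).sum()
                    if selected else 0.0)
        weights.append(max(label_entropy(Y[:, l]) - captured, 0.0))
    weights = np.array(weights)
    # Weighted feature relevancy: MI with each label, weighted by its uncertainty
    rel = np.zeros(len(candidates))
    for l in range(Y.shape[1]):
        rel += weights[l] * mutual_info_classif(X[:, candidates], Y[:, l],
                                                random_state=0)
    best = candidates[int(np.argmax(rel))]
    selected.append(best)
    candidates.remove(best)

print("selected features:", selected)
```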


Author(s):  
Mingyu Fan ◽  
Xiaojun Chang ◽  
Xiaoqin Zhang ◽  
Di Wang ◽  
Liang Du

Recently, structured-sparsity-inducing feature selection has become a hot topic in machine learning and pattern recognition. Most sparsity-inducing feature selection methods are designed to rank all features by some criterion and then select the k top-ranked features, where k is an integer. However, the k top-ranked features are usually not the best subset of k features, so the result may be suboptimal. In this paper, we propose a novel supervised feature selection method that directly identifies the top k features. The new method is formulated as a classic regularized least-squares regression model with two groups of variables. The subproblem with respect to one group of variables turns out to be a 0-1 integer program, which is generally considered very hard to solve. To address this, we use an efficient optimization method that first replaces the discrete 0-1 constraints with two continuous constraints and then applies the alternating direction method of multipliers (ADMM) to the equivalent problem. The result is the best subset of k features under the proposed criterion, rather than the subset of k top-ranked features. Experiments on benchmark data sets show the effectiveness of the proposed method.
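
The paper's ADMM solver is not reproducible from the abstract; the sketch below substitutes a much-simplified projected-gradient scheme over the relaxed 0-1 indicator (box plus sum-to-k constraint), which only captures the spirit of optimizing the top-k subset directly:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k, lam = 150, 40, 5, 1.0
beta_true = np.zeros(d)
beta_true[:k] = rng.normal(size=k)               # true support: first k features
X = rng.normal(size=(n, d))
y = X @ beta_true + 0.1 * rng.normal(size=n)

def project_capped_simplex(v, k, iters=50):
    """Euclidean projection onto {w : 0 <= w <= 1, sum(w) = k} via bisection."""
    lo, hi = v.min() - 1.0, v.max()
    for _ in range(iters):
        t = (lo + hi) / 2
        if np.clip(v - t, 0.0, 1.0).sum() > k:
            lo = t
        else:
            hi = t
    return np.clip(v - (lo + hi) / 2, 0.0, 1.0)

w = np.full(d, k / d)                            # relaxed 0-1 indicator
eta = 0.01 / n                                   # illustrative step size
for _ in range(100):
    # Ridge step on the masked design: min ||y - X diag(w) b||^2 + lam ||b||^2
    Xw = X * w
    b = np.linalg.solve(Xw.T @ Xw + lam * np.eye(d), Xw.T @ y)
    # Gradient step on w, then project back onto the relaxed constraint set
    r = y - Xw @ b
    grad_w = -2 * b * (X.T @ r)
    w = project_capped_simplex(w - eta * grad_w, k)

topk = np.sort(np.argsort(w)[-k:])               # round the relaxed indicator
print("recovered support:", topk, "true support:", np.arange(k))
```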

