Feature Selection

Author(s):  
Damien François

In many applications, such as function approximation, pattern recognition, time series prediction, and data mining, one has to build a model relating some features describing the data to some response value. Often, the features that are relevant for building the model are not known in advance. Feature selection methods remove irrelevant and/or redundant features to keep only the subset of features that is most useful for building a prediction model. The resulting model is simpler and easier to interpret, reducing the risks of overfitting, non-convergence, etc. In contrast to other dimensionality reduction techniques, such as principal component analysis or more recent nonlinear projection techniques (Lee & Verleysen 2007), which build a new, smaller set of features, the features selected by feature selection methods preserve their initial meaning, potentially bringing extra information about the process being modeled (Guyon 2006). Recently, the advent of high-dimensional data has raised new challenges for feature selection methods, from both the algorithmic and the conceptual point of view (Liu & Motoda 2007). The problem of feature selection is exponential in nature, and many approximate algorithms are cubic with respect to the initial number of features, which may be intractable when the dimensionality of the data is large. Furthermore, high-dimensional data are often highly redundant, and two distinct subsets of features may have very similar predictive power, which can make it difficult to identify the best subset.
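To make the search problem concrete, the following is a minimal sketch of wrapper-style greedy forward selection, one common polynomial-time approximation to the exhaustive subset search described above. The dataset and the cross-validated k-NN scorer are illustrative assumptions, not choices made in the text.

```python
# Greedy forward feature selection: add the single feature that most
# improves cross-validated accuracy, stop when nothing helps.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
selected, remaining = [], list(range(X.shape[1]))
best_score = 0.0

while remaining:
    # Try adding each remaining feature; keep the best improvement.
    scores = [
        (cross_val_score(KNeighborsClassifier(),
                         X[:, selected + [j]], y, cv=5).mean(), j)
        for j in remaining
    ]
    score, j = max(scores)
    if score <= best_score:  # stop when no feature improves the score
        break
    best_score = score
    selected.append(j)
    remaining.remove(j)

print(f"selected {len(selected)} features, CV accuracy {best_score:.3f}")
```

Each pass over the candidates costs one model fit per remaining feature, so the overall cost is polynomial rather than exponential, which is exactly the trade-off the paragraph above describes.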

2019, Vol 8 (S3), pp. 66-71
Author(s):  
T. Sudha ◽  
P. Nagendra Kumar

Data mining is a major area of research, and clustering is one of its main functionalities. High dimensionality is one of the main issues in clustering, and dimensionality reduction can be used as a solution to this problem. The present work makes a comparative study of two dimensionality reduction techniques, t-distributed stochastic neighbour embedding (t-SNE) and probabilistic principal component analysis (PPCA), in the context of clustering. High-dimensional data were reduced to low-dimensional data using both techniques, and cluster analysis was performed on the high-dimensional data as well as on the low-dimensional data sets obtained through t-SNE and PPCA, with varying numbers of clusters. Mean squared error, time, and space were considered as parameters for comparison. The results show that the time taken to reduce the high-dimensional data is greater with PPCA than with t-SNE, whereas the storage space required by the data set reduced through PPCA is less than that required by the data set reduced through t-SNE.
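A minimal sketch of the kind of comparison described above is given below. Note the hedge: scikit-learn exposes no dedicated PPCA class, so ordinary PCA stands in for it here (scikit-learn's PCA does use the probabilistic PCA model for its likelihood computations); the dataset and cluster count are illustrative.

```python
# Reduce the same data with t-SNE and with PCA (stand-in for PPCA),
# then cluster the low-dimensional output and compare wall-clock time.
import time
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans

X, _ = load_digits(return_X_y=True)

for name, reducer in [("t-SNE", TSNE(n_components=2, random_state=0)),
                      ("PCA (PPCA stand-in)", PCA(n_components=2))]:
    t0 = time.perf_counter()
    X_low = reducer.fit_transform(X)
    elapsed = time.perf_counter() - t0
    labels = KMeans(n_clusters=10, n_init=10,
                    random_state=0).fit_predict(X_low)
    print(f"{name}: reduction took {elapsed:.2f}s, "
          f"reduced shape {X_low.shape}")
```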


2021, Vol 26 (1), pp. 67-77
Author(s):  
Siva Sankari Subbiah ◽  
Jayakumar Chinnappan

Nowadays, organizations collect huge volumes of data without knowing their usefulness. The rapid development of the Internet helps organizations capture data in many different formats through the Internet of Things (IoT), social media, and other disparate sources. The dimensionality of these datasets increases day by day at an extraordinary rate, resulting in large-scale datasets with high dimensionality. The present paper reviews the opportunities and challenges of feature selection for processing high-dimensional data with reduced complexity and improved accuracy. In the modern big-data world, feature selection plays a significant role in reducing the dimensionality and the overfitting of the learning process. Many feature selection methods have been proposed for obtaining the most relevant features, especially from big datasets, in order to provide accurate learning results without degradation in performance. This paper discusses the importance of feature selection, basic feature selection approaches, centralized and distributed big data processing using Hadoop and Spark, and the challenges of feature selection, and it summarizes the related work of various researchers. As a result, big data analysis combined with feature selection improves the accuracy of learning.
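As a small illustration of distributed feature selection of the kind surveyed above, here is a sketch using Spark MLlib's chi-squared filter. The toy in-memory data and column names are assumptions for the example; a real job would read from HDFS or another big-data store.

```python
# Distributed filter-style feature selection with PySpark:
# keep the features most dependent on the label (chi-squared test).
from pyspark.sql import SparkSession
from pyspark.ml.feature import ChiSqSelector
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("fs-demo").getOrCreate()

df = spark.createDataFrame(
    [(Vectors.dense([0.0, 1.0, 0.0, 7.0]), 1.0),
     (Vectors.dense([2.0, 1.0, 0.0, 4.0]), 0.0),
     (Vectors.dense([0.0, 1.0, 1.0, 9.0]), 1.0)],
    ["features", "label"],
)

selector = ChiSqSelector(numTopFeatures=2, featuresCol="features",
                         outputCol="selected", labelCol="label")
selector.fit(df).transform(df).show(truncate=False)
```

Because both the statistics and the transformation are computed on the distributed DataFrame, the same code scales from this toy example to cluster-sized data, which is the point of the Hadoop/Spark processing the paper reviews.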


Symmetry, 2020, Vol 12 (11), pp. 1782
Author(s):  
Supailin Pichai ◽  
Khamron Sunat ◽  
Sirapat Chiewchanwattana

This paper presents a method for feature selection in a high-dimensional classification context. The proposed method finds a candidate solution based on quality criteria using subset searching. In this study, the competitive swarm optimization (CSO) algorithm was implemented to solve feature selection problems in high-dimensional data. A new asymmetric chaotic function, whose histogram is right-skewed, was proposed and used to generate the population and search for a CSO solution. The proposed method is named the asymmetric chaotic competitive swarm optimization algorithm (ACCSO). Owing to the asymmetry of the proposed chaotic map, ACCSO prefers zeros to ones, so the solution is very compact and can achieve high classification accuracy with a minimal feature subset on high-dimensional datasets. The proposed method was evaluated on 12 datasets, with dimensions ranging from 4 to 10,304, and was compared to the original CSO algorithm and other metaheuristic algorithms. Experimental results show that the proposed method increases accuracy while reducing the number of selected features; compared to other wrapper-based optimization algorithms, it exhibits excellent performance.
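The sketch below illustrates the two ingredients the abstract names: chaotic-map population initialization biased toward zeros, and one pairwise-competition CSO update. The paper's actual asymmetric chaotic map is not reproduced here; a standard logistic map with a skewed binarization threshold stands in for it, and the fitness function is a placeholder (the real method would use cross-validated classification accuracy).

```python
# Chaotic initialization + one simplified CSO competition step for
# binary feature selection. The mean-swarm term of full CSO is omitted.
import numpy as np

rng = np.random.default_rng(0)
n_particles, n_features = 20, 50

def chaotic_sequence(length, x=0.7, r=4.0):
    """Logistic map x <- r*x*(1-x); placeholder for the paper's map."""
    out = np.empty(length)
    for i in range(length):
        x = r * x * (1.0 - x)
        out[i] = x
    return out

# Skewed threshold: a bit is 1 only when the chaotic value exceeds 0.8,
# so most bits are 0 and candidate feature subsets stay compact.
pop = (chaotic_sequence(n_particles * n_features)
       .reshape(n_particles, n_features) > 0.8).astype(float)
vel = np.zeros_like(pop)

def fitness(mask):
    return -mask.sum()  # placeholder: reward smaller subsets

# One CSO iteration: random pairs compete; losers learn from winners.
order = rng.permutation(n_particles)
for a, b in zip(order[::2], order[1::2]):
    w, l = (a, b) if fitness(pop[a]) >= fitness(pop[b]) else (b, a)
    r1, r2 = rng.random(n_features), rng.random(n_features)
    vel[l] = r1 * vel[l] + r2 * (pop[w] - pop[l])
    pop[l] = (pop[l] + vel[l] > 0.5).astype(float)

print("mean subset size after one update:", pop.sum(axis=1).mean())
```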


2017, Vol 56 (2), pp. 395-442
Author(s):  
V. Bolón-Canedo ◽  
D. Rego-Fernández ◽  
D. Peteiro-Barral ◽  
A. Alonso-Betanzos ◽  
B. Guijarro-Berdiñas ◽  
...  

2017, Vol 10 (13), pp. 355
Author(s):  
Reshma Remesh ◽  
Pattabiraman V.

Dimensionality reduction techniques are used to reduce the complexity of analysing high-dimensional data sets. The raw input data set may have many dimensions, and the analysis may waste time and lead to wrong predictions if unnecessary attributes are considered. Using dimensionality reduction techniques, one can reduce the dimensions of the input data towards accurate prediction at lower cost. In this paper, the different machine learning approaches used for dimensionality reduction, namely principal component analysis (PCA), singular value decomposition (SVD), linear discriminant analysis (LDA), kernel principal component analysis, and artificial neural networks, have been studied.
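For reference, the techniques surveyed above can all be run side by side with scikit-learn, as in the hedged sketch below. The dataset and target dimensionality are illustrative, and the ANN-based reduction (e.g., an autoencoder) is omitted because it requires a deep-learning library.

```python
# Apply the surveyed reduction techniques to the same data set.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA, TruncatedSVD, KernelPCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_digits(return_X_y=True)

reducers = {
    "PCA": PCA(n_components=2),
    "SVD": TruncatedSVD(n_components=2),
    "LDA": LinearDiscriminantAnalysis(n_components=2),
    "Kernel PCA": KernelPCA(n_components=2, kernel="rbf"),
}

for name, red in reducers.items():
    # LDA is supervised and needs the labels; the rest are unsupervised.
    X_low = (red.fit_transform(X, y) if name == "LDA"
             else red.fit_transform(X))
    print(f"{name}: {X.shape} -> {X_low.shape}")
```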


2018, Vol 8 (2), pp. 1-24
Author(s):  
Abdullah Saeed Ghareb ◽  
Azuraliza Abu Bakar ◽  
Qasem A. Al-Radaideh ◽  
Abdul Razak Hamdan

The filtering of large amounts of data is an important process in data mining tasks, particularly for the categorization of unstructured, high-dimensional data. A feature selection process is therefore desired to reduce the space of high-dimensional data to a small subset of relevant dimensions that represents the best features for text categorization. In this article, three enhanced filter feature selection methods, the Category Relevant Feature Measure, the Modified Category Discriminated Measure, and Odd Ratio2, are proposed. These methods combine relevant information about features both within and between categories. The effectiveness of the proposed methods with Naïve Bayes and associative classification is evaluated using traditional measures of text categorization, namely macro-averaged precision, recall, and F-measure. Experiments are conducted on three Arabic text datasets used for text categorization. The experimental results show that the proposed methods achieve results that are better than, or comparable to, those of 12 well-known traditional methods.
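The evaluation pipeline described above can be sketched as below. The paper's own measures (CRFM, MCDM, OR2) are not reproduced; a chi-squared filter stands in for them, and an English corpus stands in for the Arabic datasets used in the experiments.

```python
# Filter feature selection feeding a Naive Bayes text classifier,
# scored with macro-averaged precision/recall/F-measure.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

data = fetch_20newsgroups(subset="all",
                          categories=["sci.med", "sci.space"])
X_tr, X_te, y_tr, y_te = train_test_split(data.data, data.target,
                                          random_state=0)

clf = make_pipeline(TfidfVectorizer(),
                    SelectKBest(chi2, k=500),  # keep the 500 best terms
                    MultinomialNB())
clf.fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te), digits=3))
```

The `classification_report` output includes the macro averages of precision, recall, and F-measure, which are the metrics the article uses for comparison.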


2015, Vol 2015, pp. 1-18
Author(s):  
Thanh-Tung Nguyen ◽  
Joshua Zhexue Huang ◽  
Thuy Thi Nguyen

Random forests (RFs) have been widely used as a powerful classification method. However, with the randomization in both bagging samples and feature selection, the trees in the forest tend to select uninformative features for node splitting, which gives RFs poor accuracy when working with high-dimensional data. In addition, the feature selection process in RFs is biased in favour of multivalued features. Aiming at debiasing feature selection in RFs, we propose a new RF algorithm, called xRF, to select good features when learning RFs for high-dimensional data. We first remove the uninformative features using p-value assessment, and then select the subset of unbiased features based on some statistical measures. This feature subset is then partitioned into two subsets, and a feature-weighting sampling technique is used to sample features from these two subsets for building trees. This approach generates more accurate trees while reducing the dimensionality and the amount of data needed for learning RFs. An extensive set of experiments has been conducted on 47 high-dimensional real-world datasets, including image datasets. The experimental results show that RFs with the proposed approach outperform existing random forests in both accuracy and AUC.
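The sketch below approximates only the first step of the approach described above: screening out uninformative features by univariate p-values before growing the forest. The paper's weighted feature sampling inside tree induction is not available in scikit-learn's random forest and is not reproduced here; the synthetic data and the 0.05 threshold are assumptions.

```python
# p-value screening of features before random forest training.
from sklearn.datasets import make_classification
from sklearn.feature_selection import f_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# High-dimensional toy data: 1000 features, only 20 informative.
X, y = make_classification(n_samples=300, n_features=1000,
                           n_informative=20, random_state=0)

_, pvals = f_classif(X, y)      # univariate ANOVA F-test per feature
keep = pvals < 0.05             # drop features with weak evidence
rf = RandomForestClassifier(n_estimators=200, random_state=0)

print("features kept:", keep.sum())
print("full   :", cross_val_score(rf, X, y, cv=5).mean().round(3))
print("screened:", cross_val_score(rf, X[:, keep], y, cv=5).mean().round(3))
```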


2013, Vol 347-350, pp. 2344-2348
Author(s):  
Lin Cheng Jiang ◽  
Wen Tang Tan ◽  
Zhen Wen Wang ◽  
Feng Jing Yin ◽  
Bin Ge ◽  
...  

Feature selection has become a focus of research in applications with high-dimensional data. Nonnegative matrix factorization (NMF) is a good method for dimensionality reduction, but, being a feature extraction method, it cannot select an optimal feature subset. In this paper, a two-step strategy based on an improved NMF is proposed. The first step obtains a basis for each category in the dataset by NMF; added constraints guarantee that these bases are sparse and largely distinct from each other, which contributes to classification. An auxiliary function is used to prove that the algorithm converges. In the second step, the classic ReliefF algorithm is used to weight each feature according to all the basis vectors and to choose the optimal feature subset. The experimental results show that the proposed method selects a representative and relevant feature subset that is effective in improving the performance of the classifier.
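A much-simplified sketch of the two-step idea is given below: (1) learn per-class NMF bases, then (2) weight each feature by how strongly the class bases disagree on it and keep the top-scoring features. The paper's sparsity constraints and its ReliefF weighting (which scikit-learn does not provide) are replaced here by a crude spread-of-loadings surrogate, so this shows the structure of the method, not its actual scoring.

```python
# Two-step NMF-based feature scoring (simplified surrogate).
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import NMF

X, y = load_digits(return_X_y=True)

# Step 1: a small NMF basis per class, averaged into one vector.
bases = []
for c in np.unique(y):
    H = NMF(n_components=3, init="nndsvda", max_iter=500,
            random_state=0).fit(X[y == c]).components_
    bases.append(H.mean(axis=0))
B = np.vstack(bases)                 # shape: (n_classes, n_features)

# Step 2: score features by the spread of their loadings across the
# class bases; features whose loadings differ between classes are
# more likely to discriminate (stand-in for the paper's ReliefF step).
scores = B.max(axis=0) - B.min(axis=0)
top = np.argsort(scores)[::-1][:20]
print("top 20 feature indices:", top)
```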

