Hybridization of feature selection and feature weighting for high dimensional data

2018 ◽  
Vol 49 (4) ◽  
pp. 1580-1596 ◽  
Author(s):  
Dalwinder Singh ◽  
Birmohan Singh


2015 ◽
Vol 2015 ◽  
pp. 1-18 ◽  
Author(s):  
Thanh-Tung Nguyen ◽  
Joshua Zhexue Huang ◽  
Thuy Thi Nguyen

Random forests (RFs) have been widely used as a powerful classification method. However, with the randomization in both bagging samples and feature selection, the trees in the forest tend to select uninformative features for node splitting, which gives RFs poor accuracy on high-dimensional data. In addition, the feature selection process in RFs is biased in favor of multivalued features. Aiming at debiasing feature selection in RFs, we propose a new RF algorithm, called xRF, to select good features in learning RFs for high-dimensional data. We first remove the uninformative features using p-value assessment, and the subset of unbiased features is then selected based on some statistical measures. This feature subset is then partitioned into two subsets. A feature weighting sampling technique is used to sample features from these two subsets for building trees. This approach enables one to generate more accurate trees while reducing dimensionality and the amount of data needed for learning RFs. An extensive set of experiments has been conducted on 47 high-dimensional real-world datasets, including image datasets. The experimental results show that RFs with the proposed approach outperform existing random forests in both accuracy and AUC.
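
As a rough illustration of the screening-then-weighted-sampling idea described in this abstract, the Python sketch below removes features via a p-value test, splits the survivors into a stronger and a weaker subset, and samples node-splitting subspaces from both. The chi-square test, the half-and-half partition rule, and all function names are assumptions for illustration; the paper's actual statistical measures and partitioning scheme may differ.

```python
# Minimal sketch: p-value screening plus two-subset weighted subspace sampling.
# Assumes y holds integer-coded class labels 0..k-1.
import numpy as np
from scipy.stats import chi2_contingency

def screen_features(X, y, alpha=0.05, n_bins=5):
    """Keep features whose chi-square p-value against y is below alpha."""
    informative, pvalues = [], []
    n_classes = len(np.unique(y))
    for j in range(X.shape[1]):
        # Discretise the feature so a contingency table can be formed.
        edges = np.histogram_bin_edges(X[:, j], bins=n_bins)[1:-1]
        binned = np.digitize(X[:, j], edges)
        table = np.zeros((n_bins, n_classes))
        for b, c in zip(binned, y):
            table[b, c] += 1
        # Drop empty rows/columns so the test is well defined.
        table = table[table.sum(axis=1) > 0][:, table.sum(axis=0) > 0]
        if table.shape[0] < 2 or table.shape[1] < 2:
            continue
        _, p, _, _ = chi2_contingency(table)
        if p < alpha:
            informative.append(j)
            pvalues.append(p)
    return np.array(informative), np.array(pvalues)

def sample_subspace(informative, pvalues, size, strong_fraction=0.5, rng=None):
    """Sample a node-splitting subspace, mixing 'strong' and 'weak' features."""
    rng = rng or np.random.default_rng()
    order = np.argsort(pvalues)          # smaller p-value = stronger feature
    half = len(order) // 2
    strong = informative[order[:half]]
    weak = informative[order[half:]]
    n_strong = int(size * strong_fraction)
    return np.concatenate([
        rng.choice(strong, min(n_strong, len(strong)), replace=False),
        rng.choice(weak, min(size - n_strong, len(weak)), replace=False),
    ])
```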


2020 ◽  
Author(s):  
D O'Neill ◽  
Andrew Lensen ◽  
Bing Xue ◽  
Mengjie Zhang

Clustering, an important unsupervised learning task, is very challenging on high-dimensional data, since the generated clusters can become significantly less meaningful as the number of features increases. Feature selection and/or feature weighting can address this issue by selecting and weighting only informative features. These techniques have been extensively studied in supervised learning, e.g. classification, but they are difficult to use with clustering due to the lack of effective similarity/distance and validation measures. This paper utilises the powerful global search ability of particle swarm optimisation (PSO) on continuous problems to propose a PSO-based method for simultaneous feature selection and feature weighting for clustering on high-dimensional data, where a new validation measure is also proposed as the fitness function of the PSO method. Experiments on datasets with varying dimensionalities and different numbers of known clusters show that the proposed method can successfully improve the clustering performance of different types of clustering algorithms over the baseline of using the original feature set.
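
The following is a minimal sketch of how simultaneous feature selection and weighting can be encoded in a PSO, along the lines this abstract describes: each particle is a real-valued weight vector, weights below a threshold deselect their features, and the surviving weights rescale the data before clustering. The paper proposes its own validation measure as the fitness; the silhouette score is used here only as an assumed stand-in, and the swarm parameters are illustrative.

```python
# Minimal sketch: PSO over per-feature weights; low weights drop features.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def fitness(weights, X, k, threshold=0.3):
    mask = weights > threshold               # below-threshold weights deselect features
    if mask.sum() < 2:
        return -1.0                          # degenerate particle: penalise
    Xw = X[:, mask] * weights[mask]          # weight the surviving features
    labels = KMeans(n_clusters=k, n_init=5, random_state=0).fit_predict(Xw)
    if len(np.unique(labels)) < 2:
        return -1.0
    return silhouette_score(Xw, labels)      # stand-in for the paper's measure

def pso_feature_weighting(X, k, swarm=20, iters=50, w=0.7, c1=1.5, c2=1.5, rng=None):
    rng = rng or np.random.default_rng(0)
    d = X.shape[1]
    pos = rng.random((swarm, d))             # each particle: weight vector in [0,1]^d
    vel = np.zeros((swarm, d))
    pbest = pos.copy()
    pbest_fit = np.array([fitness(p, X, k) for p in pos])
    gbest = pbest[pbest_fit.argmax()].copy()
    for _ in range(iters):
        r1, r2 = rng.random((swarm, d)), rng.random((swarm, d))
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = np.clip(pos + vel, 0.0, 1.0)
        fit = np.array([fitness(p, X, k) for p in pos])
        improved = fit > pbest_fit
        pbest[improved], pbest_fit[improved] = pos[improved], fit[improved]
        gbest = pbest[pbest_fit.argmax()].copy()
    return gbest                             # weights above the threshold are selected
```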


2012 ◽  
Vol 8 (2) ◽  
pp. 44-63 ◽  
Author(s):  
Baoxun Xu ◽  
Joshua Zhexue Huang ◽  
Graham Williams ◽  
Qiang Wang ◽  
Yunming Ye

The selection of feature subspaces for growing decision trees is a key step in building random forest models. However, the common approach of randomly sampling a few features for each subspace is not suitable for high-dimensional data consisting of thousands of features, because such data often contain many features that are uninformative for classification, and random sampling frequently fails to include informative features in the selected subspaces. Consequently, the classification performance of the random forest model is significantly affected. In this paper, the authors propose an improved random forest method that uses a novel feature weighting method for subspace selection and thereby enhances classification performance on high-dimensional data. A series of experiments on 9 real-life high-dimensional datasets demonstrated that, using a subspace size of features determined from M, the total number of features in the dataset, our random forest model significantly outperforms existing random forest models.
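
A minimal sketch of weighted subspace sampling in this spirit follows: compute a weight per feature with a statistical measure, then draw each tree's subspace with probability proportional to the weights, so informative features are more likely to appear. Information gain over binned values is an assumed choice of measure here, and the helper names are illustrative rather than the paper's.

```python
# Minimal sketch: feature-weighted subspace sampling for tree growing.
import numpy as np

def entropy(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(x, y, n_bins=5):
    """Gain of a feature after simple equal-width binning (assumed measure)."""
    bins = np.digitize(x, np.histogram_bin_edges(x, bins=n_bins)[1:-1])
    h = entropy(y)
    for b in np.unique(bins):
        idx = bins == b
        h -= idx.mean() * entropy(y[idx])
    return max(h, 0.0)

def weighted_subspace(X, y, size, rng=None):
    """Draw a subspace with probability proportional to feature weight."""
    rng = rng or np.random.default_rng()
    weights = np.array([information_gain(X[:, j], y) for j in range(X.shape[1])])
    probs = weights / weights.sum() if weights.sum() > 0 else None
    return rng.choice(X.shape[1], size=size, replace=False, p=probs)
```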


2021 ◽  
Vol 26 (1) ◽  
pp. 67-77 ◽
Author(s):  
Siva Sankari Subbiah ◽  
Jayakumar Chinnappan

Nowadays, organizations collect huge volumes of data without knowing their usefulness. The rapid development of the Internet helps organizations capture data in many different formats through the Internet of Things (IoT), social media, and other disparate sources. The dimensionality of these datasets grows at an extraordinary rate, resulting in large-scale, high-dimensional datasets. The present paper reviews the opportunities and challenges of feature selection for processing high-dimensional data with reduced complexity and improved accuracy. In the modern big data world, feature selection plays a significant role in reducing the dimensionality and the overfitting of the learning process. Many feature selection methods have been proposed for obtaining more relevant features, especially from big datasets, that help to provide accurate learning results without degradation in performance. This paper discusses the importance of feature selection, basic feature selection approaches, centralized and distributed big data processing using Hadoop and Spark, and the challenges of feature selection, and it summarizes the related research work done by various researchers. As a result, big data analysis combined with feature selection improves the accuracy of learning.
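
As a concrete example of the basic filter-style feature selection this review discusses, the short scikit-learn sketch below scores each feature against the class labels and keeps the top k. The choice of mutual information as the scoring function and k=20 are illustrative, not recommendations from the review.

```python
# Minimal sketch: filter-method feature selection on a synthetic dataset.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = make_classification(n_samples=500, n_features=200,
                           n_informative=10, random_state=0)
selector = SelectKBest(score_func=mutual_info_classif, k=20)  # keep 20 top-scoring features
X_reduced = selector.fit_transform(X, y)
print(X_reduced.shape)  # (500, 20)
```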

