Hybridization of feature selection and feature weighting for high dimensional data

2018 ◽  
Vol 49 (4) ◽  
pp. 1580-1596 ◽  
Author(s):  
Dalwinder Singh ◽  
Birmohan Singh


2015 ◽
Vol 2015 ◽  
pp. 1-18 ◽  
Author(s):  
Thanh-Tung Nguyen ◽  
Joshua Zhexue Huang ◽  
Thuy Thi Nguyen

Random forests (RFs) have been widely used as a powerful classification method. However, with the randomization in both bagging samples and feature selection, the trees in the forest tend to select uninformative features for node splitting, which gives RFs poor accuracy on high-dimensional data. In addition, the feature selection process in RFs is biased in favor of multivalued features. Aiming at debiasing feature selection in RFs, we propose a new RF algorithm, called xRF, to select good features in learning RFs for high-dimensional data. We first remove the uninformative features using p-value assessment, and the subset of unbiased features is then selected based on some statistical measures. This feature subset is then partitioned into two subsets. A feature weighting sampling technique is used to sample features from these two subsets for building trees. This approach enables one to generate more accurate trees while reducing dimensionality and the amount of data needed for learning RFs. An extensive set of experiments has been conducted on 47 high-dimensional real-world datasets, including image datasets. The experimental results show that RFs with the proposed approach outperform existing random forests in both accuracy and AUC.
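
As a rough illustration of the screening-then-weighted-sampling idea described in this abstract, the Python sketch below removes features via a p-value test, splits the survivors into a stronger and a weaker subset, and samples node-splitting subspaces from both. The chi-square test, the half-and-half partition rule, and all function names are assumptions for illustration; the paper's actual statistical measures and partitioning scheme may differ.

```python
# Minimal sketch: p-value screening plus two-subset weighted subspace sampling.
# Assumes y holds integer-coded class labels 0..k-1.
import numpy as np
from scipy.stats import chi2_contingency

def screen_features(X, y, alpha=0.05, n_bins=5):
    """Keep features whose chi-square p-value against y is below alpha."""
    informative, pvalues = [], []
    n_classes = len(np.unique(y))
    for j in range(X.shape[1]):
        # Discretise the feature so a contingency table can be formed.
        edges = np.histogram_bin_edges(X[:, j], bins=n_bins)[1:-1]
        binned = np.digitize(X[:, j], edges)
        table = np.zeros((n_bins, n_classes))
        for b, c in zip(binned, y):
            table[b, c] += 1
        # Drop empty rows/columns so the test is well defined.
        table = table[table.sum(axis=1) > 0][:, table.sum(axis=0) > 0]
        if table.shape[0] < 2 or table.shape[1] < 2:
            continue
        _, p, _, _ = chi2_contingency(table)
        if p < alpha:
            informative.append(j)
            pvalues.append(p)
    return np.array(informative), np.array(pvalues)

def sample_subspace(informative, pvalues, size, strong_fraction=0.5, rng=None):
    """Sample a node-splitting subspace, mixing 'strong' and 'weak' features."""
    rng = rng or np.random.default_rng()
    order = np.argsort(pvalues)          # smaller p-value = stronger feature
    half = len(order) // 2
    strong = informative[order[:half]]
    weak = informative[order[half:]]
    n_strong = int(size * strong_fraction)
    return np.concatenate([
        rng.choice(strong, min(n_strong, len(strong)), replace=False),
        rng.choice(weak, min(size - n_strong, len(weak)), replace=False),
    ])
```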


2020 ◽  
Author(s):  
D O'Neill ◽  
Andrew Lensen ◽  
Bing Xue ◽  
Mengjie Zhang

Clustering, an important unsupervised learning task, is very challenging on high-dimensional data, since the generated clusters can become significantly less meaningful as the number of features increases. Feature selection and/or feature weighting can address this issue by selecting and weighting only informative features. These techniques have been extensively studied in supervised learning, e.g. classification, but they are difficult to use with clustering due to the lack of effective similarity/distance and validation measures. This paper utilises the powerful global search ability of particle swarm optimisation (PSO) on continuous problems to propose a PSO-based method for simultaneous feature selection and feature weighting for clustering on high-dimensional data, where a new validation measure is also proposed as the fitness function of the PSO method. Experiments on datasets with varying dimensionalities and different numbers of known clusters show that the proposed method can successfully improve the clustering performance of different types of clustering algorithms over the baseline of using the original feature set.
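
The following is a minimal sketch of how simultaneous feature selection and weighting can be encoded in a PSO, along the lines this abstract describes: each particle is a real-valued weight vector, weights below a threshold deselect their features, and the surviving weights rescale the data before clustering. The paper proposes its own validation measure as the fitness; the silhouette score is used here only as an assumed stand-in, and the swarm parameters are illustrative.

```python
# Minimal sketch: PSO over per-feature weights; low weights drop features.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def fitness(weights, X, k, threshold=0.3):
    mask = weights > threshold               # below-threshold weights deselect features
    if mask.sum() < 2:
        return -1.0                          # degenerate particle: penalise
    Xw = X[:, mask] * weights[mask]          # weight the surviving features
    labels = KMeans(n_clusters=k, n_init=5, random_state=0).fit_predict(Xw)
    if len(np.unique(labels)) < 2:
        return -1.0
    return silhouette_score(Xw, labels)      # stand-in for the paper's measure

def pso_feature_weighting(X, k, swarm=20, iters=50, w=0.7, c1=1.5, c2=1.5, rng=None):
    rng = rng or np.random.default_rng(0)
    d = X.shape[1]
    pos = rng.random((swarm, d))             # each particle: weight vector in [0,1]^d
    vel = np.zeros((swarm, d))
    pbest = pos.copy()
    pbest_fit = np.array([fitness(p, X, k) for p in pos])
    gbest = pbest[pbest_fit.argmax()].copy()
    for _ in range(iters):
        r1, r2 = rng.random((swarm, d)), rng.random((swarm, d))
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = np.clip(pos + vel, 0.0, 1.0)
        fit = np.array([fitness(p, X, k) for p in pos])
        improved = fit > pbest_fit
        pbest[improved], pbest_fit[improved] = pos[improved], fit[improved]
        gbest = pbest[pbest_fit.argmax()].copy()
    return gbest                             # weights above the threshold are selected
```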


2012 ◽  
Vol 8 (2) ◽  
pp. 44-63 ◽  
Author(s):  
Baoxun Xu ◽  
Joshua Zhexue Huang ◽  
Graham Williams ◽  
Qiang Wang ◽  
Yunming Ye

The selection of feature subspaces for growing decision trees is a key step in building random forest models. However, the common approach of randomly sampling a few features for each subspace is not suitable for high-dimensional data consisting of thousands of features, because such data often contain many features that are uninformative for classification, and random sampling frequently fails to include informative features in the selected subspaces. Consequently, the classification performance of the random forest model is significantly affected. In this paper, the authors propose an improved random forest method that uses a novel feature weighting method for subspace selection and thereby enhances classification performance on high-dimensional data. A series of experiments on 9 real-life high-dimensional datasets demonstrated that, using a subspace size of features determined from M, the total number of features in the dataset, our random forest model significantly outperforms existing random forest models.
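
A minimal sketch of weighted subspace sampling in this spirit follows: compute a weight per feature with a statistical measure, then draw each tree's subspace with probability proportional to the weights, so informative features are more likely to appear. Information gain over binned values is an assumed choice of measure here, and the helper names are illustrative rather than the paper's.

```python
# Minimal sketch: feature-weighted subspace sampling for tree growing.
import numpy as np

def entropy(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(x, y, n_bins=5):
    """Gain of a feature after simple equal-width binning (assumed measure)."""
    bins = np.digitize(x, np.histogram_bin_edges(x, bins=n_bins)[1:-1])
    h = entropy(y)
    for b in np.unique(bins):
        idx = bins == b
        h -= idx.mean() * entropy(y[idx])
    return max(h, 0.0)

def weighted_subspace(X, y, size, rng=None):
    """Draw a subspace with probability proportional to feature weight."""
    rng = rng or np.random.default_rng()
    weights = np.array([information_gain(X[:, j], y) for j in range(X.shape[1])])
    probs = weights / weights.sum() if weights.sum() > 0 else None
    return rng.choice(X.shape[1], size=size, replace=False, p=probs)
```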


2021 ◽  
Vol 26 (1) ◽  
pp. 67-77 ◽
Author(s):  
Siva Sankari Subbiah ◽  
Jayakumar Chinnappan

Nowadays, organizations collect huge volumes of data without knowing their usefulness. The rapid development of the Internet helps organizations capture data in many different formats through the Internet of Things (IoT), social media, and other disparate sources. The dimensionality of these datasets grows at an extraordinary rate, resulting in large-scale, high-dimensional datasets. The present paper reviews the opportunities and challenges of feature selection for processing high-dimensional data with reduced complexity and improved accuracy. In the modern big data world, feature selection plays a significant role in reducing the dimensionality and the overfitting of the learning process. Many feature selection methods have been proposed for obtaining more relevant features, especially from big datasets, that help to provide accurate learning results without degradation in performance. This paper discusses the importance of feature selection, basic feature selection approaches, centralized and distributed big data processing using Hadoop and Spark, and the challenges of feature selection, and it summarizes the related research work done by various researchers. As a result, big data analysis combined with feature selection improves the accuracy of learning.
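
As a concrete example of the basic filter-style feature selection this review discusses, the short scikit-learn sketch below scores each feature against the class labels and keeps the top k. The choice of mutual information as the scoring function and k=20 are illustrative, not recommendations from the review.

```python
# Minimal sketch: filter-method feature selection on a synthetic dataset.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = make_classification(n_samples=500, n_features=200,
                           n_informative=10, random_state=0)
selector = SelectKBest(score_func=mutual_info_classif, k=20)  # keep 20 top-scoring features
X_reduced = selector.fit_transform(X, y)
print(X_reduced.shape)  # (500, 20)
```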

