Reduction in High-Dimensional Data by using HDRF with Random Forest Classifier

2012 ◽  
Vol 8 (2) ◽  
pp. 44-63 ◽  
Author(s):  
Baoxun Xu ◽  
Joshua Zhexue Huang ◽  
Graham Williams ◽  
Qiang Wang ◽  
Yunming Ye

The selection of feature subspaces for growing decision trees is a key step in building random forest models. However, the common approach of randomly sampling a few features for each subspace is not suitable for high-dimensional data consisting of thousands of features, because such data often contain many features that are uninformative for classification, and random sampling frequently leaves the informative features out of the selected subspaces. Consequently, the classification performance of the random forest model suffers significantly. In this paper, the authors propose an improved random forest method that uses a novel feature weighting method for subspace selection and thereby enhances classification performance on high-dimensional data. A series of experiments on 9 real-life high-dimensional datasets demonstrated that, with the proposed subspace size (a function of M, the total number of features in the dataset), our random forest model significantly outperforms existing random forest models.
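The feature-weighting idea can be sketched in a few lines: score every feature against the class label, turn the scores into sampling probabilities, and draw each tree's subspace with those weights instead of uniformly. The sketch below is a minimal illustration, not the authors' implementation; mutual information stands in for their weighting function, and the square-root default subspace size is an assumption.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.tree import DecisionTreeClassifier

def weighted_subspace_forest(X, y, n_trees=100, subspace_size=None, seed=0):
    """Grow a forest whose per-tree feature subspaces are sampled with
    probability proportional to each feature's informativeness.
    Mutual information with the label is a stand-in for the paper's
    feature-weighting scheme."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    if subspace_size is None:
        subspace_size = max(1, int(np.sqrt(n_features)))

    # Feature weights: informative features get higher sampling probability.
    weights = mutual_info_classif(X, y, random_state=seed)
    total = weights.sum()
    probs = weights / total if total > 0 else np.full(n_features, 1.0 / n_features)

    trees = []
    for _ in range(n_trees):
        # Bootstrap sample of the rows, weighted sample of the columns.
        rows = rng.integers(0, n_samples, n_samples)
        cols = rng.choice(n_features, size=subspace_size, replace=False, p=probs)
        tree = DecisionTreeClassifier(random_state=seed).fit(X[rows][:, cols], y[rows])
        trees.append((tree, cols))
    return trees

def forest_predict(trees, X):
    """Majority vote over the trees, each restricted to its own subspace.
    Labels are assumed to be encoded as non-negative integers."""
    votes = np.stack([tree.predict(X[:, cols]) for tree, cols in trees])
    return np.apply_along_axis(lambda v: np.bincount(v.astype(int)).argmax(), 0, votes)
```

The only change from a standard random forest is the `p=probs` argument in the column draw: uninformative features can still appear in a subspace, but informative ones are no longer crowded out by chance.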


2021 ◽  
Vol 2021 ◽  
pp. 1-12 ◽
Author(s):  
Hui Du ◽  
Yiyang Ni ◽  
Zhihe Wang

The clustering-by-fast-search-and-find-of-density-peaks algorithm (FDP) performs poorly on high-dimensional data. The problem arises because the algorithm skips feature selection: all features are evaluated with the same weight, without distinguishing informative from uninformative ones, so the final clustering falls short of expectations. To address this problem, we propose a new method. We build a random forest to compute an importance value for every feature of the high-dimensional data and take the mean of these values; features whose importance is less than 10% of the mean are removed, and the remaining important features form a new dataset. An improved t-SNE is then applied for dimensionality reduction, yielding better performance. The method thus uses a t-SNE improved by the random-forest idea to reduce the dimensionality of the original data and combines it with an improved FDP to form the new clustering method. Experiments show that the NMI of the improved algorithm proposed in this paper is 23% higher than that of the original FDP algorithm and 9.1% higher than that of other clustering algorithms (K-means, DBSCAN, and spectral clustering). Experiments on UCI datasets and wireless sensor networks verify its good performance on high-dimensional datasets.
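A rough sketch of the preprocessing pipeline described above (random-forest importances, the 10%-of-mean cutoff, then t-SNE) follows. It assumes class labels are available for the importance estimate, uses plain t-SNE in place of the paper's improved variant, and picks an illustrative forest size.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.manifold import TSNE

def rf_tsne_reduce(X, y, n_components=2, seed=0):
    """Drop weakly informative features, then embed the survivors with t-SNE.

    Features whose random-forest importance falls below 10% of the mean
    importance are removed, as the paper describes; the forest size and
    t-SNE settings here are illustrative assumptions."""
    forest = RandomForestClassifier(n_estimators=200, random_state=seed).fit(X, y)
    importances = forest.feature_importances_
    keep = importances >= 0.1 * importances.mean()

    # Keep only the important columns, then reduce with plain t-SNE
    # (standing in for the paper's improved variant).
    embedding = TSNE(n_components=n_components, random_state=seed).fit_transform(X[:, keep])
    return embedding, keep
```

The FDP clustering step would then run on the returned embedding rather than on the raw features, where density peaks are easier to separate.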


Author(s):  
Maria Irmina Prasetiyowati ◽  
Nur Ulfa Maulidevi ◽  
Kridanto Surendro

Random Forest is a supervised classification method based on Breiman's bagging (bootstrap aggregating) and random selection of features. Because features are assigned to the trees at random, the selected features are not necessarily informative, so it is worthwhile to perform feature selection before running Random Forest. The purpose of this feature selection is to find an optimal subset of features containing valuable information, in the hope of speeding up the Random Forest method, especially on high-dimensional datasets such as Parkinson, CNAE-9, and Urban Land Cover. Feature selection was done with the Correlation-Based Feature Selection method using BestFirst search. Tests were carried out 30 times using 10-fold cross-validation and a 70% training / 30% testing split of the dataset. The experiments on the Parkinson dataset ran 0.27 and 0.28 seconds faster than the Random Forest method without feature selection; likewise, the trials on the Urban Land Cover dataset were 0.04 and 0.03 seconds faster, while for the CNAE-9 dataset the difference was 2.23 and 2.81 seconds. These experiments show that Random Forest runs faster when feature selection is performed first. Accuracy also increased in the first two experiments; only the CNAE-9 experiment yielded lower accuracy. The benefit of this research is that performing a feature selection step with the Correlation-Based Feature Selection method first can increase both the speed and the accuracy of the Random Forest method on high-dimensional data.
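The selection step the authors describe comes from Weka's CFS with BestFirst search. A rough Python rendering of the CFS merit and a greedy forward search (a simplification of BestFirst, which also backtracks) is sketched below; Pearson correlation stands in for the symmetrical uncertainty that Weka's CFS actually uses, and labels are assumed numeric.

```python
import numpy as np

def cfs_merit(X, y, subset):
    """CFS merit: k * r_cf / sqrt(k + k*(k-1) * r_ff), where r_cf is the mean
    feature-class correlation and r_ff the mean feature-feature correlation.
    Pearson correlation is used here as a simple stand-in."""
    k = len(subset)
    r_cf = np.mean([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in subset])
    if k == 1:
        return r_cf
    pairs = [(a, b) for i, a in enumerate(subset) for b in subset[i + 1:]]
    r_ff = np.mean([abs(np.corrcoef(X[:, a], X[:, b])[0, 1]) for a, b in pairs])
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)

def greedy_cfs(X, y, max_features=20):
    """Greedy forward search: keep adding the feature that most improves
    the CFS merit, stopping when no candidate helps."""
    selected, best = [], -np.inf
    candidates = list(range(X.shape[1]))
    while candidates and len(selected) < max_features:
        score, j = max((cfs_merit(X, y, selected + [j]), j) for j in candidates)
        if score <= best:
            break  # no remaining feature improves the merit
        best = score
        selected.append(j)
        candidates.remove(j)
    return selected
```

The selected column indices would then feed an ordinary random forest; the reported speedups come from cheaper split searches once the uninformative columns are gone.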

