Optimization of feature selection method for high dimensional data using Fisher score and minimum spanning tree

Author(s):  
Bharat Singh ◽  
Jitendra Singh Sankhwar ◽  
Om Prakash Vyas


2003 ◽
Vol 2 (4) ◽  
pp. 232-246 ◽  
Author(s):  
Diansheng Guo

Unknown (and unexpected) multivariate patterns lurking in high-dimensional datasets are often very hard to find. This paper describes a human-centered exploration environment, which incorporates a coordinated suite of computational and visualization methods to explore high-dimensional data and uncover patterns in multivariate spaces. Specifically, it includes: (1) an interactive feature selection method for identifying potentially interesting, multidimensional subspaces within a high-dimensional data space; (2) an interactive, hierarchical clustering method for searching for multivariate clusters of arbitrary shape; and (3) a suite of coordinated visualization and computational components centered on the above two methods to facilitate human-led exploration. The implemented system is applied to a cancer dataset, and the analysis shows that it is efficient and effective for discovering unknown and unexpected multivariate patterns in high-dimensional data.
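Component (2) above searches for multivariate clusters of arbitrary shape. As a minimal illustration of how hierarchical clustering can recover such groups, the sketch below uses single-linkage merging (which chains nearby points rather than assuming convex clusters) via scipy; this is a generic example, not the paper's interactive method:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def single_linkage_clusters(X, n_clusters):
    """Hierarchical single-linkage clustering: repeatedly merges the two
    clusters with the smallest minimum pairwise distance, so elongated,
    non-convex clusters can emerge."""
    Z = linkage(X, method="single")  # bottom-up merge tree
    return fcluster(Z, t=n_clusters, criterion="maxclust")

# Two well-separated point clouds as a toy multivariate dataset.
rng = np.random.default_rng(0)
a = rng.normal(0.0, 0.1, size=(50, 2))
b = rng.normal(5.0, 0.1, size=(50, 2))
labels = single_linkage_clusters(np.vstack([a, b]), 2)
```

In an interactive setting such as the paper's, the resulting merge tree (dendrogram) is what the user would cut and inspect at different levels.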


Symmetry ◽  
2020 ◽  
Vol 12 (12) ◽  
pp. 1995 ◽
Author(s):  
Chunlei Shi ◽  
Jiacai Zhang ◽  
Xia Wu

Autism spectrum disorder (ASD) is a neurodevelopmental disorder originating in infancy and childhood that may cause language barriers and social difficulties. However, in the diagnosis of ASD, current machine learning methods still face many challenges in locating biomarkers. Here, we propose a novel feature selection method based on the minimum spanning tree (MST) to seek neuromarkers for ASD. First, we constructed an undirected graph whose nodes are the candidate features, with a weight calculation method that considers both feature redundancy and discriminant ability. Second, we used Prim's algorithm to construct the MST from the initial graph. Third, for each node in the MST, the sum of the weights of its incident edges was computed, and the nodes were sorted by this sum. The N features corresponding to the nodes with the N smallest sums were then selected as classification features. Finally, the support vector machine (SVM) algorithm was used to evaluate the discriminant performance of this feature selection method. Comparative experiments show that the proposed method improves ASD classification performance: the accuracy, sensitivity, and specificity were 86.7%, 87.5%, and 85.7%, respectively.
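The MST-based selection steps above can be sketched as follows. The Fisher-style score, the exact edge-weight formula (redundancy divided by the endpoints' combined discriminability), and the two-class assumption are illustrative guesses, not the paper's published definitions; scipy's MST routine stands in for the Prim construction (any MST algorithm yields the same tree when edge weights are distinct):

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def mst_feature_selection(X, y, n_select):
    """Rank features via an MST over a weighted feature graph (sketch).

    Assumes a binary labeling in y; the per-feature "discriminant
    ability" is a Fisher-style score, and the edge weight divides
    redundancy (|correlation|) by the endpoints' combined score, so
    light edges join non-redundant, discriminative features.
    """
    c0, c1 = np.unique(y)
    m0, m1 = X[y == c0].mean(axis=0), X[y == c1].mean(axis=0)
    v0, v1 = X[y == c0].var(axis=0), X[y == c1].var(axis=0)
    score = (m0 - m1) ** 2 / (v0 + v1 + 1e-12)          # discriminant ability
    redundancy = np.abs(np.corrcoef(X, rowvar=False))   # pairwise |corr|
    weight = redundancy / (score[:, None] + score[None, :] + 1e-12)
    np.fill_diagonal(weight, 0.0)                       # no self-edges
    # Build the MST (scipy treats exact zeros as absent edges; the
    # off-diagonal weights are positive here, so the graph is complete).
    mst = minimum_spanning_tree(weight).toarray()
    mst = mst + mst.T                                   # symmetrize
    node_sum = mst.sum(axis=1)                          # incident edge weights
    return np.argsort(node_sum)[:n_select]              # N smallest sums
```

The returned feature indices would then feed a classifier such as an SVM, as in the paper's evaluation.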


Author(s):  
Maria Irmina Prasetiyowati ◽  
Nur Ulfa Maulidevi ◽  
Kridanto Surendro

Random Forest is a supervised classification method based on Breiman's bagging (bootstrap aggregating) and random selection of features. Because features are assigned to the trees at random, the selected features are not necessarily informative, so it is worthwhile to perform feature selection before running Random Forest. The purpose of this feature selection is to find an optimal subset of features containing valuable information, in the hope of speeding up the Random Forest method, particularly on high-dimensional datasets such as the Parkinson, CNAE-9, and Urban Land Cover datasets. Feature selection was carried out using the Correlation-Based Feature Selection method with BestFirst search. Tests were run 30 times using 10-fold cross-validation and a split of the dataset into 70% training and 30% testing. Experiments on the Parkinson dataset were 0.27 and 0.28 seconds faster than the Random Forest method without feature selection; likewise, trials on the Urban Land Cover dataset were 0.04 and 0.03 seconds faster, and those on the CNAE-9 dataset 2.23 and 2.81 seconds faster. These experiments showed that Random Forest runs faster when feature selection is performed first. The accuracy also increased in the first two experiments, while only the CNAE-9 experiment yielded lower accuracy. The benefit of this research is that performing feature selection first, using the Correlation-Based Feature Selection method, can increase both the speed and the accuracy of the Random Forest method on high-dimensional data.
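The selection step of the pipeline above can be sketched as follows. Pearson correlation is used here as a simple stand-in for the symmetrical-uncertainty measure in Hall's CFS, and a plain greedy forward search replaces BestFirst; both are simplifications, not the paper's exact setup:

```python
import numpy as np

def cfs_greedy(X, y, max_features=None):
    """Greedy forward selection maximizing a CFS-style merit (sketch).

    Merit of a k-feature subset S:
        k * mean|corr(f, y)| / sqrt(k + k*(k-1) * mean|corr(f, f')|)
    i.e. reward feature-class correlation, penalize feature-feature
    redundancy. Stops when no candidate improves the merit, roughly
    mimicking BestFirst search stalling.
    """
    n = X.shape[1]
    max_features = max_features or n
    r_cf = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(n)])
    r_ff = np.abs(np.corrcoef(X, rowvar=False))
    selected, remaining, best_merit = [], list(range(n)), 0.0
    while remaining and len(selected) < max_features:
        merits = []
        for j in remaining:
            S = selected + [j]
            k = len(S)
            rcf = r_cf[S].mean()
            rff = (r_ff[np.ix_(S, S)][np.triu_indices(k, 1)].mean()
                   if k > 1 else 0.0)
            merits.append(k * rcf / np.sqrt(k + k * (k - 1) * rff))
        j_best = int(np.argmax(merits))
        if merits[j_best] <= best_merit:
            break  # no improvement: stop the forward search
        best_merit = merits[j_best]
        selected.append(remaining.pop(j_best))
    return selected
```

The surviving feature subset would then be passed to a Random Forest classifier (e.g. scikit-learn's RandomForestClassifier) for the timing and accuracy comparison described above.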

