Grouped feature screening for ultra-high dimensional data for the classification model

Author(s):  
Hanji He ◽  
Guangming Deng


Kursor ◽  
2020 ◽  
Vol 10 (4) ◽  
Author(s):  
Achmad Zain Nur ◽  
Hadi Suyono ◽  
Muhammad Aswin

Data mining is the process of extracting information from large, high-dimensional data in order to obtain knowledge that supports decision making. Problems often arise when processing high-dimensional data. This study addresses them by applying a hybrid genetic algorithm and particle swarm optimization (HGAPSO) method to improve the performance of the C5.0 decision tree classification model, so that decisions on classification data can be made quickly, precisely, and accurately. Three datasets from the University of California, Irvine (UCI) Machine Learning Repository were used: lymphography, vehicle, and wine. The HGAPSO algorithm combined with the C5.0 decision tree gave the best accuracy on the high-dimensional data. The lymphography and vehicle datasets reached accuracies of 83.78% and 71.54%, respectively. On the wine dataset, accuracy was 0.56% lower than with the conventional method because its dimensionality is smaller than that of the lymphography and vehicle datasets.


2017 ◽  
Vol 117 (10) ◽  
pp. 2325-2339
Author(s):  
Fuzan Chen ◽  
Harris Wu ◽  
Runliang Dou ◽  
Minqiang Li

Purpose: The purpose of this paper is to build a compact and accurate classifier for high-dimensional classification.

Design/methodology/approach: A classification approach based on class-dependent feature subspace (CFS) is proposed. CFS is a class-dependent integration of a support vector machine (SVM) classifier and associated discriminative features. For each class, the genetic algorithm (GA)-based approach evolves the best subset of discriminative features and the SVM classifier simultaneously. To guarantee convergence and efficiency, the authors customize the GA in terms of encoding strategy, fitness evaluation, and genetic operators.

Findings: Experimental studies demonstrated that the proposed CFS-based approach is superior to other state-of-the-art classification algorithms on UCI data sets in terms of both concise interpretation and predictive power for high-dimensional data.

Research limitations/implications: UCI data sets rather than real industrial data are used to evaluate the proposed approach. In addition, only single-label classification is addressed in the study.

Practical implications: The proposed method not only constructs an accurate classification model but also obtains a compact combination of discriminative features. It helps business decision makers gain a concise understanding of high-dimensional data.

Originality/value: The authors propose a compact and effective classification approach for high-dimensional data. Instead of using the same feature subset for all classes, the proposed CFS-based approach obtains the optimal subset of discriminative features and the SVM classifier for each class. The proposed approach enhances both interpretability and predictive power for high-dimensional data.


Author(s):  
YAN LI ◽  
EDWARD HUNG ◽  
KORRIS CHUNG ◽  
JOSHUA HUANG

In this paper, a new classification method (ADCC) for high-dimensional data is proposed. In this method, a decision cluster classification (DCC) model consists of a set of disjoint decision clusters, each labeled with a dominant class that determines the class of new objects falling in the cluster. A cluster tree is first generated from a training data set by recursively calling a variable weighting k-means algorithm. Then, the DCC model is extracted from the tree. Various tests, including the Anderson–Darling test, are used to determine the stopping condition for tree growth. A series of experiments on both synthetic and real data sets has been conducted. The results show that ADCC performed better in accuracy and scalability than existing methods such as k-NN, decision tree, and SVM. ADCC is particularly suitable for large, high-dimensional data with many classes.
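
The decision cluster idea described above can be illustrated with a minimal sketch. It is not the authors' ADCC implementation: it assumes plain (unweighted) 2-means in place of the variable weighting k-means algorithm, and a simple class-purity threshold in place of the Anderson–Darling stopping test. All function names, parameters, and the toy data are hypothetical.

```python
from collections import Counter

def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))

def two_means(points, iters=20):
    """Split points into two groups with plain 2-means (assumption:
    stands in for the variable weighting k-means of the abstract)."""
    # Deterministic start: the first point and the point farthest from it.
    c0 = points[0]
    c1 = max(points, key=lambda p: dist2(p, c0))
    assign = []
    for _ in range(iters):
        assign = [0 if dist2(p, c0) <= dist2(p, c1) else 1 for p in points]
        g0 = [p for p, a in zip(points, assign) if a == 0]
        g1 = [p for p, a in zip(points, assign) if a == 1]
        if not g0 or not g1:
            break
        c0, c1 = centroid(g0), centroid(g1)
    return assign

def build_leaves(points, labels, min_purity=0.9, min_size=4):
    """Grow the cluster tree recursively and return its leaves as
    (centroid, dominant_class) pairs, i.e. the decision clusters."""
    dominant, top = Counter(labels).most_common(1)[0]
    # Assumed stopping rule: the cluster is pure enough or too small.
    if top / len(labels) >= min_purity or len(points) < min_size:
        return [(centroid(points), dominant)]
    assign = two_means(points)
    if len(set(assign)) < 2:  # the split failed, keep the node as a leaf
        return [(centroid(points), dominant)]
    leaves = []
    for side in (0, 1):
        sub_p = [p for p, a in zip(points, assign) if a == side]
        sub_l = [l for l, a in zip(labels, assign) if a == side]
        leaves.extend(build_leaves(sub_p, sub_l, min_purity, min_size))
    return leaves

def predict(leaves, x):
    """Assign x the dominant class of the nearest decision cluster."""
    return min(leaves, key=lambda leaf: dist2(leaf[0], x))[1]

# Toy data (illustrative only): three well-separated classes in 2-D.
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3),
          (5.0, 5.0), (5.2, 4.9), (4.8, 5.1),
          (0.0, 5.0), (0.1, 5.2), (0.2, 4.9)]
labels = ["a", "a", "a", "b", "b", "b", "c", "c", "c"]

leaves = build_leaves(points, labels)
print(len(leaves), predict(leaves, (4.9, 5.0)))
```

On this toy data the tree stops at three pure leaves, one per class. A statistical stopping test such as the one named in the abstract would replace the `min_purity` check.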


2021 ◽  
Vol 2021 ◽  
pp. 1-9
Author(s):  
Tuba Koç

High-dimensional data sets occur frequently in several scientific areas, and special techniques are required to analyze them. In classification problems in particular, it is important to apply a suitable model. In this study, a novel approach is proposed to estimate a statistical model for high-dimensional data sets. The proposed method uses the analytic hierarchy process (AHP) and information criteria to determine the optimal principal components (PCs) for the classification model. The high-dimensional “colon” and “gravier” datasets were used in the evaluation. The results demonstrate that the proposed approach can be used successfully for modeling purposes.
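
As a rough illustration of the AHP step mentioned above, the sketch below derives priority weights from a pairwise comparison matrix using the row geometric mean approximation. The abstract does not say which criteria are compared or how the weights are combined with the information criteria, so the matrix and the function name `ahp_weights` are hypothetical.

```python
import math

def ahp_weights(pairwise):
    """Approximate AHP priority weights with the row geometric mean method.

    pairwise[i][j] states how much more important criterion i is than
    criterion j; the matrix is assumed to be reciprocal and positive.
    """
    n = len(pairwise)
    geo = [math.prod(row) ** (1.0 / n) for row in pairwise]
    total = sum(geo)
    return [g / total for g in geo]

# Hypothetical judgments for three criteria: criterion 0 is twice as
# important as criterion 1 and four times as important as criterion 2.
pairwise = [
    [1.0, 2.0, 4.0],
    [0.5, 1.0, 2.0],
    [0.25, 0.5, 1.0],
]

weights = ahp_weights(pairwise)
print([round(w, 3) for w in weights])  # approximately [0.571, 0.286, 0.143]
```

Weights of this kind could then be used to weigh several information criteria against each other when ranking candidate sets of PCs, although that combination step is an assumption here and is not specified in the abstract.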

