Preordonance correlation filter for feature selection in the high dimensional classification problem

Abstract. Background: Modelling large-dimensional data often relies on variable reduction methods in the pre-processing and post-processing stages. However, such reduction usually discards information and lowers the accuracy of the model. Objectives: The aim of this paper is to address the high-dimensional classification problem of recognizing students' entrepreneurial intentions using machine learning methods. Methods/Approach: Four methods were tested on the same dataset to compare their classification accuracy: artificial neural networks, CART classification trees, support vector machines, and k-nearest neighbour. The performance of each method was compared on ten subsamples in a 10-fold cross-validation procedure, and the sensitivity and specificity of each model were computed. Results: The artificial neural network model based on a multilayer perceptron yielded a higher classification rate than the models produced by the other methods. The pairwise t-test showed a statistically significant difference between the artificial neural network and the k-nearest neighbour model, while the differences among the other methods were not statistically significant. Conclusions: The tested machine learning methods are able to learn fast and achieve high classification accuracy. However, further improvement may be achieved by testing additional methodological refinements of machine learning methods.
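The evaluation protocol described in this abstract (10-fold cross-validation, sensitivity and specificity per model, and a paired t-test between models) can be illustrated with a minimal sketch. The function names below are hypothetical and not taken from the paper; the fold split is a plain contiguous split, and the paired t statistic is the textbook formula applied to per-fold scores. The paper's actual tooling and settings are not stated here.

```python
import math
import statistics

def k_fold_indices(n_samples, k=10):
    """Split indices 0..n_samples-1 into k contiguous, near-equal test folds."""
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)  # spread the remainder over the first folds
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels, with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec

def paired_t_statistic(scores_a, scores_b):
    """Paired t statistic over per-fold scores of two models (df = n_folds - 1)."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)  # sample standard deviation of the differences
    return mean_d / (sd_d / math.sqrt(len(diffs)))
```

In use, each model would be trained on nine folds and scored on the held-out fold, and the ten per-fold accuracies of two models would be passed to `paired_t_statistic`.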

An Asymmetric Chaotic Competitive Swarm Optimization Algorithm for Feature Selection in High-Dimensional Data

Symmetry, 2020, Vol 12 (11), pp. 1782. DOI: 10.3390/sym12111782

Author(s): Supailin Pichai, Khamron Sunat, Sirapat Chiewchanwattana

Keyword(s): Feature Selection, Optimization Algorithm, Chaotic Map, High Dimensional Data, Quality Criteria, High Dimensional, Feature Subset, Swarm Optimization, Candidate Solution, Dimensional Classification

This paper presents a method for feature selection in a high-dimensional classification context. The proposed method finds a candidate solution based on quality criteria using subset searching. In this study, the competitive swarm optimization (CSO) algorithm was implemented to solve feature selection problems in high-dimensional data. A new asymmetric chaotic function was proposed and used to generate the population and search for a CSO solution. The histogram of the values it generates is right-skewed. The proposed method is named the asymmetric chaotic competitive swarm optimization algorithm (ACCSO). Because of the asymmetric property of the proposed chaotic map, ACCSO favours zeros over ones. Therefore, the solution is very compact and can achieve high classification accuracy with a minimal feature subset for high-dimensional datasets. The proposed method was evaluated on 12 datasets, with dimensions ranging from 4 to 10,304. ACCSO was compared to the original CSO algorithm and other metaheuristic algorithms. Experimental results show that the proposed method increases accuracy while reducing the number of selected features. Compared to different optimization algorithms with other wrappers, the proposed method exhibits excellent performance.
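To make the mechanism concrete, the following is a minimal sketch of one CSO generation applied to feature selection, under stated assumptions. The update rule is the standard pairwise competition of CSO, in which the loser of each random pair learns from the winner and from the swarm mean. The abstract does not give the paper's asymmetric chaotic map, so `chaotic_sequence` is only a stand-in (a squared logistic map, which pushes values toward zero); `to_mask`, the 0.5 threshold, and the parameter `phi` are likewise illustrative choices.

```python
import random

def chaotic_sequence(n, x0=0.37):
    """Stand-in generator (NOT the paper's map): logistic map, squared to skew values toward 0."""
    values, x = [], x0
    for _ in range(n):
        x = 4.0 * x * (1.0 - x)   # logistic map with r = 4, stays in [0, 1]
        values.append(x * x)      # squaring shifts mass toward zero
    return values

def to_mask(position, threshold=0.5):
    """Map a continuous position to a 0/1 feature mask (assumed thresholding rule)."""
    return [1 if p > threshold else 0 for p in position]

def cso_step(positions, velocities, fitness, phi=0.1, rng=random):
    """One CSO generation: random pairs compete; each loser learns from its winner.

    `fitness` is minimised. Winners pass to the next generation unchanged.
    """
    n, dim = len(positions), len(positions[0])
    mean = [sum(p[d] for p in positions) / n for d in range(dim)]
    order = list(range(n))
    rng.shuffle(order)
    for a, b in zip(order[0::2], order[1::2]):
        if fitness(positions[a]) <= fitness(positions[b]):
            win, lose = a, b
        else:
            win, lose = b, a
        for d in range(dim):
            r1, r2, r3 = rng.random(), rng.random(), rng.random()
            velocities[lose][d] = (r1 * velocities[lose][d]
                                   + r2 * (positions[win][d] - positions[lose][d])
                                   + phi * r3 * (mean[d] - positions[lose][d]))
            # Clamp to [0, 1] so the mask threshold stays meaningful.
            positions[lose][d] = min(1.0, max(0.0, positions[lose][d] + velocities[lose][d]))
```

In a wrapper setting, `fitness` would typically combine a classifier's error on the masked features with a penalty on the number of selected features; an initial population biased toward zero then starts the search from small subsets.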

A self-adaptive level-based learning artificial bee colony algorithm for feature selection on high-dimensional classification

2021. DOI: 10.21203/rs.3.rs-1123323/v1

Author(s): Jing Wang, Yuanzi Zhang, Minglin Hong, Haiyang He, Shiguo Huang

Keyword(s): Feature Selection, Artificial Bee Colony Algorithm, Artificial Bee Colony, Adaptive Method, High Dimensional, Feature Subset, Feature Selection Problem, Bee Colony, Dimensional Classification, Self Adaptive

Abstract. Feature selection is an important data preprocessing method in data mining and machine learning, yet it faces the challenge of the "curse of dimensionality" when dealing with high-dimensional data. In this paper, a self-adaptive level-based learning artificial bee colony (SLLABC) algorithm is proposed for high-dimensional feature selection problems. The SLLABC algorithm includes three new mechanisms: (1) A novel level-based learning mechanism is introduced to accelerate the convergence of the basic artificial bee colony algorithm. It divides the population into several levels; the individuals on each level learn from the individuals on higher levels, and the individuals on the highest level learn from each other. (2) A self-adaptive method is proposed to balance exploration and exploitation. It takes the diversity of the population into account to determine the number of levels: the lower the diversity, the fewer levels the population is divided into. (3) A new update mechanism is proposed to reduce the number of selected features. In this mechanism, if the error rate of an offspring is higher than or equal to that of its parent and the offspring selects more features, the offspring is discarded and the parent is retained; otherwise, the offspring replaces its parent. Further, we discuss and analyze the contribution of these novelties to the diversity of the population and to classification performance. Finally, the results, compared with 8 state-of-the-art algorithms on 12 high-dimensional datasets, confirm the competitive performance of the proposed SLLABC in both classification accuracy and feature subset size.
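The update mechanism in item (3) reduces to a single comparison, sketched below. This follows a literal reading of the sentence in the abstract: the offspring is rejected only when it is no more accurate than its parent and also selects more features. The function name and the 0/1 mask representation are assumptions made for illustration, not the authors' code.

```python
def keep_offspring(parent_error, parent_mask, child_error, child_mask):
    """Return True if the offspring should replace its parent.

    Masks are 0/1 lists marking the selected features. The offspring is
    discarded only when its error rate is higher than or equal to the
    parent's AND it selects more features; in every other case it replaces
    the parent.
    """
    child_not_better = child_error >= parent_error
    child_uses_more = sum(child_mask) > sum(parent_mask)
    return not (child_not_better and child_uses_more)

# Equal error, but the offspring needs three features instead of two: discard it.
print(keep_offspring(0.10, [1, 1, 0, 0], 0.10, [1, 1, 1, 0]))  # False
```

Under this rule a larger subset survives only if it strictly lowers the error rate, which is what steers the population toward compact feature subsets.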

Genetic programming for feature construction and selection in classification on high-dimensional data

2021. DOI: 10.26686/wgtn.14312465

Author(s): Binh Tran, Bing Xue, Mengjie Zhang

Keyword(s): Feature Selection, Genetic Programming, High Dimensional Data, Classification Performance, High Dimensional, Feature Construction, Classification Problems, Memetic Computing, Dimensional Classification, High Level

Classification on high-dimensional data with thousands to tens of thousands of dimensions is a challenging task due to the high dimensionality and the quality of the feature set. The problem can be addressed by using feature selection to choose only informative features or feature construction to create new high-level features. Genetic programming (GP) using a tree-based representation can be used for both feature construction and implicit feature selection. This work presents a comprehensive study of the use of GP for feature construction and selection on high-dimensional classification problems. Different combinations of the constructed and/or selected features are tested and compared on seven high-dimensional gene expression problems, and different classification algorithms are used to evaluate their performance. The results show that the constructed and/or selected feature sets can significantly reduce the dimensionality and maintain or even increase the classification accuracy in most cases. The cases where overfitting occurred are analysed via the distribution of features. Further analysis is also performed to show why the constructed feature can achieve promising classification performance. This is a post-peer-review, pre-copyedit version of an article published in 'Memetic Computing'. The final authenticated version is available online at: https://doi.org/10.1007/s12293-015-0173-y. The following terms of use apply: https://www.springer.com/gp/open-access/publication-policies/aam-terms-of-use.
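The idea that a single GP tree performs both feature construction and implicit feature selection can be illustrated with a small sketch. The nested-tuple tree encoding, the operator set, and the function names are assumptions chosen for brevity; the abstract does not specify the study's GP configuration, and the evolutionary search itself is omitted.

```python
import operator

# Assumed function set; a real GP system may use a different one.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def evaluate(tree, sample):
    """Compute the constructed feature for one sample.

    A tree is either an int (index of an original feature) or a tuple
    (op, left_subtree, right_subtree).
    """
    if isinstance(tree, int):
        return sample[tree]
    op, left, right = tree
    return OPS[op](evaluate(left, sample), evaluate(right, sample))

def used_features(tree):
    """Original features appearing at the leaves: the implicitly selected subset."""
    if isinstance(tree, int):
        return {tree}
    _, left, right = tree
    return used_features(left) | used_features(right)

# Hypothetical evolved tree: (f0 * f3) - f7
tree = ("-", ("*", 0, 3), 7)
sample = [2.0, 0.0, 0.0, 5.0, 0.0, 0.0, 0.0, 1.0]
print(evaluate(tree, sample))       # 9.0
print(sorted(used_features(tree)))  # [0, 3, 7]
```

Here the tree's output is one constructed high-level feature, while the leaf set `{0, 3, 7}` is the subset of original features the tree has selected out of eight.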
