A Multicriteria Approach to Find Predictive and Sparse Models with Stable Feature Selection for High-Dimensional Data

Finding a good predictive model for a high-dimensional data set can be challenging. For genetic data, it is important not only to find a model with high predictive accuracy, but also that this model uses only a few features and that the selection of these features is stable. This is because, in bioinformatics, models are used not only for prediction but also for drawing biological conclusions, which makes the interpretability and reliability of the model crucial. We suggest using three target criteria when fitting a predictive model to a high-dimensional data set: the classification accuracy, the stability of the feature selection, and the number of chosen features. As it is unclear which measure is best for evaluating stability, we first compare a variety of stability measures. We conclude that the Pearson correlation has the best theoretical and empirical properties. We also find that, for the stability assessment behaviour, it is most important that a measure contains a correction for chance or for large numbers of chosen features. We then analyse Pareto fronts and conclude that it is possible to find models with a stable selection of few features without losing much predictive accuracy.
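As an illustration of the stability criterion, the following minimal sketch scores stability as the mean pairwise Pearson correlation between binary selection indicator vectors obtained from repeated runs. The function names and the toy inputs are assumptions made for this example and are not taken from the paper, whose exact definition may differ in details.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # Happens when a run selects no features or all features.
        raise ValueError("correlation is undefined for a constant vector")
    return cov / sqrt(var_x * var_y)


def selection_stability(selected_sets, n_features):
    """Mean pairwise Pearson correlation of selection indicator vectors.

    selected_sets: one set of chosen feature indices per run (hypothetical input).
    n_features:    total number of candidate features.
    """
    if len(selected_sets) < 2:
        raise ValueError("need at least two runs to compare")
    indicators = [
        [1 if j in chosen else 0 for j in range(n_features)]
        for chosen in selected_sets
    ]
    pairs = list(combinations(indicators, 2))
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)


# Toy usage: three runs over ten candidate features.
runs = [{0, 1, 2}, {0, 1, 3}, {0, 1, 2}]
print(selection_stability(runs, 10))
```

Because the indicator vectors span all candidate features, the correlation accounts for how many features were available as well as for the overlap itself, which is the kind of correction for chance the abstract highlights.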

A Novel Granularity Optimal Feature Selection based on Multi-Variant Clustering for High Dimensional Data

Turkish Journal of Computer and Mathematics Education (TURCOMAT) ◽

10.17762/turcomat.v12i3.2031 ◽

2021 ◽

Vol 12 (3) ◽

pp. 5051-5062

Author(s):

Srinivas Kolli et al.

Keyword(s):

Feature Selection ◽

High Dimensional Data ◽

Classification Performance ◽

High Dimensional ◽

Second Phase ◽

Data Sets ◽

Aggressive Approach ◽

Related Data ◽

Optimal Feature ◽

Selection Of

Clustering is especially difficult in multi- or high-dimensional data because a subset of features must be selected from all the features present in categorical data sources. Feature subset selection is an aggressive approach to reducing feature dimensionality in data mining and pattern identification. Its main aims are to select an optimal feature subset and to decrease redundancy. To deal with redundant and irrelevant features in high-dimensional sample data, this paper describes a feature selection calculation based on data granularity. We propose a Novel Granular Feature Multi-variant Clustering based Genetic Algorithm (NGFMCGA) model and evaluate its performance. The model consists of two phases. In the first phase, a graph-theoretic grouping procedure divides the features into different clusters. In the second phase, the most strongly representative related feature is selected from each cluster by matching against the feature subset. Because the selected features come from different clusters, they are independent, and the proposed clustering approach has a high probability of producing independent and useful features. Optimal feature subset selection improves the accuracy of clustering and feature classification. Applied to public data sets, the proposed approach achieves better accuracy than traditional supervised evolutionary approaches.
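The two-phase idea can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the NGFMCGA model: features are linked when their absolute correlation reaches a hypothetical threshold, connected components serve as clusters, and the representative of each cluster is the feature most correlated with a target. The genetic algorithm component is omitted, and all names are invented for the example.

```python
from math import sqrt


def abs_corr(x, y):
    """Absolute Pearson correlation; 0.0 if either sequence is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return abs(cov) / sqrt(var_x * var_y)


def cluster_features(columns, threshold):
    """Phase 1 (assumed form): connected components of the correlation graph."""
    n = len(columns)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if abs_corr(columns[i], columns[j]) >= threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())


def pick_representatives(columns, target, threshold=0.9):
    """Phase 2 (assumed form): keep the most target-relevant feature per cluster."""
    return sorted(
        max(cluster, key=lambda i: abs_corr(columns[i], target))
        for cluster in cluster_features(columns, threshold)
    )


# Toy usage: features 0 and 1 are perfectly correlated, feature 2 is not.
columns = [[1, 2, 3, 4], [2, 4, 6, 8], [1, 0, 1, 0]]
target = [1, 2, 3, 5]
print(pick_representatives(columns, target))
```

Selecting one feature per cluster keeps redundant features from appearing together, which is the intuition behind the independence claim in the abstract.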

Beta Distribution-Based Cross-Entropy for Feature Selection

Entropy ◽

10.3390/e21080769 ◽

2019 ◽

Vol 21 (8) ◽

pp. 769 ◽

Cited By ~ 1

Author(s):

Weixing Dai ◽

Dianjing Guo

Keyword(s):

Feature Selection ◽

Probability Density ◽

Beta Distribution ◽

Predictive Accuracy ◽

High Dimensional Data ◽

Area Under The Curve ◽

Cross Entropy ◽

High Dimensional ◽

Generalization Ability ◽

Conventional Methods

Analysis of high-dimensional data is a challenge in machine learning and data mining. Feature selection plays an important role in dealing with high-dimensional data for improvement of predictive accuracy, as well as better interpretation of the data. Frequently used evaluation functions for feature selection include resampling methods such as cross-validation, which show an advantage in predictive accuracy. However, these conventional methods are not only computationally expensive, but also tend to be over-optimistic. We propose a novel cross-entropy which is based on beta distribution for feature selection. In beta distribution-based cross-entropy (BetaDCE) for feature selection, the probability density is estimated by beta distribution and the cross-entropy is computed by the expected value of beta distribution, so that the generalization ability can be estimated more precisely than conventional methods where the probability density is learnt from data. Analysis of the generalization ability of BetaDCE revealed that it was a trade-off between bias and variance. The robustness of BetaDCE was demonstrated by experiments on three types of data. In the exclusive or-like (XOR-like) dataset, the false discovery rate of BetaDCE was significantly smaller than that of other methods. For the leukemia dataset, the area under the curve (AUC) of BetaDCE on the test set was 0.93 with only four selected features, which indicated that BetaDCE not only detected the irrelevant and redundant features precisely, but also more accurately predicted the class labels with a smaller number of features than the original method, whose AUC was 0.83 with 50 features. In the metabonomic dataset, the overall AUC of prediction with features selected by BetaDCE was significantly larger than that by the original reported method. Therefore, BetaDCE can be used as a general and efficient framework for feature selection.
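The abstract does not give the BetaDCE formula, so the following is only a rough sketch of one possible reading: class probabilities for each value of a discrete feature are taken as the mean of a Beta posterior (uniform Beta(1, 1) prior assumed), and the cross-entropy of the labels is computed from those smoothed estimates. The function names, the prior, and the restriction to binary labels and discrete features are all assumptions for illustration.

```python
from collections import defaultdict
from math import log


def beta_posterior_mean(positives, total, alpha=1.0, beta=1.0):
    """Mean of Beta(alpha + positives, beta + total - positives)."""
    return (positives + alpha) / (total + alpha + beta)


def feature_cross_entropy(feature_values, labels):
    """Average cross-entropy of binary labels given one discrete feature.

    The probability of label 1 for each feature value is the Beta posterior
    mean, which never reaches 0 or 1, so the logarithms stay finite.
    """
    counts = defaultdict(lambda: [0, 0])  # value -> [positives, total]
    for value, label in zip(feature_values, labels):
        counts[value][0] += label
        counts[value][1] += 1

    loss = 0.0
    for value, label in zip(feature_values, labels):
        p = beta_posterior_mean(*counts[value])
        loss -= log(p) if label == 1 else log(1.0 - p)
    return loss / len(labels)


# Toy usage: a feature that tracks the label versus one that does not.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
informative = [0, 0, 0, 0, 1, 1, 1, 1]
irrelevant = [0, 1, 0, 1, 0, 1, 0, 1]
print(feature_cross_entropy(informative, labels))
print(feature_cross_entropy(irrelevant, labels))
```

Under this reading, a lower score suggests a more useful feature. The smoothing from the prior pulls estimates based on few samples toward 0.5, which loosely mirrors the bias and variance trade-off mentioned in the abstract.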

Feature Selection of High Dimensional Data by Adaptive Potential Particle Swarm Optimization

2019 IEEE Congress on Evolutionary Computation (CEC) ◽

10.1109/cec.2019.8790366 ◽

2019 ◽

Cited By ~ 3

Author(s):

Xingyue Huang ◽

Yizhou Chi ◽

Yu Zhou

Keyword(s):

Feature Selection ◽

Particle Swarm Optimization ◽

High Dimensional Data ◽

Particle Swarm ◽

High Dimensional ◽

Adaptive Potential ◽

Swarm Optimization ◽

Selection Of

A comparative study of various feature selection techniques in high-dimensional data set to improve classification accuracy

2015 International Conference on Computer Communication and Informatics (ICCCI) ◽

10.1109/iccci.2015.7218098 ◽

2015 ◽

Cited By ~ 3

Author(s):

Kandarp P. Shroff ◽

Hardik H. Maheta

Keyword(s):

Feature Selection ◽

Comparative Study ◽

Classification Accuracy ◽

High Dimensional Data ◽

High Dimensional ◽

Data Set ◽

Feature Selection Techniques

DEPSOSVM: variant of differential evolution based on PSO for image and text data classification

International Journal of Intelligent Computing and Cybernetics ◽

10.1108/ijicc-01-2020-0004 ◽

2020 ◽

Vol 13 (2) ◽

pp. 223-238

Author(s):

Abhishek Dixit ◽

Ashish Mani ◽

Rohit Bansal

Keyword(s):

Feature Selection ◽

Differential Evolution ◽

Classification Accuracy ◽

High Dimensional Data ◽

High Dimensional ◽

SVM Classifier ◽

Text Data ◽

Data Set ◽

Content Type ◽

Mutation Strategy

Purpose: Feature selection is an important data pre-processing step, especially for high-dimensional data sets. A model trained on a high-dimensional data set performs worse and yields poor classification accuracy. Feature selection should therefore be applied to the data set before training the model, to improve performance and classification accuracy.

Design/methodology/approach: A novel optimization approach that hybridizes binary particle swarm optimization (BPSO) and differential evolution (DE) for fine-tuning an SVM classifier is presented. The implemented classifier is named DEPSOSVM.

Findings: The approach is evaluated on 20 UCI benchmark text classification data sets. The performance of the proposed technique is also evaluated on a UCI benchmark image data set of cancer images. The results show that the proposed DEPSOSVM technique achieves a significant improvement in performance over other feature selection algorithms in the literature. The proposed technique also shows better classification accuracy.

Originality/value: The proposed approach differs from previous work, all of which uses the DE/(rand/1) mutation strategy, whereas this study uses DE/(rand/2) and updates the mutation strategy with BPSO. Another difference lies in the crossover approach, which uses a novel method of comparing the best particle with a sigmoid function. The core contribution of this paper is to hybridize DE with BPSO, combined with an SVM classifier (DEPSOSVM), to handle feature selection problems.
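For readers unfamiliar with the mutation strategy named above, here is a minimal sketch of the standard DE/rand/2 mutation together with the sigmoid transfer function commonly used in binary PSO to turn real-valued vectors into feature masks. How DEPSOSVM actually combines the two is not specified in the abstract, so the function names, the scale factor, and the way the pieces are chained here are assumptions.

```python
import math
import random


def de_rand_2_mutation(population, target_index, scale, rng):
    """Standard DE/rand/2: v = x_r1 + F*(x_r2 - x_r3) + F*(x_r4 - x_r5).

    Needs at least six individuals so that five distinct indices other
    than the target can be drawn.
    """
    candidates = [i for i in range(len(population)) if i != target_index]
    r1, r2, r3, r4, r5 = rng.sample(candidates, 5)
    dims = len(population[0])
    return [
        population[r1][d]
        + scale * (population[r2][d] - population[r3][d])
        + scale * (population[r4][d] - population[r5][d])
        for d in range(dims)
    ]


def sigmoid(v):
    """Numerically stable logistic function."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)


def binarize(vector, rng):
    """BPSO-style transfer: bit d is 1 with probability sigmoid(vector[d])."""
    return [1 if rng.random() < sigmoid(v) else 0 for v in vector]


# Toy usage: eight individuals with five real-valued genes each.
rng = random.Random(42)
population = [[rng.uniform(-1.0, 1.0) for _ in range(5)] for _ in range(8)]
mutant = de_rand_2_mutation(population, target_index=0, scale=0.5, rng=rng)
feature_mask = binarize(mutant, rng)  # 1 means the feature is kept
print(feature_mask)
```

Compared with DE/rand/1, the second difference vector adds more perturbation per step, which tends to increase exploration at the cost of slower convergence.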

Feature Selection using Genetic Algorithm for Clustering high Dimensional Data

International Journal of Engineering & Technology ◽

10.14419/ijet.v7i2.11.11001 ◽

2018 ◽

Vol 7 (2.11) ◽

pp. 27 ◽

Cited By ~ 1

Author(s):

Kahkashan Kouser ◽

Amrita Priyam

Keyword(s):

Genetic Algorithm ◽

Feature Selection ◽

Clustering Algorithm ◽

High Dimensional Data ◽

Feature Space ◽

High Dimensional ◽

Feature Subset ◽

Data Set ◽

Optimal Feature Subset ◽

Optimal Feature

One of the open problems of modern data mining is clustering high-dimensional data. This paper proposes a new technique called GA-HDClustering, which works in two steps. First, a GA-based feature selection algorithm determines the optimal feature subset, which consists of the important features of the entire data set. Next, a K-means algorithm is applied to the optimal feature subset to find the clusters. For comparison, the traditional K-means algorithm is applied to the full-dimensional feature space. The result of GA-HDClustering is then compared with that of the traditional clustering algorithm using validity metrics such as sum of squared error (SSE), within-group average distance (WGAD), between-group distance (BGD), and the Davies-Bouldin index (DBI). GA-HDClustering uses a genetic algorithm to search for an effective feature subspace in a large feature space made up of all dimensions of the data set. Experiments on a standard data set showed that GA-HDClustering is superior to the traditional clustering algorithm.
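Two of the validity metrics named above can be computed for a given clustering restricted to a feature subset, as in the following minimal sketch. It assumes cluster labels are already available (for example from K-means) and uses the usual textbook definitions of SSE and the Davies-Bouldin index. The function names and toy data are illustrative and not taken from the paper.

```python
from math import sqrt


def centroids(points, labels, features):
    """Per-cluster mean of the selected feature columns."""
    sums, counts = {}, {}
    for point, k in zip(points, labels):
        acc = sums.setdefault(k, [0.0] * len(features))
        for d, f in enumerate(features):
            acc[d] += point[f]
        counts[k] = counts.get(k, 0) + 1
    return {k: [s / counts[k] for s in acc] for k, acc in sums.items()}, counts


def sq_dist(point, centre, features):
    return sum((point[f] - c) ** 2 for f, c in zip(features, centre))


def sse(points, labels, features):
    """Sum of squared distances from each point to its cluster centroid."""
    cents, _ = centroids(points, labels, features)
    return sum(sq_dist(p, cents[k], features) for p, k in zip(points, labels))


def davies_bouldin(points, labels, features):
    """Mean over clusters of the worst (S_i + S_j) / d(c_i, c_j) ratio.

    Assumes at least two clusters with distinct centroids.
    """
    cents, counts = centroids(points, labels, features)
    scatter = {k: 0.0 for k in cents}
    for p, k in zip(points, labels):
        scatter[k] += sqrt(sq_dist(p, cents[k], features)) / counts[k]
    total = 0.0
    for i in cents:
        worst = 0.0
        for j in cents:
            if i != j:
                sep = sqrt(sum((a - b) ** 2 for a, b in zip(cents[i], cents[j])))
                worst = max(worst, (scatter[i] + scatter[j]) / sep)
        total += worst
    return total / len(cents)


# Toy usage: feature 0 separates the clusters, feature 1 only adds spread.
points = [[0, 0], [0, 2], [10, 0], [10, 2]]
labels = [0, 0, 1, 1]
for subset in ([0, 1], [0]):
    print(subset, sse(points, labels, subset), davies_bouldin(points, labels, subset))
```

Lower values are better for both metrics. Note that SSE naturally shrinks as features are removed, so comparing subsets of different sizes on SSE alone can be misleading, whereas the Davies-Bouldin index is a ratio and is less affected by this.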
