An Ensemble Classification Method for High-Dimensional Data Using Neighborhood Rough Set

AbstractFeature selection based on the fuzzy neighborhood rough set model (FNRS) is highly popular in data mining. However, the dependent function of FNRS only considers the information present in the lower approximation of the decision while ignoring the information present in the upper approximation of the decision. This construction method may lead to the loss of some information. To solve this problem, this paper proposes a fuzzy neighborhood joint entropy model based on fuzzy neighborhood self-information measure (FNSIJE) and applies it to feature selection. First, to construct four uncertain fuzzy neighborhood self-information measures of decision variables, the concept of self-information is introduced into the upper and lower approximations of FNRS from the algebra view. The relationships between these measures and their properties are discussed in detail. It is found that the fourth measure, named tolerance fuzzy neighborhood self-information, has better classification performance. Second, an uncertainty measure based on the fuzzy neighborhood joint entropy has been proposed from the information view. Inspired by both algebra and information views, the FNSIJE is proposed. Third, the K–S test is used to delete features with weak distinguishing performance, which reduces the dimensionality of high-dimensional gene datasets, thereby reducing the complexity of high-dimensional gene datasets, and then, a forward feature selection algorithm is provided. Experimental results show that compared with related methods, the presented model can select less important features and have a higher classification accuracy.

Download Full-text

Risk of Selection of Irrelevant Features from High-Dimensional Data with Small Sample Size

Springer Proceedings in Mathematics & Statistics - Stochastic Models, Statistics and Their Applications ◽

10.1007/978-3-319-13881-7_44 ◽

2015 ◽

pp. 399-405

Author(s):

Henryk Maciejewski

Keyword(s):

Sample Size ◽

Small Sample Size ◽

High Dimensional Data ◽

Small Sample ◽

High Dimensional ◽

Selection Of

Download Full-text

An Efficient Dimensionality Reduction Approach for Small-sample Size and High-dimensional Data Modeling

Journal of Computers ◽

10.4304/jcp.9.3.576-580 ◽

2014 ◽

Vol 9 (3) ◽

Cited By ~ 5

Author(s):

Xintao Qiu ◽

Dongmei Fu ◽

Zhenduo Fu

Keyword(s):

Dimensionality Reduction ◽

Sample Size ◽

Small Sample Size ◽

High Dimensional Data ◽

Data Modeling ◽

Small Sample ◽

High Dimensional ◽

Reduction Approach

Download Full-text

A Novel Granularity Optimal Feature Selection based on Multi-Variant Clustering for High Dimensional Data

Turkish Journal of Computer and Mathematics Education (TURCOMAT) ◽

10.17762/turcomat.v12i3.2031 ◽

2021 ◽

Vol 12 (3) ◽

pp. 5051-5062

Author(s):

Srinivas Kolli Et. al.

Keyword(s):

Feature Selection ◽

High Dimensional Data ◽

Classification Performance ◽

High Dimensional ◽

Second Phase ◽

Data Sets ◽

Aggressive Approach ◽

Related Data ◽

Optimal Feature ◽

Selection Of

Clustering is the most complex in multi/high dimensional data because of sub feature selection from overall features present in categorical data sources. Sub set feature be the aggressive approach to decrease feature dimensionality in mining of data, identification of patterns. Main aim behind selection of feature with respect to selection of optimal feature and decrease the redundancy. In-order to compute with redundant/irrelevant features in high dimensional sample data exploration based on feature selection calculation with data granular described in this document. Propose aNovel Granular Feature Multi-variant Clustering based Genetic Algorithm (NGFMCGA) model to evaluate the performance results in this implementation. This model main consists two phases, in first phase, based on theoretic graph grouping procedure divide features into different clusters, in second phase, select strongly representative related feature from each cluster with respect to matching of subset of features. Features present in this concept are independent because of features select from different clusters, proposed approach clustering have high probability in processing and increasing the quality of independent and useful features.Optimal subset feature selection improves accuracy of clustering and feature classification, performance of proposed approach describes better accuracy with respect to optimal subset selection is applied on publicly related data sets and it is compared with traditional supervised evolutionary approaches

Download Full-text

Genetic programming for feature construction and selection in classification on high-dimensional data

10.26686/wgtn.14312465 ◽

2021 ◽

Author(s):

Binh Tran ◽

Bing Xue ◽

Mengjie Zhang

Keyword(s):

Feature Selection ◽

Genetic Programming ◽

High Dimensional Data ◽

Classification Performance ◽

High Dimensional ◽

Feature Construction ◽

Classification Problems ◽

Memetic Computing ◽

Dimensional Classification ◽

High Level

Classification on high-dimensional data with thousands to tens of thousands of dimensions is a challenging task due to the high dimensionality and the quality of the feature set. The problem can be addressed by using feature selection to choose only informative features or feature construction to create new high-level features. Genetic programming (GP) using a tree-based representation can be used for both feature construction and implicit feature selection. This work presents a comprehensive study to investigate the use of GP for feature construction and selection on high-dimensional classification problems. Different combinations of the constructed and/or selected features are tested and compared on seven high-dimensional gene expression problems, and different classification algorithms are used to evaluate their performance. The results show that the constructed and/or selected feature sets can significantly reduce the dimensionality and maintain or even increase the classification accuracy in most cases. The cases with overfitting occurred are analysed via the distribution of features. Further analysis is also performed to show why the constructed feature can achieve promising classification performance. This is a post-peer-review, pre-copyedit version of an article published in 'Memetic Computing'. The final authenticated version is available online at: https://doi.org/10.1007/s12293-015-0173-y. The following terms of use apply: https://www.springer.com/gp/open-access/publication-policies/aam-terms-of-use.

Download Full-text

Genetic programming for feature construction and selection in classification on high-dimensional data

10.26686/wgtn.14312465.v1 ◽

2021 ◽

Author(s):

Binh Tran ◽

Bing Xue ◽

Mengjie Zhang

Keyword(s):

Feature Selection ◽

Genetic Programming ◽

High Dimensional Data ◽

Classification Performance ◽

High Dimensional ◽

Feature Construction ◽

Classification Problems ◽

Memetic Computing ◽

Dimensional Classification ◽

High Level

Classification on high-dimensional data with thousands to tens of thousands of dimensions is a challenging task due to the high dimensionality and the quality of the feature set. The problem can be addressed by using feature selection to choose only informative features or feature construction to create new high-level features. Genetic programming (GP) using a tree-based representation can be used for both feature construction and implicit feature selection. This work presents a comprehensive study to investigate the use of GP for feature construction and selection on high-dimensional classification problems. Different combinations of the constructed and/or selected features are tested and compared on seven high-dimensional gene expression problems, and different classification algorithms are used to evaluate their performance. The results show that the constructed and/or selected feature sets can significantly reduce the dimensionality and maintain or even increase the classification accuracy in most cases. The cases with overfitting occurred are analysed via the distribution of features. Further analysis is also performed to show why the constructed feature can achieve promising classification performance. This is a post-peer-review, pre-copyedit version of an article published in 'Memetic Computing'. The final authenticated version is available online at: https://doi.org/10.1007/s12293-015-0173-y. The following terms of use apply: https://www.springer.com/gp/open-access/publication-policies/aam-terms-of-use.

Download Full-text

Feature Selection for Small Sample Sets with High Dimensional Data Using Heuristic Hybrid Approach

International Journal of Engineering ◽

10.5829/ije.2020.33.02b.05 ◽

2020 ◽

Vol 33 (2) ◽

Keyword(s):

Feature Selection ◽

High Dimensional Data ◽

Hybrid Approach ◽

Small Sample ◽

High Dimensional ◽

Selection For

Download Full-text

On the solidification of the manifold of the t-distributed stochastic neighbour embedding for condition classification of machine tools

Engineering Research Express ◽

10.1088/2631-8695/ac37f0 ◽

2021 ◽

Author(s):

Jing Wang ◽

Xiaobin Cheng ◽

Xun Wang ◽

Yan Gao ◽

Bin Liu ◽

...

Keyword(s):

Neural Networks ◽

Feature Selection ◽

Feature Vector ◽

Feature Selection Method ◽

Classification Performance ◽

Small Sample ◽

High Dimensional ◽

Vibration Signals ◽

Mapping Model ◽

The Neural Networks

Abstract t-distributed stochastic neighbour embedding (t-SNE) is of considerable interest in machining condition monitoring for feature selection. In this paper, the neural networks are introduced to solidify the manifold of the t-SNE prior to classification. This leads to the improved feature selection method, namely the Net-SNE. Conventional statistical features are first extracted from vibration signals to form a high dimensional feature vector. The redundancies in the feature vector are subsequently removed by the t-SNE. Then the neural networks build a mapping model between the high dimensional feature vector and the selected features. The new data is calculated directly using the mapping model. The experiments were conducted on a lathe and a milling machine to collect vibration signals under common working conditions. The K-nearest neighbour classifier is applied to a small sample case and a class-imbalance case to compare the classification performance with and without the Net-SNE. The results demonstrate that the Net-SNE has the advantage over the t-SNE, since it can mine the discriminative features and solidifiy the manifold in the calculation of the new data. Moreover, the proposed method significantly improves the classification accuracy by Net-SNE, along with better classification performance in data-limited situations.

Download Full-text

-Plot for Testing Spherical Symmetry for High-Dimensional Data with a Small Sample Size

Journal of Probability and Statistics ◽

10.1155/2012/728565 ◽

2012 ◽

Vol 2012 ◽

pp. 1-18

Author(s):

Jiajuan Liang

Keyword(s):

Sample Size ◽

Spherical Symmetry ◽

Graphical Method ◽

Small Sample Size ◽

High Dimensional Data ◽

Monte Carlo Study ◽

Real Data ◽

Small Sample ◽

High Dimensional ◽

Data Set

High-dimensional data with a small sample size, such as microarray data and image data, are commonly encountered in some practical problems for which many variables have to be measured but it is too costly or time consuming to repeat the measurements for many times. Analysis of this kind of data poses a great challenge for statisticians. In this paper, we develop a new graphical method for testing spherical symmetry that is especially suitable for high-dimensional data with small sample size. The new graphical method associated with the local acceptance regions can provide a quick visual perception on the assumption of spherical symmetry. The performance of the new graphical method is demonstrated by a Monte Carlo study and illustrated by a real data set.

Download Full-text

Evaluating feature selection strategies for high dimensional, small sample size datasets

2011 Annual International Conference of the IEEE Engineering in Medicine and Biology Society ◽

10.1109/iembs.2011.6090214 ◽

2011 ◽

Cited By ~ 8

Author(s):

Abhishek Golugula ◽

George Lee ◽

Anant Madabhushi

Keyword(s):

Feature Selection ◽

Sample Size ◽

Small Sample Size ◽

Small Sample ◽

High Dimensional ◽

Selection Strategies

Download Full-text