Feature Selection Using Genetic Algorithm for Clustering High Dimensional Data

2018 ◽  
Vol 7 (2.11) ◽  
pp. 27 ◽  
Author(s):  
Kahkashan Kouser ◽  
Amrita Priyam

Clustering high dimensional data is one of the open problems of modern data mining. To address it, this paper proposes a new technique called GA-HDClustering, which works in two steps. First, a GA-based feature selection algorithm is designed to determine the optimal feature subset, consisting of the important features of the entire data set. Next, the K-means algorithm is applied on the optimal feature subset to find the clusters. For comparison, the traditional K-means algorithm is also applied on the full-dimensional feature space, and the result of GA-HDClustering is compared with that of the traditional clustering algorithm using several validity metrics: sum of squared error (SSE), within-group average distance (WGAD), between-group distance (BGD), and the Davies-Bouldin index (DBI). GA-HDClustering uses a genetic algorithm to search for an effective feature subspace within the large feature space formed by all dimensions of the data set. Experiments performed on standard data sets revealed that GA-HDClustering is superior to the traditional clustering algorithm.
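As a rough illustration of the two-step idea, the sketch below encodes a feature subset as a binary chromosome, evolves it with one-point crossover and bit-flip mutation, and scores each candidate by the negative SSE of a K-means run on the selected columns (SSE being the first validity metric above). Everything here, from the operators to the Iris dataset and the parameter values, is an illustrative assumption, not the authors' GA-HDClustering implementation.

```python
# Minimal sketch of a GA-driven feature-subset search for K-means clustering.
# Illustrative only: chromosome encoding, operators, and parameters are
# assumptions, not the GA-HDClustering implementation from the paper.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

rng = np.random.default_rng(0)

def fitness(mask, X, k=3):
    """Negative SSE of K-means on the selected feature subset."""
    if mask.sum() == 0:              # empty subsets are invalid
        return -np.inf
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X[:, mask.astype(bool)])
    return -km.inertia_              # inertia_ is the sum of squared errors (SSE)

def ga_select(X, pop_size=20, generations=30, p_mut=0.1, k=3):
    n = X.shape[1]
    pop = rng.integers(0, 2, size=(pop_size, n))     # binary chromosomes
    for _ in range(generations):
        scores = np.array([fitness(ind, X, k) for ind in pop])
        parents = pop[np.argsort(scores)[::-1][: pop_size // 2]]  # truncation selection
        children = []
        for _ in range(pop_size - len(parents)):
            a, b = parents[rng.integers(len(parents), size=2)]
            cut = rng.integers(1, n)                 # one-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            flip = rng.random(n) < p_mut             # bit-flip mutation
            child[flip] ^= 1
            children.append(child)
        pop = np.vstack([parents, children])
    best = max(pop, key=lambda ind: fitness(ind, X, k))
    return best.astype(bool)

X = load_iris().data
mask = ga_select(X)
print("selected features:", np.flatnonzero(mask))
```

Running plain K-means on all columns and comparing its SSE against the subset's would mirror the paper's baseline comparison.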

2013 ◽  
Vol 347-350 ◽  
pp. 2344-2348
Author(s):  
Lin Cheng Jiang ◽  
Wen Tang Tan ◽  
Zhen Wen Wang ◽  
Feng Jing Yin ◽  
Bin Ge ◽  
...  

Feature selection has become a research focus in application areas with high dimensional data. Nonnegative matrix factorization (NMF) is a good method for dimensionality reduction, but it cannot select the optimal feature subset because it is a feature extraction method. In this paper, a two-step strategy based on improved NMF is proposed. The first step obtains the basis of each category in the dataset by NMF; added constraints guarantee that these bases are sparse and largely distinct from each other, which contributes to classification. An auxiliary function is used to prove that the algorithm converges. In the second step, the classic ReliefF algorithm is used to weight each feature by all the basis vectors and choose the optimal feature subset. The experimental results revealed that the proposed method can select a representative and relevant feature subset that is effective in improving the performance of the classifier.
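A compressed sketch of that two-step shape follows. Scikit-learn's stock NMF stands in for the paper's constrained NMF, and a simple loading-contrast score stands in for the ReliefF weighting, so this shows the pipeline's structure rather than the proposed method itself; the dataset, component count, and subset size are likewise assumptions.

```python
# Sketch of the two-step idea: (1) NMF bases per class, (2) weight features by
# how strongly the class bases disagree, keep the top-k. The paper's sparsity
# constraints and ReliefF weighting are replaced here by plain sklearn NMF and
# a simple loading-contrast score; this is an assumption-laden approximation.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import NMF

X, y = load_digits(return_X_y=True)

# Step 1: one small NMF per class; average its basis vectors into a class profile.
profiles = []
for c in np.unique(y):
    model = NMF(n_components=5, init="nndsvda", max_iter=400, random_state=0)
    model.fit(X[y == c])
    profiles.append(model.components_.mean(axis=0))   # shape (n_features,)
profiles = np.vstack(profiles)

# Step 2: a feature is useful if its loadings differ across the class profiles.
weights = profiles.std(axis=0)
top_k = np.argsort(weights)[::-1][:20]
print("top 20 features by basis contrast:", sorted(top_k))
```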


2021 ◽  
Vol 2021 ◽  
pp. 1-10
Author(s):  
Jia Yun-Tao ◽  
Zhang Wan-Qiu ◽  
He Chun-Lin

For high-dimensional data with many redundant features, existing feature selection algorithms still suffer from the “curse of dimensionality.” In view of this, the paper studies a new two-phase evolutionary feature selection algorithm, called the clustering-guided integer brain storm optimization algorithm (IBSO-C). In the first phase, an importance-guided feature clustering method is proposed to group similar features, so that the search space in the second phase is reduced considerably. The second phase focuses on finding the optimal feature subset using an improved integer brain storm optimization. Moreover, a new encoding strategy and a time-varying integer update method for individuals are proposed to improve the search performance of brain storm optimization in the second phase. Since the number of feature clusters is far smaller than the number of original features, IBSO-C can find an optimal feature subset quickly. Experimental results comparing IBSO-C with several existing algorithms on real-world datasets show that it can find feature subsets with high classification accuracy at lower computational cost.
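The two-phase structure can be sketched as below: phase one clusters the standardized feature columns with k-means so the search space collapses to one representative per cluster, and phase two searches over integer choices of representatives. Brain storm optimization itself is involved, so a plain random integer search stands in for IBSO here; the dataset, cluster count, and search budget are all illustrative assumptions.

```python
# Two-phase sketch: (1) cluster similar features so the search space shrinks to
# one choice per cluster, (2) search for the best representative per cluster.
# A plain random integer search stands in for brain storm optimization, so this
# shows the structure of IBSO-C, not the algorithm itself.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X, y = load_breast_cancer(return_X_y=True)
Xs = StandardScaler().fit_transform(X)

# Phase 1: k-means over feature columns groups correlated features together.
n_clusters = 8
labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(Xs.T)
clusters = [np.flatnonzero(labels == c) for c in range(n_clusters)]

def evaluate(choice):
    """CV accuracy using one representative feature per cluster (integer encoding)."""
    idx = [cl[i % len(cl)] for cl, i in zip(clusters, choice)]
    return cross_val_score(LogisticRegression(max_iter=2000), Xs[:, idx], y, cv=3).mean()

# Phase 2: random integer search over representatives (IBSO stand-in).
best_choice, best_acc = None, -1.0
for _ in range(40):
    choice = rng.integers(0, 30, size=n_clusters)
    acc = evaluate(choice)
    if acc > best_acc:
        best_choice, best_acc = choice, acc

idx = [cl[i % len(cl)] for cl, i in zip(clusters, best_choice)]
print("selected features:", sorted(idx), "cv accuracy: %.3f" % best_acc)
```

The point of the encoding is visible even in this toy: each candidate is only 8 integers, not a 30-bit mask, which is why shrinking the search space first pays off.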


2014 ◽  
Vol 507 ◽  
pp. 806-809
Author(s):  
Shu Fang Li ◽  
Qin Jia ◽  
Hong Liang

To achieve a real-time automatic classification method for red tide algae with a high accuracy rate, this paper proposes using ReliefF-SBS for feature selection. Feature analysis is first performed on the original red tide algae image data set. On this basis, feature selection removes the irrelevant and redundant features from the original feature set to obtain the optimal feature subset and reduce their impact on classification accuracy. The classification results before and after feature selection are then compared for two kinds of classifiers, SVM and KNN.
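Since the red tide algae image features are not public, the sketch below reproduces only the shape of the pipeline on a stand-in dataset: a simplified Relief weighting (one nearest hit and one nearest miss per sampled instance, rather than full ReliefF), a backward-elimination step compressed into dropping the lowest-weight features in one pass, and an SVM-versus-KNN accuracy comparison before and after selection. All of these simplifications are assumptions.

```python
# Sketch of the ReliefF-SBS idea on a stand-in dataset: weight features
# Relief-style, drop the weakest ones, and compare SVM vs. KNN accuracy
# before and after selection. The simplified Relief below is an assumption;
# the paper uses full ReliefF with sequential backward selection.
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X = StandardScaler().fit_transform(X)

def relief_weights(X, y, n_samples=100, seed=0):
    """Simplified Relief: reward features that separate nearest hit from nearest miss."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for i in rng.integers(0, len(X), size=n_samples):
        d = np.abs(X - X[i]).sum(axis=1)       # L1 distances to sample i
        d[i] = np.inf                          # exclude the sample itself
        hit = np.argmin(np.where(y == y[i], d, np.inf))
        miss = np.argmin(np.where(y != y[i], d, np.inf))
        w += np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])
    return w

w = relief_weights(X, y)
keep = np.argsort(w)[-8:]   # compressed SBS: drop the lowest-weight features in one pass

for name, clf in [("SVM", SVC()), ("KNN", KNeighborsClassifier())]:
    before = cross_val_score(clf, X, y, cv=5).mean()
    after = cross_val_score(clf, X[:, keep], y, cv=5).mean()
    print(f"{name}: all features {before:.3f} -> selected {after:.3f}")
```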


Author(s):  
Ilangovan Sangaiya ◽  
A. Vincent Antony Kumar

In data mining, feature selection is required to select relevant features and remove unimportant, irrelevant features from an original data set based on some evaluation criteria. Filter and wrapper are the two methods commonly used, but here the authors propose a hybrid feature selection method that takes advantage of both. The proposed method uses symmetrical uncertainty and genetic algorithms to select the optimal feature subset, so as to improve processing time by reducing the dimension of the data set without compromising classification accuracy. The proposed hybrid algorithm is much faster, and scales better to large data sets in terms of selected features, classification accuracy, and running time, than most existing algorithms.
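The filter half of the hybrid can be sketched directly: symmetrical uncertainty between a binned feature x and the class y is SU(x, y) = 2·I(x; y) / (H(x) + H(y)), and features are ranked by it before a GA wrapper (as in the first sketch above) searches the reduced space. The binning scheme and dataset below are illustrative assumptions.

```python
# Sketch of the filter half of the hybrid: symmetrical uncertainty (SU) between
# a discretized feature and the class. The GA wrapper half would then search
# over the top-ranked features; bin count and dataset are assumptions.
import numpy as np
from scipy.stats import entropy
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import mutual_info_score

def symmetrical_uncertainty(x, y, bins=10):
    """SU = 2 * I(x; y) / (H(x) + H(y)) for a binned numeric feature x and labels y."""
    xb = np.digitize(x, np.histogram_bin_edges(x, bins=bins))
    i_xy = mutual_info_score(xb, y)              # mutual information (nats)
    h_x = entropy(np.bincount(xb) / len(xb))
    h_y = entropy(np.bincount(y) / len(y))
    return 2.0 * i_xy / (h_x + h_y) if h_x + h_y > 0 else 0.0

X, y = load_breast_cancer(return_X_y=True)
su = np.array([symmetrical_uncertainty(X[:, j], y) for j in range(X.shape[1])])
print("top 10 features by SU:", np.argsort(su)[::-1][:10])
```

SU normalizes mutual information into [0, 1], which keeps features with many distinct values from dominating the ranking, which is the usual reason to prefer it over raw information gain in a filter.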


Author(s):  
Smita Chormunge ◽  
Sudarson Jena

Feature selection solves the dimensionality problem by removing irrelevant and redundant features. Existing feature selection algorithms take considerable time to obtain a feature subset for high dimensional data. This paper proposes a feature selection algorithm for high dimensional data based on information gain measures, termed IFSA (Information gain based Feature Selection Algorithm), to produce the optimal feature subset in efficient time and improve the computational performance of learning algorithms. The IFSA algorithm works in two steps: first, a filter is applied to the dataset; second, a small feature subset is produced using the information gain measure. Extensive experiments are carried out to compare the proposed algorithm with other methods with respect to two different classifiers (Naive Bayes and IBK) on microarray and text data sets. The results demonstrate that IFSA not only produces the most relevant feature subset in efficient time but also improves classifier performance.
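A minimal sketch of that two-step shape, assuming a variance filter for the first step and scikit-learn's mutual_info_classif as the information gain estimate for the second; the dataset, thresholds, and subset size are illustrative, and only the Naive Bayes half of the paper's classifier comparison is shown.

```python
# Sketch of the IFSA two-step shape: filter the data, then rank features by
# information gain (estimated with mutual information) and keep a small subset.
# Dataset, filter rule, and subset size are illustrative assumptions.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.feature_selection import VarianceThreshold, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_digits(return_X_y=True)

# Step 1: filter near-constant features.
Xf = VarianceThreshold(threshold=0.1).fit_transform(X)

# Step 2: rank the remaining features by information gain and keep the top 20.
gain = mutual_info_classif(Xf, y, random_state=0)
keep = np.argsort(gain)[::-1][:20]

full = cross_val_score(GaussianNB(), X, y, cv=5).mean()
subset = cross_val_score(GaussianNB(), Xf[:, keep], y, cv=5).mean()
print(f"Naive Bayes accuracy: all features {full:.3f}, IG subset {subset:.3f}")
```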


2021 ◽  
Vol 50 (1) ◽  
pp. 138-152
Author(s):  
Mujeeb Ur Rehman ◽  
Dost Muhammad Khan

Recently, anomaly detection has drawn a strong response from data mining researchers, as its reputation has grown steadily in practical domains such as product marketing, fraud detection, medical diagnosis, and fault detection. High dimensional data subjected to outlier detection poses exceptional challenges for data mining experts because of the inherent problems of the curse of dimensionality and the resemblance of distant and adjoining points. Traditional algorithms and techniques perform outlier detection on the full feature space. Such customary methodologies concentrate largely on low dimensional data and hence prove ineffective at discovering anomalies in a data set comprising a high number of dimensions. Digging out the anomalies present in a high dimensional data set becomes a very difficult and tiresome job when all subsets of projections need to be explored. All data points in high dimensional data behave like similar observations because of an intrinsic property of such data: the distance between observations approaches zero as the number of dimensions extends towards infinity. This research work proposes a novel technique that explores the deviation among all data points and embeds its findings inside well-established density-based techniques. It is a state-of-the-art technique, as it opens a new breadth of research towards resolving the inherent problems of high dimensional data, where outliers reside within clusters having different densities. A high dimensional dataset from the UCI Machine Learning Repository is chosen to test the proposed technique, and its results are compared with those of density-based techniques to evaluate its efficiency.
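The paper's deviation-embedding step is not reproduced here, but the kind of density-based baseline it builds on and is compared against can be sketched in a few lines with scikit-learn's Local Outlier Factor on a 30-dimensional UCI dataset; the neighbor count and contamination rate are illustrative assumptions.

```python
# Sketch of the density-based baseline the proposed technique is compared
# against: Local Outlier Factor on a high-dimensional UCI dataset. The paper's
# deviation-embedding step is not reproduced; this only shows the kind of
# density-based detector it extends.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import StandardScaler

X, _ = load_breast_cancer(return_X_y=True)   # 30 dimensions; a UCI dataset
Xs = StandardScaler().fit_transform(X)

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
labels = lof.fit_predict(Xs)                 # -1 marks outliers
scores = -lof.negative_outlier_factor_       # larger = more anomalous

print("flagged outliers:", np.flatnonzero(labels == -1)[:10], "...")
print("top anomaly score: %.2f" % scores.max())
```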

