Clustering Algorithms For High Dimensional Data – A Survey Of Issues And Existing Approaches

Clustering is the most prominent data mining technique used for grouping the data into clusters based on distance measures. With the advent growth of high dimensional data such as microarray gene expression data, and grouping high dimensional data into clusters will encounter the similarity between the objects in the full dimensional space is often invalid because it contains different types of data. The process of grouping into high dimensional data into clusters is not accurate and perhaps not up to the level of expectation when the dimension of the dataset is high. It is now focusing tremendous attention towards research and development. The performance issues of the data clustering in high dimensional data it is necessary to study issues like dimensionality reduction, redundancy elimination, subspace clustering, co-clustering and data labeling for clusters are to analyzed and improved. In this paper, we presented a brief comparison of the existing algorithms that were mainly focusing at clustering on high dimensional data.

Download Full-text

Clustering High Dimensional Non-Linear Data with Denclue, Optics and Clique Algorithms

International Journal of Recent Technology and Engineering - 2 ◽

10.35940/ijrte.c6671.098319 ◽

2019 ◽

Vol 8 (3) ◽

pp. 8844-8848

Keyword(s):

Dimensional Space ◽

High Dimensional Data ◽

Microarray Gene Expression Data ◽

High Dimensional ◽

Expression Data ◽

Large Dataset ◽

Microarray Gene Expression ◽

Data Set ◽

Different Types ◽

Data Objects

Clustering is a technique in data mining which deals with huge amount of data. Clustering is intended to help a user in discovering and understanding the natural structure in a data set and abstract the meaning of large dataset. It is the task of partitioning objects of a data set into distinct groups such that two objects from one cluster are similar to each other, whereas two objects from distinct clusters are dissimilar. Clustering is unsupervised learning in which we are not provided with classes, where we can place the data objects. With the advent growth of high dimensional data such as microarray gene expression data, and grouping high dimensional data into clusters will encounter the similarity between the objects in the full dimensional space is often invalid because it contains different types of data. The process of grouping into high dimensional data into clusters is not accurate and perhaps not up to the level of expectation when the dimension of the dataset is high.

Download Full-text

Subspace Clustering of High Dimensional Data Using Differential Evolution

Nature-Inspired Algorithms for Big Data Frameworks - Advances in Computational Intelligence and Robotics ◽

10.4018/978-1-5225-5852-1.ch003 ◽

2019 ◽

pp. 47-74 ◽

Cited By ~ 1

Author(s):

Parul Agarwal ◽

Shikha Mehta

Keyword(s):

Differential Evolution ◽

Distance Measure ◽

Dimensional Space ◽

Clustering Algorithms ◽

High Dimensional Data ◽

Subspace Clustering ◽

High Dimensional ◽

Dbscan Clustering ◽

Evolution Algorithms ◽

Self Adaptive

Subspace clustering approaches cluster high dimensional data in different subspaces. It means grouping the data with different relevant subsets of dimensions. This technique has become very effective as a distance measure becomes ineffective in a high dimensional space. This chapter presents a novel evolutionary approach to a bottom up subspace clustering SUBSPACE_DE which is scalable to high dimensional data. SUBSPACE_DE uses a self-adaptive DBSCAN algorithm to perform clustering in data instances of each attribute and maximal subspaces. Self-adaptive DBSCAN clustering algorithms accept input from differential evolution algorithms. The proposed SUBSPACE_DE algorithm is tested on 14 datasets, both real and synthetic. It is compared with 11 existing subspace clustering algorithms. Evaluation metrics such as F1_Measure and accuracy are used. Performance analysis of the proposed algorithms is considerably better on a success rate ratio ranking in both accuracy and F1_Measure. SUBSPACE_DE also has potential scalability on high dimensional datasets.

Download Full-text

M-Denclue for Effective Data Clustering in High Dimensional Non-Linear Data

International Journal of Innovative Technology and Exploring Engineering - Special Issue ◽

10.35940/ijitee.a9109.119119 ◽

2019 ◽

Vol 9 (1) ◽

pp. 2925-2927

Keyword(s):

Clustering Algorithms ◽

High Dimensional Data ◽

Research Work ◽

Curse Of Dimensionality ◽

Distance Measures ◽

High Dimensional ◽

Clustering Methods ◽

Non Linear ◽

Low Dimensional ◽

Automatic Grouping

Clustering is a data mining task devoted to the automatic grouping of data based on mutual similarity. Clustering in high-dimensional spaces is a recurrent problem in many domains. It affects time complexity, space complexity, scalability and accuracy of clustering methods. Highdimensional non-linear datausually live in different low dimensional subspaces hidden in the original space. As high‐dimensional objects appear almost alike, new approaches for clustering are required. This research has focused on developing Mathematical models, techniques and clustering algorithms specifically for high‐dimensional data. The innocent growth in the fields of communication and technology, there is tremendous growth in high dimensional data spaces. As the variant of dimensions on high dimensional non-linear data increases, many clustering techniques begin to suffer from the curse of dimensionality, de-grading the quality of the results. In high dimensional non-linear data, the data becomes very sparse and distance measures become increasingly meaningless. The principal challenge for clustering high dimensional data is to overcome the “curse of dimensionality”. This research work concentrates on devising an enhanced algorithm for clustering high dimensional non-linear data.

Download Full-text

Subspace Clustering of High-Dimensional Data: An Evolutionary Approach

Applied Computational Intelligence and Soft Computing ◽

10.1155/2013/863146 ◽

2013 ◽

Vol 2013 ◽

pp. 1-12 ◽

Cited By ~ 3

Author(s):

Singh Vijendra ◽

Sahoo Laxman

Keyword(s):

Clustering Algorithm ◽

Dimensional Space ◽

Clustering Algorithms ◽

High Dimensional Data ◽

Subspace Clustering ◽

High Dimensional ◽

Data Sets ◽

Real World Data ◽

Data Set ◽

Data Points

Clustering high-dimensional data has been a major challenge due to the inherent sparsity of the points. Most existing clustering algorithms become substantially inefficient if the required similarity measure is computed between data points in the full-dimensional space. In this paper, we have presented a robust multi objective subspace clustering (MOSCL) algorithm for the challenging problem of high-dimensional clustering. The first phase of MOSCL performs subspace relevance analysis by detecting dense and sparse regions with their locations in data set. After detection of dense regions it eliminates outliers. MOSCL discovers subspaces in dense regions of data set and produces subspace clusters. In thorough experiments on synthetic and real-world data sets, we demonstrate that MOSCL for subspace clustering is superior to PROCLUS clustering algorithm. Additionally we investigate the effects of first phase for detecting dense regions on the results of subspace clustering. Our results indicate that removing outliers improves the accuracy of subspace clustering. The clustering results are validated by clustering error (CE) distance on various data sets. MOSCL can discover the clusters in all subspaces with high quality, and the efficiency of MOSCL outperforms PROCLUS.

Download Full-text

Impact of Partition Based Clustering Algorithms to Cluster Samples in Microarray Gene Expression Data

Learning and Analytics in Intelligent Systems - Intelligent Techniques and Applications in Science and Technology ◽

10.1007/978-3-030-42363-6_77 ◽

2020 ◽

pp. 659-668

Author(s):

Chandra Das ◽

Shilpi Bose ◽

Debanjana Karmakar ◽

Agniswar Roy ◽

Natasha Ghosh ◽

...

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Clustering Algorithms ◽

Microarray Gene Expression Data ◽

Expression Data ◽

Microarray Gene Expression ◽

Microarray Gene

Download Full-text

Gene Ontology Analysis of 3D Microarray Gene Expression Data using Hybrid PSO Optimization

International Journal of Innovative Technology and Exploring Engineering - Special Issue ◽

10.35940/ijitee.k1261.0981119 ◽

2019 ◽

Vol 8 (11) ◽

pp. 3890-3896

Keyword(s):

Gene Expression ◽

Gene Ontology ◽

Gene Expression Data ◽

Biological Significance ◽

High Volume ◽

Microarray Gene Expression Data ◽

Expression Data ◽

Data Mining Technique ◽

Microarray Gene Expression ◽

3D Gene

At present, triclustering is the well known data mining technique for analysis of 3D gene expression data (GST). Triclustering is a simultaneously clustering of subset of Gene (G), subset of Sample (S), and over a subset of Time point (T). Triclustering approach identifies a coherent pattern in the 3D gene expression data using Mean Correlation Value (MCV). In this chapter, Hybrid PSO based algorithm is developed for triclustering of 3D gene expression data. This algorithm can effectively find the coherent pattern with high volume of a tricluster. The experimental study is conducted on yeast cycle dataset to study the biological significance of the coherent tricluster using gene ontology tool

Download Full-text

INCORPORATING FEATURE RANKING AND EVOLUTIONARY METHODS FOR THE CLASSIFICATION OF HIGH-DIMENSIONAL DNA MICROARRAY GENE EXPRESSION DATA

Australasian Medical Journal ◽

10.21767/amj.2013.1641 ◽

2013 ◽

Vol 06 (05) ◽

Author(s):

Mani Abedini ◽

Michael Kirley ◽

Raymond Chiong

Keyword(s):

Gene Expression ◽

Dna Microarray ◽

Gene Expression Data ◽

Microarray Gene Expression Data ◽

High Dimensional ◽

Feature Ranking ◽

Expression Data ◽

Microarray Gene Expression ◽

Microarray Gene

Download Full-text

Effectiveness of Different Partition Based Clustering Algorithms for Estimation of Missing Values in Microarray Gene Expression Data

Advances in Computing and Information Technology - Advances in Intelligent Systems and Computing ◽

10.1007/978-3-642-31552-7_5 ◽

2013 ◽

pp. 37-47 ◽

Cited By ~ 2

Author(s):

Shilpi Bose ◽

Chandra Das ◽

Abirlal Chakraborty ◽

Samiran Chattopadhyay

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Missing Values ◽

Clustering Algorithms ◽

Microarray Gene Expression Data ◽

Expression Data ◽

Microarray Gene Expression ◽

Microarray Gene

Download Full-text

Projected Clustering for Biological Data Analysis

Encyclopedia of Data Warehousing and Mining, Second Edition ◽

10.4018/978-1-60566-010-3.ch247 ◽

2011 ◽

pp. 1617-1622

Author(s):

Ping Deng ◽

Qingkai Ma ◽

Weili Wu

Keyword(s):

Nearest Neighbor ◽

Dimensional Space ◽

Clustering Algorithms ◽

Biological Data ◽

High Dimensional ◽

Projected Clustering ◽

Cluster Data ◽

Biological Data Analysis ◽

Data Points ◽

Entire Dataset

Clustering can be considered as the most important unsupervised learning problem. It has been discussed thoroughly by both statistics and database communities due to its numerous applications in problems such as classification, machine learning, and data mining. A summary of clustering techniques can be found in (Berkhin, 2002). Most known clustering algorithms such as DBSCAN (Easter, Kriegel, Sander, & Xu, 1996) and CURE (Guha, Rastogi, & Shim, 1998) cluster data points based on full dimensions. When the dimensional space grows higher, the above algorithms lose their efficiency and accuracy because of the so-called “curse of dimensionality”. It is shown in (Beyer, Goldstein, Ramakrishnan, & Shaft, 1999) that computing the distance based on full dimensions is not meaningful in high dimensional space since the distance of a point to its nearest neighbor approaches the distance to its farthest neighbor as dimensionality increases. Actually, natural clusters might exist in subspaces. Data points in different clusters may be correlated with respect to different subsets of dimensions. In order to solve this problem, feature selection (Kohavi & Sommerfield, 1995) and dimension reduction (Raymer, Punch, Goodman, Kuhn, & Jain, 2000) have been proposed to find the closely correlated dimensions for all the data and the clusters in such dimensions. Although both methods reduce the dimensionality of the space before clustering, the case where clusters may exist in different subspaces of full dimensions is not handled well. Projected clustering has been proposed recently to effectively deal with high dimensionalities. Finding clusters and their relevant dimensions are the objectives of projected clustering algorithms. Instead of projecting the entire dataset on the same subspace, projected clustering focuses on finding specific projection for each cluster such that the similarity is reserved as much as possible.

Download Full-text