A meta-heuristic density-based subspace clustering algorithm for high-dimensional data

Recently emerging approaches to high-throughput phenotyping have become important tools for unraveling the biological basis of agronomically and medically important phenotypes. These experiments produce very large sets of either low- or high-dimensional data. Finding clusters in the full space of high-dimensional data (HDD) is challenging because, as dimensionality increases, the relative difference between the distances from an object to its nearest and farthest neighbors tends toward zero. Additionally, real data may not be mathematically well behaved. Finally, many clusters are expected on biological grounds to be "natural", that is, to have irregular, overlapping boundaries in different subsets of the dimensions. More precisely, the natural clusters of the data could differ in shape, size, density, and dimensionality, and they might not be disjoint. In principle, such data could be clustered after dimension reduction. However, these methods map many dimensions onto a smaller set of derived dimensions, which makes the clustering results difficult to interpret and may also lead to a significant loss of information. Another possible approach is to find subspaces (subsets of dimensions) of the full data space of the HDD. However, existing subspace methods do not discover natural clusters. Therefore, in this dissertation I propose a novel data preprocessing method, demonstrating that a group of phenotypes are interdependent, and propose a novel density-based subspace clustering algorithm for high-dimensional data, called Dynamic Locally Density Adaptive Scalable Subspace Clustering (DynaDASC). This algorithm is relatively locally density adaptive, scalable, dynamic, and nonmetric in nature, and discovers natural clusters.
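
The loss of distance contrast in high dimensions that this abstract refers to can be illustrated with a small experiment. The sketch below is not taken from the dissertation; it assumes points drawn uniformly from the unit hypercube and Euclidean distance, and the function name `relative_contrast` is hypothetical. It measures how much farther the farthest point is from a query than the nearest one, relative to the nearest distance.

```python
import math
import random

def relative_contrast(n_points, dim, seed=0):
    """Return (d_max - d_min) / d_min for the distances from one random
    query point to n_points uniform random points in the unit hypercube."""
    rng = random.Random(seed)  # fixed seed so the experiment is repeatable
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))  # Euclidean distance
    d_min, d_max = min(dists), max(dists)
    return (d_max - d_min) / d_min

# Assumed illustrative sizes: 200 points in 2 versus 1000 dimensions.
contrast_low = relative_contrast(200, 2)
contrast_high = relative_contrast(200, 1000)
print(f"contrast in 2 dimensions:    {contrast_low:.2f}")
print(f"contrast in 1000 dimensions: {contrast_high:.2f}")
```

Under these assumptions the contrast in 1000 dimensions comes out far smaller than in 2 dimensions, which is why full-space nearest-neighbor and density estimates lose discriminating power and why restricting the search to subspaces is attractive.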

Subspace Clustering of High-Dimensional Data: An Evolutionary Approach

Applied Computational Intelligence and Soft Computing ◽ 10.1155/2013/863146 ◽ 2013 ◽ Vol 2013 ◽ pp. 1-12 ◽ Cited By ~ 3

Author(s): Singh Vijendra ◽ Sahoo Laxman

Keyword(s): Clustering Algorithm ◽ Dimensional Space ◽ Clustering Algorithms ◽ High Dimensional Data ◽ Subspace Clustering ◽ High Dimensional ◽ Data Sets ◽ Real World Data ◽ Data Set ◽ Data Points

Clustering high-dimensional data has been a major challenge due to the inherent sparsity of the points. Most existing clustering algorithms become substantially inefficient if the required similarity measure is computed between data points in the full-dimensional space. In this paper, we present a robust multi-objective subspace clustering (MOSCL) algorithm for the challenging problem of high-dimensional clustering. The first phase of MOSCL performs subspace relevance analysis by detecting dense and sparse regions and their locations in the data set. After detecting the dense regions, it eliminates outliers. MOSCL then discovers subspaces in the dense regions of the data set and produces subspace clusters. In thorough experiments on synthetic and real-world data sets, we demonstrate that MOSCL is superior to the PROCLUS clustering algorithm for subspace clustering. Additionally, we investigate how the first phase, which detects dense regions, affects the results of subspace clustering. Our results indicate that removing outliers improves the accuracy of subspace clustering. The clustering results are validated by the clustering error (CE) distance on various data sets. MOSCL can discover the clusters in all subspaces with high quality, and it is more efficient than PROCLUS.
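
The idea of separating dense regions from outliers within a subspace can be sketched as follows. This is not the MOSCL procedure, whose details the abstract does not give; it is a generic neighbor-counting rule, and the function name `split_dense_and_outliers` and the parameters `dims`, `eps`, and `min_pts` are assumptions made for illustration.

```python
import math

def split_dense_and_outliers(points, dims, eps, min_pts):
    """Return (dense, outliers) as lists of point indices.

    A point counts as dense if at least min_pts other points lie within
    distance eps of it, measured only over the dimensions listed in dims.
    """
    projected = [[p[d] for d in dims] for p in points]
    dense, outliers = [], []
    for i, p in enumerate(projected):
        neighbors = sum(
            1
            for j, q in enumerate(projected)
            if i != j and math.dist(p, q) <= eps
        )
        (dense if neighbors >= min_pts else outliers).append(i)
    return dense, outliers

# Hypothetical data: three points agree on dimensions 0 and 1 but are
# spread out along dimension 2; the fourth point is isolated.
points = [
    (0.10, 0.10, 9.0),
    (0.12, 0.11, 1.0),
    (0.11, 0.09, 5.0),
    (0.90, 0.95, 5.1),
]
dense, outliers = split_dense_and_outliers(points, dims=(0, 1), eps=0.05, min_pts=2)
dense_full, _ = split_dense_and_outliers(points, dims=(0, 1, 2), eps=0.05, min_pts=2)
print(dense, outliers)  # dense region found in the subspace {0, 1}
print(dense_full)       # nothing is dense in the full space
```

With this toy data, the first three points form a dense region in the subspace spanned by dimensions 0 and 1, while no point is dense when all three dimensions are used. The pairwise loop is quadratic in the number of points and is meant only to show the principle.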

(EDSFCA) Efficient Document Subspace Clustering in High-Dimensional Data using Fast Clustering Algorithm

International Journal of Computer Sciences and Engineering ◽ 10.26438/ijcse/v7i2.10101015 ◽ 2019 ◽ Vol 7 (2) ◽ pp. 1010-1015

Author(s): adhika K R ◽ Pushpa C N ◽ Thriveni J ◽ Venugopal K R

Keyword(s): Clustering Algorithm ◽ High Dimensional Data ◽ Subspace Clustering ◽ High Dimensional

A Fast Clustering Algorithm for Large-scale and High Dimensional Data

ACTA AUTOMATICA SINICA ◽ 10.3724/sp.j.1004.2009.00859 ◽ 2009 ◽ Vol 35 (7) ◽ pp. 859-866

Author(s): Ming LIU ◽ Xiao-Long WANG ◽ Yuan-Chao LIU

Keyword(s): Large Scale ◽ Clustering Algorithm ◽ High Dimensional Data ◽ High Dimensional

Scalable hierarchical clustering by composition rank vector encoding and tree structure

10.1101/2020.04.12.038026 ◽ 2020

Author(s): Xiao Lai ◽ Pu Tian

Keyword(s): Machine Learning ◽ Hierarchical Clustering ◽ Clustering Algorithm ◽ High Dimensional Data ◽ Machine Learning Algorithms ◽ Tree Structure ◽ Supervised Machine Learning ◽ High Dimensional ◽ Rank Vector ◽ Nonlinear Correlations

Supervised machine learning, especially deep learning based on a wide variety of neural network architectures, has contributed tremendously to fields such as marketing, computer vision and natural language processing. However, the development of unsupervised machine learning algorithms has been a bottleneck of artificial intelligence. Clustering is a fundamental unsupervised task in many different subjects. Unfortunately, no present algorithm is satisfactory for clustering high-dimensional data with strong nonlinear correlations. In this work, we propose a simple and highly efficient hierarchical clustering algorithm based on encoding by composition rank vectors and a tree structure, and demonstrate its utility by clustering protein structural domains. No record comparison, which is an expensive and essential step common to all present clustering algorithms, is involved. Consequently, the algorithm achieves hierarchical clustering with linear time and space complexity and is thus applicable to arbitrarily large datasets. The key factor in this algorithm is the definition of composition, which depends on the physical nature of the target data and therefore needs to be constructed case by case. Nonetheless, the algorithm is general and applicable to any high-dimensional data with strong nonlinear correlations. We hope this algorithm will inspire a rich research field of encoding-based clustering well beyond composition rank vector trees.
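
How an encoding can replace pairwise record comparison may be easier to see in a toy sketch. The abstract does not define the composition or the exact encoding, so everything below is an assumption: each record is taken to have a composition given as a tuple of component fractions, its rank vector is taken to be the component indices ordered from largest to smallest value, and the names `rank_vector` and `build_tree` are hypothetical.

```python
def rank_vector(composition):
    """Indices of the components, ordered from largest to smallest value."""
    return tuple(sorted(range(len(composition)), key=lambda i: -composition[i]))

def build_tree(records, depth):
    """Insert each record into a nested-dict tree keyed by the first `depth`
    entries of its rank vector. Each record is visited once and is never
    compared with another record."""
    tree = {}
    for name, composition in records.items():
        node = tree
        for idx in rank_vector(composition)[:depth]:
            node = node.setdefault(idx, {})
        # Leaf membership lists are stored under a string key so they cannot
        # collide with the integer component indices used for branching.
        node.setdefault("_members", []).append(name)
    return tree

# Hypothetical compositions over three components.
records = {
    "a": (0.60, 0.30, 0.10),
    "b": (0.50, 0.40, 0.10),
    "c": (0.10, 0.20, 0.70),
    "d": (0.55, 0.05, 0.40),
}
tree = build_tree(records, depth=2)
print(tree)
```

In this sketch, records "a" and "b" share the path 0 → 1 and end up in the same leaf, "d" shares only the first branch with them, and "c" sits in a separate subtree, so the tree depth at which two records diverge acts as a coarse hierarchical similarity. For a fixed composition length, the cost grows linearly with the number of records.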

Fuzzy C Means Clustering Algorithm for High Dimensional Data Using Feature Subset Selection Technique

IOSR Journal of Computer Engineering ◽ 10.9790/0661-16226469 ◽ 2014 ◽ Vol 16 (2) ◽ pp. 64-69 ◽ Cited By ~ 1

Author(s): N. Manjula ◽ S. Pandiarajan ◽ J. Jagadeesan

Keyword(s): Clustering Algorithm ◽ High Dimensional Data ◽ Subset Selection ◽ Feature Subset Selection ◽ High Dimensional ◽ Feature Subset ◽ Selection Technique ◽ Fuzzy C Means ◽ Fuzzy C Means Clustering