scholarly journals A meta-heuristic density-based subspace clustering algorithm for high-dimensional data

2021 ◽  
Author(s):  
Parul Agarwal ◽  
Shikha Mehta ◽  
Ajith Abraham
2017 ◽  
Author(s):  
◽  
Avimanyou Kumar Vatsa

Recently emerging approaches to high-throughput phenotyping have become important tools in unraveling the biological basis of agronomically and medically important phenotypes. These experiments produce very large sets of either low or high-dimensional data. Finding clusters in the entire space of high-dimensional data (HDD) is a challenging task, because the relative distances between any two objects converge to zero with increasing dimensionality. Additionally, real data may not be mathematically well behaved. Finally, many clusters are expected on biological grounds to be "natural" -- that is, to have irregular, overlapping boundaries in different subsets of the dimensions. More precisely, the natural clusters of the data could differ in shape, size, density, and dimensionality; and they might not be disjoint. In principle, clustering such data could be done by dimension reduction methods. However, these methods convert many dimensions to a smaller set of dimensions that make the clustering results difficult to interpret and may also lead to a significant loss of information. Another possible approach is to find subspaces (subsets of dimensions) in the entire data space of the HDD. However, the existing subspace methods don't discover natural clusters. Therefore, in this dissertation I propose a novel data preprocessing method, demonstrating that a group of phenotypes are interdependent, and propose a novel density-based subspace clustering algorithm for high-dimensional data, called Dynamic Locally Density Adaptive Scalable Subspace Clustering (DynaDASC). This algorithm is relatively locally density adaptive, scalable, dynamic, and nonmetric in nature, and discovers natural clusters.


2013 ◽  
Vol 2013 ◽  
pp. 1-12 ◽  
Author(s):  
Singh Vijendra ◽  
Sahoo Laxman

Clustering high-dimensional data has been a major challenge due to the inherent sparsity of the points. Most existing clustering algorithms become substantially inefficient if the required similarity measure is computed between data points in the full-dimensional space. In this paper, we have presented a robust multi objective subspace clustering (MOSCL) algorithm for the challenging problem of high-dimensional clustering. The first phase of MOSCL performs subspace relevance analysis by detecting dense and sparse regions with their locations in data set. After detection of dense regions it eliminates outliers. MOSCL discovers subspaces in dense regions of data set and produces subspace clusters. In thorough experiments on synthetic and real-world data sets, we demonstrate that MOSCL for subspace clustering is superior to PROCLUS clustering algorithm. Additionally we investigate the effects of first phase for detecting dense regions on the results of subspace clustering. Our results indicate that removing outliers improves the accuracy of subspace clustering. The clustering results are validated by clustering error (CE) distance on various data sets. MOSCL can discover the clusters in all subspaces with high quality, and the efficiency of MOSCL outperforms PROCLUS.


2009 ◽  
Vol 35 (7) ◽  
pp. 859-866
Author(s):  
Ming LIU ◽  
Xiao-Long WANG ◽  
Yuan-Chao LIU

2020 ◽  
Author(s):  
Xiao Lai ◽  
Pu Tian

AbstractSupervised machine learning, especially deep learning based on a wide variety of neural network architectures, have contributed tremendously to fields such as marketing, computer vision and natural language processing. However, development of un-supervised machine learning algorithms has been a bottleneck of artificial intelligence. Clustering is a fundamental unsupervised task in many different subjects. Unfortunately, no present algorithm is satisfactory for clustering of high dimensional data with strong nonlinear correlations. In this work, we propose a simple and highly efficient hierarchical clustering algorithm based on encoding by composition rank vectors and tree structure, and demonstrate its utility with clustering of protein structural domains. No record comparison, which is an expensive and essential common step to all present clustering algorithms, is involved. Consequently, it achieves linear time and space computational complexity hierarchical clustering, thus applicable to arbitrarily large datasets. The key factor in this algorithm is definition of composition, which is dependent upon physical nature of target data and therefore need to be constructed case by case. Nonetheless, the algorithm is general and applicable to any high dimensional data with strong nonlinear correlations. We hope this algorithm to inspire a rich research field of encoding based clustering well beyond composition rank vector trees.


Sign in / Sign up

Export Citation Format

Share Document