scholarly journals Subspace Clustering of High-Dimensional Data: An Evolutionary Approach

2013 ◽  
Vol 2013 ◽  
pp. 1-12 ◽  
Author(s):  
Singh Vijendra ◽  
Sahoo Laxman

Clustering high-dimensional data has been a major challenge due to the inherent sparsity of the points. Most existing clustering algorithms become substantially inefficient if the required similarity measure is computed between data points in the full-dimensional space. In this paper, we have presented a robust multi objective subspace clustering (MOSCL) algorithm for the challenging problem of high-dimensional clustering. The first phase of MOSCL performs subspace relevance analysis by detecting dense and sparse regions with their locations in data set. After detection of dense regions it eliminates outliers. MOSCL discovers subspaces in dense regions of data set and produces subspace clusters. In thorough experiments on synthetic and real-world data sets, we demonstrate that MOSCL for subspace clustering is superior to PROCLUS clustering algorithm. Additionally we investigate the effects of first phase for detecting dense regions on the results of subspace clustering. Our results indicate that removing outliers improves the accuracy of subspace clustering. The clustering results are validated by clustering error (CE) distance on various data sets. MOSCL can discover the clusters in all subspaces with high quality, and the efficiency of MOSCL outperforms PROCLUS.

2018 ◽  
Vol 30 (6) ◽  
pp. 1624-1646 ◽  
Author(s):  
Qidong Liu ◽  
Ruisheng Zhang ◽  
Zhili Zhao ◽  
Zhenghai Wang ◽  
Mengyao Jiao ◽  
...  

Minimax similarity stresses the connectedness of points via mediating elements rather than favoring high mutual similarity. The grouping principle yields superior clustering results when mining arbitrarily-shaped clusters in data. However, it is not robust against noises and outliers in the data. There are two main problems with the grouping principle: first, a single object that is far away from all other objects defines a separate cluster, and second, two connected clusters would be regarded as two parts of one cluster. In order to solve such problems, we propose robust minimum spanning tree (MST)-based clustering algorithm in this letter. First, we separate the connected objects by applying a density-based coarsening phase, resulting in a low-rank matrix in which the element denotes the supernode by combining a set of nodes. Then a greedy method is presented to partition those supernodes through working on the low-rank matrix. Instead of removing the longest edges from MST, our algorithm groups the data set based on the minimax similarity. Finally, the assignment of all data points can be achieved through their corresponding supernodes. Experimental results on many synthetic and real-world data sets show that our algorithm consistently outperforms compared clustering algorithms.


Author(s):  
Parul Agarwal ◽  
Shikha Mehta

Subspace clustering approaches cluster high dimensional data in different subspaces. It means grouping the data with different relevant subsets of dimensions. This technique has become very effective as a distance measure becomes ineffective in a high dimensional space. This chapter presents a novel evolutionary approach to a bottom up subspace clustering SUBSPACE_DE which is scalable to high dimensional data. SUBSPACE_DE uses a self-adaptive DBSCAN algorithm to perform clustering in data instances of each attribute and maximal subspaces. Self-adaptive DBSCAN clustering algorithms accept input from differential evolution algorithms. The proposed SUBSPACE_DE algorithm is tested on 14 datasets, both real and synthetic. It is compared with 11 existing subspace clustering algorithms. Evaluation metrics such as F1_Measure and accuracy are used. Performance analysis of the proposed algorithms is considerably better on a success rate ratio ranking in both accuracy and F1_Measure. SUBSPACE_DE also has potential scalability on high dimensional datasets.


2020 ◽  
Vol 10 (1) ◽  
Author(s):  
Michele Allegra ◽  
Elena Facco ◽  
Francesco Denti ◽  
Alessandro Laio ◽  
Antonietta Mira

Abstract One of the founding paradigms of machine learning is that a small number of variables is often sufficient to describe high-dimensional data. The minimum number of variables required is called the intrinsic dimension (ID) of the data. Contrary to common intuition, there are cases where the ID varies within the same data set. This fact has been highlighted in technical discussions, but seldom exploited to analyze large data sets and obtain insight into their structure. Here we develop a robust approach to discriminate regions with different local IDs and segment the points accordingly. Our approach is computationally efficient and can be proficiently used even on large data sets. We find that many real-world data sets contain regions with widely heterogeneous dimensions. These regions host points differing in core properties: folded versus unfolded configurations in a protein molecular dynamics trajectory, active versus non-active regions in brain imaging data, and firms with different financial risk in company balance sheets. A simple topological feature, the local ID, is thus sufficient to achieve an unsupervised segmentation of high-dimensional data, complementary to the one given by clustering algorithms.


Author(s):  
UREERAT WATTANACHON ◽  
CHIDCHANOK LURSINSAP

Existing clustering algorithms, such as single-link clustering, k-means, CURE, and CSM are designed to find clusters based on predefined parameters specified by users. These algorithms may be unsuccessful if the choice of parameters is inappropriate with respect to the data set being clustered. Most of these algorithms work very well for compact and hyper-spherical clusters. In this paper, a new hybrid clustering algorithm called Self-Partition and Self-Merging (SPSM) is proposed. The SPSM algorithm partitions the input data set into several subclusters in the first phase and, then, removes the noisy data in the second phase. In the third phase, the normal subclusters are continuously merged to form the larger clusters based on the inter-cluster distance and intra-cluster distance criteria. From the experimental results, the SPSM algorithm is very efficient to handle the noisy data set, and to cluster the data sets of arbitrary shapes of different density. Several examples for color image show the versatility of the proposed method and compare with results described in the literature for the same images. The computational complexity of the SPSM algorithm is O(N2), where N is the number of data points.


Author(s):  
Liping Jing ◽  
Michael K. Ng ◽  
Joshua Zhexue Huang

High dimensional data is a phenomenon in real-world data mining applications. Text data is a typical example. In text mining, a text document is viewed as a vector of terms whose dimension is equal to the total number of unique terms in a data set, which is usually in thousands. High dimensional data occurs in business as well. In retails, for example, to effectively manage supplier relationship, suppliers are often categorized according to their business behaviors (Zhang, Huang, Qian, Xu, & Jing, 2006). The supplier’s behavior data is high dimensional, which contains thousands of attributes to describe the supplier’s behaviors, including product items, ordered amounts, order frequencies, product quality and so forth. One more example is DNA microarray data. Clustering high-dimensional data requires special treatment (Swanson, 1990; Jain, Murty, & Flynn, 1999; Cai, He, & Han, 2005; Kontaki, Papadopoulos & Manolopoulos., 2007), although various methods for clustering are available (Jain & Dubes, 1988). One type of clustering methods for high dimensional data is referred to as subspace clustering, aiming at finding clusters from subspaces instead of the entire data space. In a subspace clustering, each cluster is a set of objects identified by a subset of dimensions and different clusters are represented in different subsets of dimensions. Soft subspace clustering considers that different dimensions make different contributions to the identification of objects in a cluster. It represents the importance of a dimension as a weight that can be treated as the degree of the dimension in contribution to the cluster. Soft subspace clustering can find the cluster memberships of objects and identify the subspace of each cluster in the same clustering process.


2020 ◽  
pp. 1-12
Author(s):  
Xiaoguang Gao

The unbalanced development strategy makes the regional development unbalanced. Therefore, in the development process, resources must be effectively utilized according to the level and characteristics of each region. Considering the resource and environmental constraints, this paper measures and analyzes China’s green economic efficiency and green total factor productivity. Moreover, by expounding the characteristics of high-dimensional data, this paper points out the problems of traditional clustering algorithms in high-dimensional data clustering. This paper proposes a density peak clustering algorithm based on sampling and residual squares, which is suitable for high-dimensional large data sets. The algorithm finds abnormal points and boundary points by identifying halo points, and finally determines clusters. In addition, from the experimental comparison on the data set, it can be seen that the improved algorithm is better than the DPC algorithm in both time complexity and clustering results. Finally, this article analyzes data based on actual cases. The research results show that the method proposed in this paper is effective.


Author(s):  
Momotaz Begum ◽  
Bimal Chandra Das ◽  
Md. Zakir Hossain ◽  
Antu Saha ◽  
Khaleda Akther Papry

<p>Manipulating high-dimensional data is a major research challenge in the field of computer science in recent years. To classify this data, a lot of clustering algorithms have already been proposed. Kohonen self-organizing map (KSOM) is one of them. However, this algorithm has some drawbacks like overlapping clusters and non-linear separability problems. Therefore, in this paper, we propose an improved KSOM (I-KSOM) to reduce the problems that measures distances among objects using EISEN Cosine correlation formula. So far as we know, no previous work has used EISEN Cosine correlation distance measurements to classify high-dimensional data sets. To the robustness of the proposed KSOM, we carry out the experiments on several popular datasets like Iris, Seeds, Glass, Vertebral column, and Wisconsin breast cancer data sets. Our proposed algorithm shows better result compared to the existing original KSOM and another modified KSOM in terms of predictive performance with topographic and quantization error.</p>


2019 ◽  
Author(s):  
Justin L. Balsor ◽  
David G. Jones ◽  
Kathryn M. Murphy

AbstractNew techniques for quantifying large numbers of proteins or genes are transforming the study of plasticity mechanisms in visual cortex (V1) into the era of big data. With those changes comes the challenge of applying new analytical methods designed for high-dimensional data. Studies of V1, however, can take advantage of the known functions that many proteins have in regulating experience-dependent plasticity to facilitate linking big data analyses with neurobiological functions. Here we discuss two workflows and provide example R code for analyzing high-dimensional changes in a group of proteins (or genes) using two data sets. The first data set includes 7 neural proteins, 9 visual conditions, and 3 regions in V1 from an animal model for amblyopia. The second data set includes 23 neural proteins and 31 ages (20d-80yrs) from human post-mortem samples of V1. Each data set presents different challenges and we describe using PCA, tSNE, and various clustering algorithms including sparse high-dimensional clustering. Also, we describe a new approach for identifying high-dimensional features and using them to construct a plasticity phenotype that identifies neurobiological differences among clusters. We include an R package “v1hdexplorer” that aggregates the various coding packages and custom visualization scripts written in R Studio.


2018 ◽  
Vol 8 (2) ◽  
pp. 377-406
Author(s):  
Almog Lahav ◽  
Ronen Talmon ◽  
Yuval Kluger

Abstract A fundamental question in data analysis, machine learning and signal processing is how to compare between data points. The choice of the distance metric is specifically challenging for high-dimensional data sets, where the problem of meaningfulness is more prominent (e.g. the Euclidean distance between images). In this paper, we propose to exploit a property of high-dimensional data that is usually ignored, which is the structure stemming from the relationships between the coordinates. Specifically, we show that organizing similar coordinates in clusters can be exploited for the construction of the Mahalanobis distance between samples. When the observable samples are generated by a nonlinear transformation of hidden variables, the Mahalanobis distance allows the recovery of the Euclidean distances in the hidden space. We illustrate the advantage of our approach on a synthetic example where the discovery of clusters of correlated coordinates improves the estimation of the principal directions of the samples. Our method was applied to real data of gene expression for lung adenocarcinomas (lung cancer). By using the proposed metric we found a partition of subjects to risk groups with a good separation between their Kaplan–Meier survival plot.


Sign in / Sign up

Export Citation Format

Share Document