Subspace clustering using ensembles of K-subspaces

Author(s):  
John Lipor ◽  
David Hong ◽  
Yan Shuo Tan ◽  
Laura Balzano

Subspace clustering is the unsupervised grouping of points lying near a union of low-dimensional linear subspaces. Algorithms based directly on geometric properties of such data tend to either provide poor empirical performance, lack theoretical guarantees or depend heavily on their initialization. We present a novel geometric approach to the subspace clustering problem that leverages ensembles of the $K$-subspaces (KSS) algorithm via the evidence accumulation clustering framework. Our algorithm, referred to as ensemble $K$-subspaces (EKSS), forms a co-association matrix whose $(i,j)$th entry is the number of times points $i$ and $j$ are clustered together by several runs of KSS with random initializations. We prove general recovery guarantees for any algorithm that forms an affinity matrix with entries close to a monotonic transformation of pairwise absolute inner products. We then show that a specific instance of EKSS results in an affinity matrix with entries of this form, and hence our proposed algorithm can provably recover subspaces under similar conditions to state-of-the-art algorithms. The finding is, to the best of our knowledge, the first recovery guarantee for evidence accumulation clustering and for KSS variants. We show on synthetic data that our method performs well in the traditionally challenging settings of subspaces with large intersection, subspaces with small principal angles and noisy data. Finally, we evaluate our algorithm on six common benchmark datasets and show that, unlike existing methods, EKSS achieves excellent empirical performance for both small and large numbers of points per subspace.
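The co-association matrix described in this abstract can be illustrated with a minimal sketch. It assumes the base clusterings have already been produced (for example by repeated KSS runs, which are not implemented here); the function name `co_association` and the toy label lists are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def co_association(labelings: Sequence[Sequence[int]]) -> List[List[int]]:
    """Count how often each pair of points lands in the same cluster.

    `labelings` holds one label list per base clustering. Label values only
    need to be consistent within a single run, not across runs.
    """
    if not labelings:
        raise ValueError("need at least one base clustering")
    n = len(labelings[0])
    counts = [[0] * n for _ in range(n)]
    for labels in labelings:
        if len(labels) != n:
            raise ValueError("all clusterings must label the same points")
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return counts


# Three hypothetical base clusterings of four points.
runs = [
    [0, 0, 1, 1],
    [1, 1, 0, 0],  # same partition as the first run, labels swapped
    [0, 1, 1, 1],
]
A = co_association(runs)
```

Because only co-membership is counted, relabeling clusters between runs does not change the result. The resulting matrix could then be handed to any affinity-based clustering step, such as spectral clustering.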

2021 ◽  
Vol 12 (4) ◽  
pp. 1-25
Author(s):  
Stanley Ebhohimhen Abhadiomhen ◽  
Zhiyang Wang ◽  
Xiangjun Shen ◽  
Jianping Fan

Multi-view subspace clustering (MVSC) finds a shared structure in latent low-dimensional subspaces of multi-view data to enhance clustering performance. Nonetheless, we observe that most existing MVSC methods neglect the diversity in multi-view data by considering only the common knowledge to find a shared structure either directly or by merging different similarity matrices learned for each view. In the presence of noise, this predefined shared structure becomes a biased representation of the different views. Thus, in this article, we propose an MVSC method based on coupled low-rank representation to address the above limitation. Our method first obtains a low-rank representation for each view, constrained to be a linear combination of the view-specific representation and the shared representation, while simultaneously encouraging the sparsity of the view-specific one. Then, it uses the k-block diagonal regularizer to learn a manifold recovery matrix for each view through the respective low-rank matrices to recover more manifold structures from them. In this way, the proposed method can find an ideal similarity matrix by approximating clustering projection matrices obtained from the recovery structures. Hence, this similarity matrix denotes our clustering structure with exactly k connected components by applying a rank constraint on the similarity matrix’s relaxed Laplacian matrix to avoid spectral post-processing of the low-dimensional embedding matrix. The core of our idea is to introduce dynamic approximation into the low-rank representation, allowing the clustering structure and the shared representation to guide each other to learn cleaner low-rank matrices that lead to a better clustering structure. Therefore, our approach is notably different from existing methods in which the local manifold structure of data is captured in advance. Extensive experiments on six benchmark datasets show that our method outperforms 10 similar state-of-the-art methods on six evaluation metrics.


2017 ◽  
Vol 2017 ◽  
pp. 1-9 ◽  
Author(s):  
Binbin Zhang ◽  
Weiwei Wang ◽  
Xiangchu Feng

Subspace clustering aims to group data drawn from a union of subspaces according to the subspace from which each point was drawn. It has become a popular method for recovering the low-dimensional structure underlying high-dimensional datasets. The state-of-the-art methods construct an affinity matrix based on the self-representation of the dataset and then use a spectral clustering method to obtain the final clustering result. These methods show that the sparsity and grouping effect of the affinity matrix are important in recovering the low-dimensional structure. In this work, we propose a weighted sparse penalty and a weighted grouping effect penalty in modeling the self-representation of data points. The experimental results on the Extended Yale B, USPS, and Berkeley 500 image segmentation datasets show that the proposed model is more effective than state-of-the-art methods in revealing the subspace structure underlying high-dimensional datasets.
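As a rough illustration of the "self-representation, then affinity matrix" step mentioned above, the sketch below symmetrizes a coefficient matrix C into an affinity W = |C| + |C|^T. This symmetrization is a common convention in self-representation methods and is an assumption here; the abstract does not state which construction the authors use, and the weighted penalties they propose are not modeled. The function name and the toy coefficients are hypothetical.

```python
from typing import List


def affinity_from_coefficients(C: List[List[float]]) -> List[List[float]]:
    """Build a symmetric, non-negative affinity W[i][j] = |C[i][j]| + |C[j][i]|."""
    n = len(C)
    if any(len(row) != n for row in C):
        raise ValueError("coefficient matrix must be square")
    return [[abs(C[i][j]) + abs(C[j][i]) for j in range(n)] for i in range(n)]


# Hypothetical self-representation coefficients for three points
# (zero diagonal: no point is used to represent itself).
C = [
    [0.0, 0.8, 0.0],
    [0.5, 0.0, -0.1],
    [0.0, 0.2, 0.0],
]
W = affinity_from_coefficients(C)
```

A spectral clustering routine would then take W as its input graph; that step is omitted to keep the sketch dependency-free.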


2021 ◽  
pp. 1-15
Author(s):  
Zhixuan Xu ◽  
Caikou Chen ◽  
Guojiang Han ◽  
Jun Gao

As a successful improvement on Low Rank Representation (LRR), Latent Low Rank Representation (LatLRR) has been one of the state-of-the-art models for subspace clustering due to its capability of discovering the low-dimensional subspace structures of data, especially when the data samples are insufficient and/or extremely corrupted. However, the LatLRR method does not consider the nonlinear geometric structures within data, which leads to the loss of locality information among data in the learning phase. Moreover, the coefficients of the learnt representation matrix can be negative, which reduces interpretability. To address these drawbacks of LatLRR, this paper introduces Laplacian, sparsity and non-negativity constraints into the LatLRR model and proposes a novel subspace clustering method, termed latent low rank representation with non-negative, sparse and Laplacian constraints (NNSLLatLRR), which jointly takes into account the non-negativity, sparsity and Laplacian properties of the learnt representation. As a result, NNSLLatLRR can not only capture the global low-dimensional structure and intrinsic nonlinear geometric information of the data, but also enhance the interpretability of the learnt representation. Extensive experiments on two face benchmark datasets and a handwritten digit dataset show that our proposed method outperforms existing state-of-the-art subspace clustering methods.


2021 ◽  
Vol 15 ◽  
pp. 174830262110249
Author(s):  
Cong-Zhe You ◽  
Zhen-Qiu Shu ◽  
Hong-Hui Fan

Subspace clustering of multi-view data has recently become a research hotspot in artificial intelligence and machine learning. The goal is to divide data samples from different sources into different groups. In this paper, we propose a new subspace clustering method for multi-view data, termed Non-negative Sparse Laplacian regularized Latent Multi-view Subspace Clustering (NSL2MSC). The proposed method learns the latent space representation of multi-view data samples and performs data reconstruction in the latent space. The algorithm can cluster data in the latent representation space and exploit the relationships among different views. However, traditional representation-based methods do not consider the nonlinear geometry inside the data, and may lose local and similarity information between data points in the learning process. By using graph regularization, we can not only capture the global low-dimensional structural features of the data, but also fully capture its nonlinear geometric structure. The experimental results show that the proposed method is effective and that its performance is better than most of the existing alternatives.


Author(s):  
Antonis F. Lentzakis ◽  
Ravi Seshadri ◽  
Moshe Ben-Akiva

2020 ◽  
pp. 147387162097820
Author(s):  
Haili Zhang ◽  
Pu Wang ◽  
Xuejin Gao ◽  
Yongsheng Qi ◽  
Huihui Gao

T-distributed stochastic neighbor embedding (t-SNE) is an effective visualization method. However, it is non-parametric and cannot be applied to streaming data or online scenarios. Although kernel t-SNE provides an explicit projection from a high-dimensional data space to a low-dimensional feature space, some outliers are not well projected. In this paper, bi-kernel t-SNE is proposed for out-of-sample data visualization. Gaussian kernel matrices of the input and feature spaces are used to approximate the explicit projection. Then principal component analysis is applied to reduce the dimensionality of the feature kernel matrix. Thus, the difference between inliers and outliers is revealed, and any new sample can be well mapped. The performance of the proposed method for out-of-sample projection is tested on several benchmark datasets by comparing it with other state-of-the-art algorithms.


2008 ◽  
Vol 18 (04) ◽  
pp. 279-292 ◽  
Author(s):  
WITOLD PEDRYCZ ◽  
PARTAB RAI ◽  
JOZEF ZURADA

We develop a new approach to the design of neural networks, which utilizes a collaborative framework of knowledge-driven experience. In contrast to the "standard" way of developing neural networks, which explicitly exploits experimental data, this approach incorporates a mechanism of knowledge-driven experience. The essence of the proposed learning scheme is to take advantage of the parameters (connections) of neural networks built in the past for the same phenomenon (which might also exhibit some variability over time or space) for which we are interested in constructing a network on the basis of currently available data. We establish a conceptual and algorithmic framework to reconcile these two essential sources of information (data and knowledge) in the process of developing the network. To make the presentation more focused and arrive at a detailed quantification of the resulting architecture, we concentrate on the experience-based design of radial basis function neural networks (RBFNNs). We introduce several performance indexes to quantify the effect of utilizing the knowledge residing within the connections of the networks and establish an optimal level of their use. Experimental results are presented for low-dimensional synthetic data and selected datasets available at the Machine Learning Repository.


2017 ◽  
Vol 89 ◽  
pp. 67-72 ◽  
Author(s):  
Daming Shi ◽  
Jun Wang ◽  
Dansong Cheng ◽  
Junbin Gao

2020 ◽  
Author(s):  
Grigoriy Gogoshin ◽  
Sergio Branciamore ◽  
Andrei S. Rodin

Bayesian Network (BN) modeling is a prominent and increasingly popular computational systems biology method. It aims to construct probabilistic networks from the large heterogeneous biological datasets that reflect the underlying networks of biological relationships. Currently, a variety of strategies exist for evaluating BN methodology performance, ranging from utilizing artificial benchmark datasets and models, to specialized biological benchmark datasets, to simulation studies that generate synthetic data from predefined network models. The latter is arguably the most comprehensive approach; however, existing implementations are typically limited by their reliance on the SEM (structural equation modeling) framework, which includes many explicit and implicit assumptions that may be unrealistic in a typical biological data analysis scenario. In this study, we develop an alternative, purely probabilistic, simulation framework that more appropriately fits with real biological data and biological network models. In conjunction, we also expand on our current understanding of the theoretical notions of causality and dependence / conditional independence in BNs and the Markov Blankets within.


2013 ◽  
Vol 6 (3) ◽  
pp. 441-448 ◽  
Author(s):  
Sajid Nagi ◽  
Dhruba Kumar Bhattacharyya ◽  
Jugal K. Kalita

When clustering high-dimensional data, traditional clustering methods are found to be lacking since they consider all of the dimensions of the dataset in discovering clusters, whereas only some of the dimensions are relevant. This may give rise to subspaces within the dataset where clusters may be found. Using feature selection, we can remove irrelevant and redundant dimensions by analyzing the entire dataset. The problem of automatically identifying clusters that exist in multiple and possibly overlapping subspaces of high-dimensional data, allowing better clustering of the data points, is known as Subspace Clustering. There are two major approaches to subspace clustering based on search strategy. Top-down algorithms find an initial clustering in the full set of dimensions and evaluate the subspaces of each cluster, iteratively improving the results. Bottom-up approaches start by finding low-dimensional dense regions, and then use them to form clusters. Based on a survey of subspace clustering, we identify the challenges and issues involved with clustering gene expression data.
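The first step of a bottom-up approach, finding dense low-dimensional regions, can be sketched as follows. This is a simplified grid-based illustration in the spirit of that description, not an algorithm taken from the survey: it assumes coordinates scaled to [0, 1), an equal-width grid, and a fixed count threshold, and the names `dense_units_1d`, `n_bins` and `min_count` are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def dense_units_1d(
    points: Sequence[Tuple[float, ...]], n_bins: int, min_count: int
) -> Dict[int, List[int]]:
    """For each dimension, return the grid bins holding at least `min_count` points.

    Assumes every coordinate lies in [0, 1); bins have equal width 1 / n_bins.
    """
    n_dims = len(points[0])
    dense: Dict[int, List[int]] = {}
    for d in range(n_dims):
        # Clamp to the last bin so a coordinate of exactly 1.0 does not overflow.
        counts = Counter(min(int(p[d] * n_bins), n_bins - 1) for p in points)
        dense[d] = sorted(b for b, c in counts.items() if c >= min_count)
    return dense


# Toy data: four points agree on dimension 0 but are spread out on dimension 1.
points = [
    (0.10, 0.05), (0.12, 0.50), (0.15, 0.95), (0.18, 0.30),
    (0.90, 0.70),
]
dense = dense_units_1d(points, n_bins=5, min_count=3)
```

In this toy example only dimension 0 contains a dense bin, which marks it as the relevant dimension for that group of points. A full bottom-up method would go on to combine dense one-dimensional units into higher-dimensional candidate subspaces.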

