Large-Scale Spectral Clustering Based on Representative Points

Spectral clustering (SC) has attracted increasing attention due to its effectiveness in machine learning. However, most traditional spectral clustering methods are still hard to apply to large-scale problems, mainly because of their high computational complexity of O(n^3), where n is the number of samples. To achieve fast spectral clustering, we propose a novel approach, called representative point-based spectral clustering (RPSC), to deal efficiently with the large-scale spectral clustering problem. The proposed method first generates two layers of representative points successively by BKHK (balanced k-means-based hierarchical k-means). It then constructs a hierarchical bipartite graph and performs spectral analysis on the graph. Specifically, we construct the similarity matrix using the parameter-free neighbor assignment method, which avoids the need to tune extra parameters. Furthermore, we perform coclustering on the final similarity matrix. The coclustering mechanism takes advantage of the cooccurring cluster structure among the representative points and the original data to strengthen the clustering performance. As a result, the computational complexity can be significantly reduced and the clustering accuracy can be improved. Extensive experiments on several large-scale data sets show the effectiveness, efficiency, and stability of the proposed method.
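The abstract does not spell out the neighbor assignment rule. As a minimal sketch, the Python below uses a closed-form rule known from the adaptive-neighbors literature, in which each point is linked to its k nearest representative points with weights that need no kernel bandwidth. Whether RPSC uses exactly this rule is an assumption, and the names `neighbor_weights` and `bipartite_similarity` are hypothetical.

```python
def neighbor_weights(point, anchors, k):
    """Weights from one point to its k nearest anchors; the weights sum to 1.

    Assumed rule (not confirmed by the abstract):
    w_j = (d_{k+1} - d_j) / (k * d_{k+1} - sum of the k smallest d),
    where d are squared distances sorted ascending. Requires k < len(anchors).
    """
    # Squared Euclidean distance to every anchor, sorted ascending.
    dists = sorted(
        (sum((p - a) ** 2 for p, a in zip(point, anchor)), j)
        for j, anchor in enumerate(anchors)
    )
    d_next = dists[k][0]  # distance to the (k+1)-th nearest anchor
    nearest = dists[:k]
    denom = k * d_next - sum(d for d, _ in nearest)
    weights = [0.0] * len(anchors)
    for d, j in nearest:
        # Degenerate case: all k+1 distances tie, so fall back to uniform weights.
        weights[j] = (d_next - d) / denom if denom > 0 else 1.0 / k
    return weights


def bipartite_similarity(points, anchors, k):
    """n x m similarity matrix between data points and representative points."""
    return [neighbor_weights(p, anchors, k) for p in points]


# Toy data: two well-separated groups, two representative points per group.
points = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.8)]
anchors = [(0.0, 0.1), (0.3, 0.0), (5.0, 5.0), (4.9, 5.2)]
Z = bipartite_similarity(points, anchors, k=2)
```

Each row of `Z` has at most k non-zero entries, so the bipartite graph stays sparse even when the number of samples is large.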

A Novel Unsupervised Classification Method for Sandy Land Using Fully Polarimetric SAR Data

Remote Sensing ◽

10.3390/rs13030355 ◽

2021 ◽

Vol 13 (3) ◽

pp. 355

Author(s):

Weixian Tan ◽

Borong Sun ◽

Chenyu Xiao ◽

Pingping Huang ◽

Wei Xu ◽

...

Keyword(s):

Spectral Clustering ◽

Large Scale ◽

Clustering Algorithm ◽

Feature Vector ◽

Unsupervised Classification ◽

Classification Method ◽

Sandy Land ◽

Classification Methods ◽

The Many ◽

Representative Points

Classification based on polarimetric synthetic aperture radar (PolSAR) images is an emerging technology, and recent years have seen the introduction of various classification methods that have proven effective at identifying typical features of many terrain types. Among the many study regions, the Hunshandake Sandy Land in Inner Mongolia, China stands out for its vast area of sandy land, variety of ground objects, and intricate structure, with more irregular characteristics than conventional land cover. To account for the particular surface features of the Hunshandake Sandy Land, this study proposes an unsupervised classification method based on a new decomposition and large-scale spectral clustering with superpixels (ND-LSC). First, the polarization scattering parameters are extracted through a new decomposition rather than other decomposition approaches, which gives a more accurate estimate of the feature vector. Second, large-scale spectral clustering is applied to cope with the massive area and complex terrain. More specifically, superpixels are first generated via the Adaptive Simple Linear Iterative Clustering (ASLIC) algorithm, with the feature vector combined with spatial coordinate information as input; representative points are then selected and a bipartite graph is formed; finally, the spectral clustering algorithm completes the classification task. Testing and analysis are conducted on the RADARSAT-2 fully polarimetric SAR dataset acquired over the Hunshandake Sandy Land in 2016. Qualitative and quantitative comparisons with several classification methods show that the proposed method can significantly improve classification performance.

PARALLEL SPATIOTEMPORAL SPECTRAL CLUSTERING WITH MASSIVE TRAJECTORY DATA

ISPRS - International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences ◽

10.5194/isprs-archives-xlii-2-w7-1173-2017 ◽

2017 ◽

Vol XLII-2/W7 ◽

pp. 1173-1180 ◽

Cited By ~ 1

Author(s):

Y. Z. Gu ◽

K. Qin ◽

Y. X. Chen ◽

M. X. Yue ◽

T. Guo

Keyword(s):

Data Mining ◽

Computational Complexity ◽

Spectral Clustering ◽

Large Scale ◽

Trajectory Data ◽

Wuhan City ◽

Large Scale Problems ◽

Trajectory Data Mining ◽

Taxi Trajectory ◽

High Computational Complexity

Massive trajectory data contains a wealth of useful information and knowledge. Spectral clustering, which has been shown to be effective in finding clusters, has become an important clustering approach in trajectory data mining. However, traditional spectral clustering lacks a temporal extension of the algorithm and is limited in its applicability to large-scale problems due to its high computational complexity. This paper presents a parallel spatiotemporal spectral clustering method based on multiple acceleration solutions to make the algorithm more effective and efficient. Its performance is demonstrated by an experiment carried out on a massive taxi trajectory dataset from Wuhan city, China.

Social Network Optimization for Cluster Ensemble Selection

Fundamenta Informaticae ◽

10.3233/fi-2020-1964 ◽

2020 ◽

Vol 176 (1) ◽

pp. 79-102

Author(s):

Chenyue Zhao ◽

Hosein Alizadeh ◽

Behrouz Minaei ◽

Majid Mohamadpoor ◽

Hamid Parvin ◽

...

Keyword(s):

Large Scale ◽

Cluster Structure ◽

Ensemble Methods ◽

Quadratic Program ◽

Maximization Problem ◽

Similarity Matrix ◽

Cluster Ensemble ◽

Ensemble Selection ◽

Consensus Functions ◽

Consensus Partition

This paper studies the cluster ensemble selection problem for unsupervised learning. Given a large ensemble of clustering solutions, our goal is to select a subset of solutions that forms a smaller yet better-performing cluster ensemble than the full set of available solutions. The common way of aggregating the chosen solutions is to accumulate the information of the selected results into a similarity matrix. This paper suggests transforming the similarity matrix into a modularity matrix and then applying a new consensus function that optimizes the modularity measure on it. We represent the modularity maximization problem as a 0-1 quadratic program, which can be solved exactly for small datasets. We also establish a new greedy algorithm, namely sum linkage, to optimize the objective function in a very short time, especially for large-scale datasets. We show that the proposed consensus partition gets much closer to the actual cluster structure than the partitions obtained from the direct application of common cluster ensemble methods. The promising results compared with the most cited consensus functions show the excellent efficiency of the proposed method.
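The abstract does not define its modularity matrix. A minimal Python sketch, assuming the standard Newman definition B_ij = S_ij - k_i k_j / (2m) applied to a symmetric similarity matrix, is given below. The function names are hypothetical, and the paper's sum linkage algorithm is not reproduced here.

```python
def modularity_matrix(S):
    """B[i][j] = S[i][j] - k_i * k_j / (2m) for a symmetric similarity matrix S."""
    degrees = [sum(row) for row in S]
    total = sum(degrees)  # equals 2m for an undirected weighted graph
    n = len(S)
    return [
        [S[i][j] - degrees[i] * degrees[j] / total for j in range(n)]
        for i in range(n)
    ]


def modularity(S, labels):
    """Modularity Q of a partition given as one cluster label per item."""
    B = modularity_matrix(S)
    total = sum(sum(row) for row in S)
    n = len(S)
    within = sum(
        B[i][j] for i in range(n) for j in range(n) if labels[i] == labels[j]
    )
    return within / total


# Toy similarity matrix: two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
S = [[0.0] * 6 for _ in range(6)]
for i, j in edges:
    S[i][j] = S[j][i] = 1.0

q_split = modularity(S, [0, 0, 0, 1, 1, 1])  # one cluster per triangle
q_single = modularity(S, [0, 0, 0, 0, 0, 0])  # everything in one cluster
```

On this toy graph the two-triangle partition scores higher than the single-cluster partition, which is the kind of comparison a modularity-based consensus function relies on.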

Spectral Clustering Algorithm: MATLAB PCT-Based Parallel Design and Implementation

Advanced Materials Research ◽

10.4028/www.scientific.net/amr.765-767.580 ◽

2013 ◽

Vol 765-767 ◽

pp. 580-584

Author(s):

Yu Yang ◽

Cheng Gui Zhao

Keyword(s):

Spectral Clustering ◽

Large Scale ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Parallel Structure ◽

Computational Time ◽

Similarity Matrix ◽

Data Intensive ◽

Clustering Quality ◽

Spectral Clustering Algorithm

Spectral clustering algorithms inevitably face computational time and memory use problems on large-scale data, because they are both compute-intensive and data-intensive. We analyse the time complexity of constructing the similarity matrix, performing eigendecomposition, and running k-means, and exploit the SPMD parallel structure supported by the MATLAB Parallel Computing Toolbox (PCT) to decrease the eigendecomposition time. We propose using MATLAB Distributed Computing Server to construct the similarity matrix in parallel, whilst using a t-nearest-neighbors approach to reduce memory use. Finally, we report clustering time, clustering quality and clustering accuracy in the experiments.
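The memory saving from a t-nearest-neighbors similarity matrix can be illustrated with a short sketch. The paper's implementation is in MATLAB; the Python below is only an illustration, and the Gaussian kernel, the `sigma` parameter, and the function name `knn_similarity` are assumptions rather than details from the abstract.

```python
import math


def knn_similarity(points, t, sigma=1.0):
    """Sparse symmetric similarity stored as one dict per row.

    An edge (i, j) is kept if j is among the t nearest neighbors of i or
    i is among the t nearest neighbors of j. Edge weights use a Gaussian
    kernel, which is an assumption made for this sketch.
    """
    n = len(points)
    sparse = [dict() for _ in range(n)]
    for i in range(n):
        # Distances from point i to every other point, sorted ascending.
        dists = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        for d, j in dists[:t]:
            w = math.exp(-d * d / (2 * sigma ** 2))
            sparse[i][j] = w
            sparse[j][i] = w  # symmetrize
    return sparse


# Toy data: two tight groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
W = knn_similarity(points, t=2)
stored = sum(len(row) for row in W)
```

This sketch still computes every pairwise distance, so only the storage shrinks: roughly n*t entries are kept instead of n*n.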

Approximate spectral clustering using both reference vectors and topology of the network generated by growing neural gas

PeerJ Computer Science ◽

10.7717/peerj-cs.679 ◽

2021 ◽

Vol 7 ◽

pp. e679

Author(s):

Kazuhisa Fujita

Keyword(s):

Spectral Clustering ◽

Laplacian Matrix ◽

Computational Time ◽

Similarity Matrix ◽

Clustering Methods ◽

Growing Neural Gas ◽

Memory Space ◽

Neural Gas ◽

Clustering Quality ◽

Data Points

Spectral clustering (SC) is one of the most popular clustering methods and often outperforms traditional clustering methods. SC uses the eigenvectors of a Laplacian matrix calculated from a similarity matrix of a dataset. SC has serious drawbacks: the time complexity of computing the eigenvectors and the memory needed to store the similarity matrix both grow significantly with the size of the dataset. To address these issues, I develop a new approximate spectral clustering method that uses the network generated by growing neural gas (GNG), called ASC with GNG in this study. ASC with GNG uses not only reference vectors for vector quantization but also the topology of the network to extract the topological relationship between data points in a dataset. ASC with GNG calculates the similarity matrix from both the reference vectors and the topology of the network generated by GNG. Using the network generated from a dataset by GNG, ASC with GNG reduces the computational and space complexities and improves clustering quality. In this study, I demonstrate that ASC with GNG effectively reduces the computational time. Moreover, this study shows that the clustering performance of ASC with GNG is equal to or better than that of SC.
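How the eigenvectors of a Laplacian expose cluster structure can be seen on a tiny graph. The Python sketch below is not the ASC with GNG method: it assumes the unnormalized Laplacian L = D - W, a connected graph, and only two clusters, and it approximates the eigenvector of the second-smallest eigenvalue by shifted power iteration instead of a full eigendecomposition. The name `fiedler_vector` is hypothetical.

```python
def fiedler_vector(W, iters=500):
    """Approximate the eigenvector of L = D - W with the second-smallest eigenvalue.

    Assumes W is symmetric, non-negative, and describes a connected graph,
    and that the start vector is not orthogonal to the target eigenvector.
    """
    n = len(W)
    deg = [sum(row) for row in W]
    shift = 2 * max(deg)  # upper bound on the eigenvalues of L
    # Deterministic start vector with zero mean.
    x = [i - (n - 1) / 2 for i in range(n)]
    for _ in range(iters):
        # y = (shift * I - L) x, so the smallest eigenvalues of L dominate.
        y = [
            shift * x[i] - (deg[i] * x[i] - sum(W[i][j] * x[j] for j in range(n)))
            for i in range(n)
        ]
        mean = sum(y) / n  # project out the constant eigenvector of L
        y = [v - mean for v in y]
        norm = sum(v * v for v in y) ** 0.5
        x = [v / norm for v in y]
    return x


# Toy similarity matrix: two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
W = [[0.0] * 6 for _ in range(6)]
for i, j in edges:
    W[i][j] = W[j][i] = 1.0

v = fiedler_vector(W)
labels = [int(value > 0) for value in v]  # split the nodes by sign
```

Splitting by the sign of this vector separates the two triangles. General spectral clustering instead takes several eigenvectors and runs k-means on their rows, and the eigenvector computation is the step whose cost grows quickly with dataset size.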

Multi-View Spectral Clustering Based on Multi-Smooth Representation Fusion for Cancer Subtype Prediction

Frontiers in Genetics ◽

10.3389/fgene.2021.718915 ◽

2021 ◽

Vol 12 ◽

Author(s):

Jian Liu ◽

Shuguang Ge ◽

Yuhu Cheng ◽

Xuesong Wang

Keyword(s):

Spectral Clustering ◽

Cluster Structure ◽

Biological Data ◽

Data Matrix ◽

Omics Data ◽

Similarity Matrix ◽

Graph Learning ◽

Smooth Representation ◽

Cancer Subtypes ◽

Cancer Data

It is a vital task to design an integrated machine learning model to discover cancer subtypes and understand the heterogeneity of cancer based on multiple omics data. In recent years, some multi-view clustering algorithms have been proposed and applied to the prediction of cancer subtypes. Among them, multi-view clustering methods based on graph learning have attracted wide attention. These multi-view approaches usually have one or more of the following problems. Many multi-view algorithms use the original omics data matrix to construct the similarity matrix and ignore the learning of the similarity matrix. They separate the data clustering process from the graph learning process, so the clustering performance depends heavily on the predefined graph. In the process of graph fusion, these methods simply take the average of the affinity graphs of multiple views as the fused graph, so the rich heterogeneous information is not fully utilized. To solve the above problems, this paper proposes a Multi-view Spectral Clustering Based on Multi-smooth Representation Fusion (MRF-MSC) method. Firstly, MRF-MSC constructs a smooth representation for each data type, which can be viewed as a sample (patient) similarity matrix. The smooth representation can explicitly enhance the grouping effect. Secondly, MRF-MSC integrates the smooth representations of multiple omics data through graph fusion to form a similarity matrix containing all biological data information. In addition, MRF-MSC adaptively assigns weight factors to the smooth regularization representation of each omics data type by using the self-weighting method. Finally, MRF-MSC imposes a constrained Laplacian rank on the fused similarity matrix to get a better cluster structure. The resulting problem can be transformed into a spectral clustering problem, from which the clustering results are obtained. MRF-MSC unifies graph construction, graph fusion and spectral clustering under one framework, which can learn better data representations and high-quality graphs and thereby achieve a better clustering effect. In the experiments, MRF-MSC obtained good results on the TCGA cancer data sets.

Tracklet Self-Supervised Learning for Unsupervised Person Re-Identification

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i07.6921 ◽

2020 ◽

Vol 34 (07) ◽

pp. 12362-12369 ◽

Cited By ~ 1

Author(s):

Guile Wu ◽

Xiatian Zhu ◽

Shaogang Gong

Keyword(s):

Unsupervised Learning ◽

Supervised Learning ◽

Large Scale ◽

Domain Adaptation ◽

Cluster Structure ◽

Alternative Methods ◽

Image Clustering ◽

Clustering Methods ◽

Data Annotation ◽

Cross Domain

Existing unsupervised person re-identification (re-id) methods mainly focus on cross-domain adaptation or one-shot learning. Although they are more scalable than their supervised learning counterparts, relying on a relevant labelled source domain or on initialisation with one labelled tracklet per person still restricts their scalability in real-world deployments. To alleviate these problems, some recent studies develop unsupervised tracklet association and bottom-up image clustering methods, but they still rely on explicit camera annotation or merely utilise suboptimal global clustering. In this work, we formulate a novel tracklet self-supervised learning (TSSL) method, which is capable of capitalising directly on abundant unlabelled tracklet data to optimise a feature embedding space for both video and image unsupervised re-id. This is achieved by designing a comprehensive unsupervised learning objective that accounts for tracklet frame coherence, tracklet neighbourhood compactness, and tracklet cluster structure in a unified formulation. As a purely unsupervised re-id model, TSSL is end-to-end trainable in the absence of source data annotation, person identity labels, and camera prior knowledge. Extensive experiments demonstrate the superiority of TSSL over a wide variety of state-of-the-art alternative methods on four large-scale person re-id benchmarks, including Market-1501, DukeMTMC-ReID, MARS and DukeMTMC-VideoReID.

Survey of Clustering Methods for Large Scale Dataset

International Journal of Computer Sciences and Engineering ◽

10.26438/ijcse/v7i5.13381344 ◽

2019 ◽

Vol 7 (5) ◽

pp. 1338-1344

Author(s):

Anupama Jawale ◽

Ganesh Magar

Keyword(s):

Large Scale ◽

Clustering Methods ◽

Large Scale Dataset

A Weighted Kernel PCA Formulation with Out-of-Sample Extensions for Spectral Clustering Methods

The 2006 IEEE International Joint Conference on Neural Network Proceedings ◽

10.1109/ijcnn.2006.1716082 ◽

2006 ◽

Cited By ~ 4

Author(s):

C. Alzate ◽

J.A.K. Suykens

Keyword(s):

Spectral Clustering ◽

Kernel Pca ◽

Clustering Methods ◽

Out Of Sample ◽

Weighted Kernel

Discovering Distinct Patterns in Gene Expression Profiles

Journal of Integrative Bioinformatics ◽

10.1515/jib-2008-105 ◽

2008 ◽

Vol 5 (2) ◽

Cited By ~ 1

Author(s):

Li Teng ◽

Laiwan Chan

Keyword(s):

Gene Expression ◽

Large Scale ◽

Expression Profiles ◽

Expression Patterns ◽

Gene Expression Profiles ◽

Clustering Methods ◽

Gene Expressions ◽

Real Gene ◽

Large Scale Dataset ◽

Coexpressed Genes

Summary: Traditional analysis of gene expression profiles uses clustering to find groups of coexpressed genes that have similar expression patterns. However, clustering is time-consuming and can be difficult for very large-scale datasets. We propose the idea of Discovering Distinct Patterns (DDP) in gene expression profiles. Since the patterns shown by gene expressions reveal their regulatory mechanisms, it is important to find all the different patterns existing in the dataset when there is little prior knowledge. It is also a helpful start before taking on further analysis. We propose an algorithm for DDP that iteratively picks out pairs of gene expression patterns with the largest dissimilarities. This method can also be used as preprocessing to initialize centers for clustering methods such as K-means. Experiments on both synthetic and real gene expression datasets show that our method is very effective in finding distinct patterns that have gene functional significance, and that it is also efficient.
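The abstract gives no details of the DDP algorithm beyond iteratively picking the most dissimilar pairs. The Python sketch below is one possible brute-force reading of that idea, not the authors' method: it assumes Euclidean distance as the dissimilarity and disjoint pairs, and the function name `pick_distinct_pairs` is hypothetical.

```python
import math


def pick_distinct_pairs(profiles, n_pairs):
    """Greedily pick disjoint pairs of profiles with the largest Euclidean distance.

    Brute-force illustration: every round scans all remaining pairs.
    """
    remaining = set(range(len(profiles)))
    picked = []
    for _ in range(n_pairs):
        if len(remaining) < 2:
            break
        # Most dissimilar pair among the profiles not picked yet.
        i, j = max(
            ((a, b) for a in remaining for b in remaining if a < b),
            key=lambda ab: math.dist(profiles[ab[0]], profiles[ab[1]]),
        )
        picked.append((i, j))
        remaining -= {i, j}
    return picked


# Toy expression profiles (rows are genes, columns are conditions).
profiles = [
    (0.0, 0.0, 0.0),
    (10.0, 10.0, 10.0),
    (1.0, 0.0, 0.0),
    (0.0, 8.0, 0.0),
    (5.0, 5.0, 5.0),
]
pairs = pick_distinct_pairs(profiles, n_pairs=2)
centers = [profiles[i] for pair in pairs for i in pair]  # candidate K-means seeds
```

The profiles selected this way could serve as initial centers for K-means, in line with the preprocessing use mentioned in the abstract.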
