Spectral Nonlinearly Embedded Clustering Algorithm

2016 ◽  
Vol 2016 ◽  
pp. 1-9
Author(s):  
Mingming Liu ◽  
Bing Liu ◽  
Chen Zhang ◽  
Wei Sun

As is well known, traditional spectral clustering (SC) methods are developed based on the manifold assumption, namely, that two nearby data points in the high-density region of a low-dimensional data manifold share the same cluster label. For some high-dimensional and sparse data, however, this assumption may not hold, and the clustering performance of SC degrades sharply. To solve this problem, we propose a general spectral embedded framework that embeds the true cluster assignment matrix for high-dimensional data into a nonlinear space via a predefined embedding function. Based on this framework, several algorithms are presented using different embedding functions, which simultaneously learn the final cluster assignment matrix and a transformation into a low-dimensional space. More importantly, the proposed method naturally handles the out-of-sample extension problem. Experimental results on benchmark datasets demonstrate that the proposed method significantly outperforms existing clustering methods.
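For orientation, the framework above builds on the standard two-step spectral pipeline. Below is a minimal sketch of that baseline (k-NN affinity graph, normalized Laplacian embedding, k-means discretization); the parameter choices are illustrative, and this is the generic baseline rather than the paper's embedded model:

```python
# Minimal two-step spectral clustering baseline (illustrative only; the
# paper's framework replaces this with a nonlinear embedding of the
# cluster assignment matrix).
import numpy as np
from sklearn.neighbors import kneighbors_graph
from sklearn.cluster import KMeans

def spectral_clustering_baseline(X, n_clusters, n_neighbors=10):
    # Symmetric k-NN connectivity graph as the affinity matrix.
    A = kneighbors_graph(X, n_neighbors, mode="connectivity").toarray()
    A = np.maximum(A, A.T)
    # Normalized graph Laplacian L = I - D^{-1/2} A D^{-1/2}.
    d = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    L = np.eye(len(X)) - D_inv_sqrt @ A @ D_inv_sqrt
    # Spectral embedding: eigenvectors of the smallest eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(L)
    F = eigvecs[:, :n_clusters]
    # Discretize the relaxed solution with k-means.
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(F)
```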

2021 ◽  
Author(s):  
Feiyang Ren ◽  
Yi Han ◽  
Shaohan Wang ◽  
He Jiang

Abstract A novel marine transportation network, built from high-dimensional AIS data with a multi-level clustering algorithm, is proposed to discover important waypoints in trajectories based on selected navigation features. The network has two parts: calculation of major nodes with the CLIQUE and BIRCH clustering methods, and construction of the navigation network using edge construction theory. Unlike state-of-the-art work on navigation clustering, which uses ship coordinates only, the proposed method incorporates higher-dimensional features such as draught, weather, and fuel consumption. From historical AIS data, more than 220,133 records spanning 30 days were used to extract 440 major nodal points in less than 4 minutes on an ordinary PC (i5 processor). The proposed method can be applied to data of even higher dimensionality for better ship path planning or even national economic analysis. The current work has shown good performance in distinguishing complex ship trajectories and great potential for future shipping-market analytical predictions.
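The major-node step can be pictured with scikit-learn's Birch on AIS-like feature vectors. The sketch below is an assumption-laden illustration: the feature schema (lat, lon, draught, speed) and the threshold value are invented for the example, not taken from the paper:

```python
# Illustrative waypoint extraction with BIRCH on AIS-like records
# (coordinates plus extra navigation features). The schema and the
# threshold are assumptions for the example.
import numpy as np
from sklearn.cluster import Birch
from sklearn.preprocessing import StandardScaler

# Each row: [lat, lon, draught, speed_over_ground] -- hypothetical schema.
ais = np.random.rand(10000, 4) * [10.0, 10.0, 15.0, 20.0]

X = StandardScaler().fit_transform(ais)          # put features on one scale
birch = Birch(threshold=0.3, n_clusters=None)    # subcluster radius is data-dependent
labels = birch.fit_predict(X)

# Subcluster centroids serve as candidate major nodes of the network.
major_nodes = birch.subcluster_centers_
print(f"{len(major_nodes)} candidate waypoints from {len(ais)} AIS records")
```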


2019 ◽  
Vol 2019 ◽  
pp. 1-19
Author(s):  
Mingai Li ◽  
Hongwei Xi ◽  
Xiaoqing Zhu

Due to the nonlinear and high-dimensional characteristics of motor imagery electroencephalography (MI-EEG), achieving high online accuracy can be challenging. As a nonlinear dimension reduction method, landmark maximum variance unfolding (L-MVU) can fully retain the nonlinear features of MI-EEG. However, L-MVU still incurs considerable computation costs for out-of-sample data. An incremental version of L-MVU (denoted IL-MVU) is proposed in this paper. The low-dimensional representation of the training data is generated by L-MVU. For each out-of-sample data point, its nearest neighbors are found among the high-dimensional training samples, and the corresponding reconstruction weight matrix is calculated to generate its low-dimensional representation as well. IL-MVU is further combined with the dual-tree complex wavelet transform (DTCWT), yielding a hybrid feature extraction method (named IL-MD). IL-MVU extracts the nonlinear features of the specific subband signals, which are reconstructed by DTCWT and exhibit clear event-related synchronization/desynchronization phenomena. The average energy features of the α and β waves are calculated simultaneously. The two types of features are fused and evaluated with a linear discriminant analysis classifier. Extensive experiments were conducted on two public datasets with 12 subjects. The average 10-fold cross-validation recognition accuracies are 92.50% on Dataset 3b and 88.13% on Dataset 2b, improvements of at least 1.43% and 3.45%, respectively, over existing methods. The experimental results show that IL-MD extracts more accurate features at relatively lower computational cost, offers better feature visualization, and adapts to individual subjects. The t-test results and Kappa values suggest that the proposed feature extraction method reaches statistical significance and has high classification consistency.
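The out-of-sample step described above resembles the standard locally linear reconstruction: solve for weights that rebuild the new sample from its high-dimensional neighbors, then apply the same weights to the neighbors' low-dimensional coordinates. A minimal sketch under that assumption (k and the regularizer are illustrative):

```python
# Sketch of an incremental (out-of-sample) mapping: reconstruct a new
# sample from its k nearest training neighbors in the high-dimensional
# space, then reuse the same weights on the training set's low-dimensional
# coordinates (here, those produced by L-MVU). Parameters are illustrative.
import numpy as np

def out_of_sample_embed(x_new, X_train, Y_train, k=8, reg=1e-3):
    # k nearest neighbors of x_new among the training samples.
    idx = np.argsort(np.linalg.norm(X_train - x_new, axis=1))[:k]
    Z = X_train[idx] - x_new                    # centered neighborhood
    G = Z @ Z.T                                 # local Gram matrix
    G += reg * np.trace(G) * np.eye(k)          # regularize for stability
    w = np.linalg.solve(G, np.ones(k))
    w /= w.sum()                                # reconstruction weights sum to 1
    # Apply the same weights to the low-dimensional training coordinates.
    return w @ Y_train[idx]
```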


2020 ◽  
pp. 147387162097820
Author(s):  
Haili Zhang ◽  
Pu Wang ◽  
Xuejin Gao ◽  
Yongsheng Qi ◽  
Huihui Gao

T-distributed stochastic neighbor embedding (t-SNE) is an effective visualization method. However, it is non-parametric and cannot be applied to streaming data or online scenarios. Although kernel t-SNE provides an explicit projection from a high-dimensional data space to a low-dimensional feature space, some outliers are not well projected. In this paper, bi-kernel t-SNE is proposed for out-of-sample data visualization. Gaussian kernel matrices of the input and feature spaces are used to approximate the explicit projection, and principal component analysis is then applied to reduce the dimensionality of the feature kernel matrix. This reveals the difference between inliers and outliers, and any new sample can be well mapped. The performance of the proposed method for out-of-sample projection is tested on several benchmark datasets against other state-of-the-art algorithms.
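An explicit kernel-based projection of the kind kernel t-SNE uses can be sketched as ridge regression from the input-space Gaussian kernel onto precomputed t-SNE coordinates. This shows only the single-kernel idea; the bi-kernel variant's feature-space kernel and PCA step are not reproduced here, and the bandwidth and ridge values are assumptions:

```python
# Sketch of a kernelized out-of-sample projection in the spirit of kernel
# t-SNE: learn coefficients that map the input Gaussian kernel onto the
# precomputed t-SNE coordinates. sigma and ridge are illustrative choices.
import numpy as np
from scipy.spatial.distance import cdist

def fit_kernel_projection(X_train, Y_tsne, sigma=1.0, ridge=1e-3):
    # Gaussian kernel between training points.
    K = np.exp(-cdist(X_train, X_train, "sqeuclidean") / (2 * sigma**2))
    # Ridge solution A of (K + ridge*I) A = Y_tsne.
    A = np.linalg.solve(K + ridge * np.eye(len(X_train)), Y_tsne)

    def project(X_new):
        # Kernel between new samples and training points, mapped through A.
        K_new = np.exp(-cdist(X_new, X_train, "sqeuclidean") / (2 * sigma**2))
        return K_new @ A

    return project
```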


2013 ◽  
Vol 6 (3) ◽  
pp. 441-448 ◽  
Author(s):  
Sajid Nagi ◽  
Dhruba Kumar Bhattacharyya ◽  
Jugal K. Kalita

When clustering high-dimensional data, traditional clustering methods fall short because they consider all dimensions of the dataset when discovering clusters, whereas only some of the dimensions may be relevant. This gives rise to subspaces within the dataset in which clusters may be found. Using feature selection, irrelevant and redundant dimensions can be removed by analyzing the entire dataset. The problem of automatically identifying clusters that exist in multiple, possibly overlapping subspaces of high-dimensional data, allowing better clustering of the data points, is known as subspace clustering. There are two major approaches to subspace clustering, distinguished by search strategy. Top-down algorithms find an initial clustering in the full set of dimensions and evaluate the subspaces of each cluster, iteratively improving the results. Bottom-up approaches start by finding low-dimensional dense regions and then use them to form clusters. Based on a survey of subspace clustering, we identify the challenges and issues involved in clustering gene expression data.
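As a toy illustration of the bottom-up strategy, the first pass of a CLIQUE-style algorithm flags dense one-dimensional units on an equal-width grid; the bin count and density threshold below are arbitrary choices, not parameters of any surveyed method:

```python
# Toy bottom-up step: partition each dimension into equal-width bins and
# keep the "dense units" whose point count exceeds a threshold. CLIQUE-style
# algorithms then join dense units across dimensions to form subspace
# clusters. Bin count and threshold are illustrative.
import numpy as np

def dense_units_1d(X, n_bins=10, min_points=20):
    dense = []
    for d in range(X.shape[1]):
        counts, edges = np.histogram(X[:, d], bins=n_bins)
        for b in np.nonzero(counts >= min_points)[0]:
            dense.append((d, edges[b], edges[b + 1]))  # (dimension, lo, hi)
    return dense
```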


2021 ◽  
Vol 2021 (026) ◽  
pp. 1-52
Author(s):  
Dong Hwan Oh ◽  
Andrew J. Patton

This paper proposes a dynamic multi-factor copula for use in high-dimensional time series applications. A novel feature of our model is that the assignment of individual variables to groups is estimated from the data rather than pre-assigned using SIC industry codes, market capitalization ranks, or other ad hoc methods. We adapt the k-means clustering algorithm for use in our application and show that it has excellent finite-sample properties. Applying the new model to returns on 110 US equities, we find around 20 clusters to be optimal. In out-of-sample forecasts, we find that a model with as few as five estimated clusters significantly outperforms an otherwise identical model with 21 clusters formed using two-digit SIC codes.
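The grouping idea can be pictured as ordinary k-means on per-series summary features. The sketch below is purely illustrative: the features, synthetic data, and k are assumptions, and the paper itself estimates group assignments within the copula model rather than from such summaries:

```python
# Illustrative grouping step: cluster equities by k-means on simple
# per-series summary statistics (volatility and higher-moment proxies).
# Features, data, and k are assumptions made for this example only.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

returns = np.random.randn(1000, 110)  # T x N matrix of daily returns (synthetic)

# One feature row per equity.
centered = returns - returns.mean(axis=0)
feats = np.column_stack([
    returns.std(axis=0),          # volatility
    (centered ** 3).mean(axis=0), # skewness proxy
    (centered ** 4).mean(axis=0), # kurtosis proxy
])
labels = KMeans(n_clusters=20, n_init=10).fit_predict(
    StandardScaler().fit_transform(feats))
```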


Clustering is a data mining task devoted to the automatic grouping of data based on mutual similarity. Clustering in high-dimensional spaces is a recurrent problem in many domains, affecting the time complexity, space complexity, scalability, and accuracy of clustering methods. High-dimensional non-linear data usually live in different low-dimensional subspaces hidden in the original space. As high-dimensional objects appear almost alike, new approaches to clustering are required. This research focuses on developing mathematical models, techniques, and clustering algorithms specifically for high-dimensional data. With the immense growth in the fields of communication and technology, there has been tremendous growth in high-dimensional data spaces. As the number of dimensions of high-dimensional non-linear data increases, many clustering techniques begin to suffer from the curse of dimensionality, degrading the quality of the results: the data become very sparse and distance measures become increasingly meaningless. The principal challenge in clustering high-dimensional data is to overcome this “curse of dimensionality”. This research work concentrates on devising an enhanced algorithm for clustering high-dimensional non-linear data.
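The claim that distances lose meaning can be checked numerically: as dimensionality grows, the relative gap between the nearest and farthest neighbor distances shrinks. A short demonstration on uniform random data:

```python
# Numerical illustration of distance concentration: the relative contrast
# (d_max - d_min) / d_min between pairwise distances shrinks as the
# dimensionality grows, which is what makes distance-based clustering hard.
import numpy as np

rng = np.random.default_rng(0)
for dim in (2, 10, 100, 1000):
    X = rng.random((500, dim))
    d = np.linalg.norm(X - X[0], axis=1)[1:]   # distances from one point
    print(f"dim={dim:5d}  relative contrast={(d.max() - d.min()) / d.min():.3f}")
```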


Author(s):  
Carlotta Domeniconi

In an effort to achieve improved classifier accuracy, extensive research has been conducted on classifier ensembles. More recently, cluster ensembles have emerged. It is well known that off-the-shelf clustering methods may discover different structures in a given set of data, because each clustering algorithm has its own bias resulting from the optimization of different criteria. Furthermore, there is no ground truth against which a clustering result can be validated, so no cross-validation technique can be carried out to tune the input parameters involved in the clustering process. As a consequence, the user has no guidelines for choosing the proper clustering method for a given dataset. Cluster ensembles offer a solution to these challenges, which are inherent to clustering's ill-posed nature: they provide more robust and stable solutions by leveraging the consensus across multiple clustering results, while averaging out spurious structures arising from the various biases to which each participating algorithm is tuned.

In this chapter, we discuss the problem of combining multiple weighted clusters discovered by a locally adaptive algorithm (Domeniconi, Papadopoulos, Gunopulos, & Ma, 2004), which detects clusters in different subspaces of the input space. We believe that our approach is the first attempt to design a cluster ensemble for subspace clustering (Al-Razgan & Domeniconi, 2006). Recently, several subspace clustering methods have been proposed (Parsons, Haque, & Liu, 2004). They all attempt to dodge the curse of dimensionality, which affects any algorithm in high-dimensional spaces. In high-dimensional spaces, it is highly likely that, for any given pair of points within the same cluster, there exist at least a few dimensions on which the points are far apart. As a consequence, distance functions that use all input features equally may not be effective. Furthermore, several clusters may exist in different subspaces comprised of different combinations of features. In many real-world problems, some points are correlated with respect to one set of dimensions, while others are correlated with respect to different dimensions, and each dimension could be relevant to at least one of the clusters. Global dimensionality reduction techniques are unable to capture such local correlations, so a proper feature selection procedure should operate locally in the input space. Local feature selection allows one to embed different distance measures in different regions of the input space; such metrics reflect local correlations of the data.

In (Domeniconi, Papadopoulos, Gunopulos, & Ma, 2004) we proposed a soft feature selection procedure (called LAC) that assigns weights to features according to the local correlations of data along each dimension. Dimensions along which data are loosely correlated receive a small weight, which has the effect of elongating distances along that dimension; features along which data are strongly correlated receive a large weight, which has the effect of constricting distances along that dimension. The learned weights thus perform a directional local reshaping of distances, allowing better separation of clusters and the discovery of different patterns in different subspaces of the original input space.
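The LAC weighting scheme lends itself to a compact sketch: per-cluster, per-dimension spread around the centroid is turned into an exponential weight, so tightly correlated dimensions dominate the local distance. The exponential form and the parameter h follow the common formulation of such schemes and are assumptions here, not a reproduction of the chapter's algorithm:

```python
# Sketch of LAC-style local feature weighting: for one cluster, compute the
# average squared deviation from the centroid along every dimension, and turn
# it into an exponential weight -- small spread along a dimension yields a
# large weight. h controls the weighting strength (an assumption here).
import numpy as np

def lac_weights(X_cluster, centroid, h=1.0):
    spread = ((X_cluster - centroid) ** 2).mean(axis=0)  # per-dimension spread
    w = np.exp(-spread / h)
    return w / np.linalg.norm(w)                          # normalize the weight vector
```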


2013 ◽  
Vol 2013 ◽  
pp. 1-9 ◽  
Author(s):  
JingDong Tan ◽  
RuJing Wang

Shared nearest neighbor (SNN) is a novel similarity measure that can overcome two difficulties: low similarities between samples and differing densities of classes. At present, there are two popular SNN-similarity-based clustering methods: JP clustering and SNN density-based clustering. Their clustering results rely heavily on the weight of a single edge, which makes them fragile. Motivated by the idea of smooth splicing in computational geometry, the authors design a novel SNN-similarity-based clustering algorithm within the framework of graph theory. Since it inherits the complementary intensity-smoothness principle, its generalization ability surpasses that of the two aforementioned methods. Experiments on text datasets show its effectiveness.
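SNN similarity itself is straightforward to compute: the similarity of two points is the number of neighbors shared by their k-nearest-neighbor lists. A minimal sketch (k is an illustrative choice):

```python
# Minimal shared-nearest-neighbor (SNN) similarity: the similarity between
# two points is the number of neighbors their k-NN lists share.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def snn_similarity(X, k=10):
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    idx = nn.kneighbors(X, return_distance=False)[:, 1:]  # drop self
    n = len(X)
    member = np.zeros((n, n), dtype=bool)
    member[np.arange(n)[:, None], idx] = True             # row i's neighbor set
    M = member.astype(int)
    return M @ M.T                                        # shared-neighbor counts
```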


2021 ◽  
Vol 7 ◽  
pp. e450
Author(s):  
Wenna Huang ◽  
Yong Peng ◽  
Yuan Ge ◽  
Wanzeng Kong

Kmeans clustering and spectral clustering are two popular methods for grouping similar data points according to their similarities. However, the performance of Kmeans clustering can be quite unstable due to the random initialization of the cluster centroids. Spectral clustering methods generally employ a two-step strategy of spectral embedding and discretization postprocessing to obtain the cluster assignment, which easily leads to large deviation from the true discrete solution during postprocessing. In this paper, based on the connection between Kmeans clustering and spectral clustering, we propose a new Kmeans formulation via joint spectral embedding and spectral rotation, an effective postprocessing approach to performing the discretization, termed KMSR. Further, instead of directly using the dot-product data similarity measure, we generalize KMSR by incorporating more advanced data similarity measures and call this generalized model KMSR-G. An efficient optimization method is derived to solve the KMSR (KMSR-G) objective, whose complexity and convergence analyses are provided. We conduct experiments on extensive benchmark datasets to validate the proposed models; the experimental results demonstrate that our models perform better than related methods in most cases.
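Spectral rotation, the postprocessing step that KMSR fuses into its joint objective, alternates between a row-wise discrete assignment and an orthogonal Procrustes update of the rotation. A compact sketch of that standalone step (not the KMSR model itself; the iteration count is illustrative):

```python
# Sketch of spectral-rotation discretization: given a relaxed spectral
# embedding F (n x c), alternately assign each row of F @ R to its largest
# entry (a binary indicator Y) and re-fit the rotation R by orthogonal
# Procrustes. This is the classic postprocessing idea, shown in isolation.
import numpy as np

def spectral_rotation(F, n_iter=30):
    n, c = F.shape
    R = np.eye(c)
    for _ in range(n_iter):
        Y = np.zeros((n, c))
        Y[np.arange(n), (F @ R).argmax(axis=1)] = 1.0   # discrete assignment
        U, _, Vt = np.linalg.svd(F.T @ Y)               # Procrustes update of R
        R = U @ Vt
    return Y.argmax(axis=1)                             # cluster labels
```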


2021 ◽  
pp. 1-15
Author(s):  
Zhixuan Xu ◽  
Caikou Chen ◽  
Guojiang Han ◽  
Jun Gao

As a successful improvement on Low Rank Representation (LRR), Latent Low Rank Representation (LatLRR) has been one of the state-of-the-art models for subspace clustering, owing to its ability to discover the low-dimensional subspace structures of data, especially when the data samples are insufficient and/or extremely corrupted. However, LatLRR does not consider the nonlinear geometric structures within data, which leads to the loss of locality information among data during learning. Moreover, the coefficients of the learnt representation matrix can be negative, which lacks interpretability. To address these drawbacks, this paper introduces Laplacian, sparsity, and non-negativity constraints into the LatLRR model and proposes a novel subspace clustering method, termed latent low rank representation with non-negative, sparse and Laplacian constraints (NNSLLatLRR), which jointly accounts for the non-negativity, sparsity, and Laplacian properties of the learnt representation. As a result, NNSLLatLRR can not only capture the global low-dimensional structure and intrinsic non-linear geometric information of the data, but also enhance the interpretability of the learnt representation. Extensive experiments on two face benchmark datasets and a handwritten digit dataset show that our proposed method outperforms existing state-of-the-art subspace clustering methods.
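The Laplacian constraint rests on a k-NN affinity graph over the samples. Below is a sketch of that ingredient alone: heat-kernel weights W and the graph Laplacian L = D - W, which such models use to penalize representations that vary across nearby samples (via a term like tr(Z L Zᵀ)). k and the kernel width are illustrative assumptions:

```python
# Sketch of the manifold (Laplacian) ingredient: build a k-NN heat-kernel
# affinity W over the samples and form the graph Laplacian L = D - W.
# k and sigma are illustrative choices, not the paper's settings.
import numpy as np
from sklearn.neighbors import kneighbors_graph

def graph_laplacian(X, k=5, sigma=1.0):
    W = kneighbors_graph(X, k, mode="distance").toarray()
    W[W > 0] = np.exp(-(W[W > 0] ** 2) / (2 * sigma**2))  # heat-kernel weights
    W = np.maximum(W, W.T)                                 # symmetrize the graph
    return np.diag(W.sum(axis=1)) - W                      # L = D - W
```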

