Deep Clustering with Self-supervision using Pairwise Data Similarities

Author(s):  
Mohammadreza Sadeghi ◽  
Narges Armanfard

Deep clustering incorporates embedding into clustering in order to find a lower-dimensional space suitable for clustering. Most existing methods try to group similar data points by simultaneously minimizing clustering and reconstruction losses with an autoencoder (AE). However, they all ignore the useful information available in pairwise data relationships. In this paper we propose a novel deep clustering framework with self-supervision using pairwise data similarities (DCSS). The proposed method consists of two successive phases. First, we propose a novel AE-based approach that aggregates similar data points near a common group center in the latent space of an AE. The AE's latent space is obtained by minimizing weighted reconstruction and centering losses, where the weights are defined by the similarity between data points and group centers. In the second phase, we map the AE's latent space, using a fully connected network MNet, onto a K-dimensional space used to derive the final cluster assignments, where K is the number of clusters. MNet is trained to strengthen (weaken) the similarity of similar (dissimilar) samples. Experimental results on multiple benchmark datasets demonstrate the effectiveness of DCSS both for data clustering and as a general framework for boosting state-of-the-art clustering methods.
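The phase-1 objective can be sketched with a toy centering loss in which each latent code is pulled toward the group centers in proportion to its soft similarity to them. The Gaussian weighting and the function names below are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def soft_weights(z, centers):
    """Soft similarity of each latent code to each group center (rows sum to 1)."""
    d2 = ((z[:, None, :] - centers[None, :, :]) ** 2).sum(-1)  # (n, K) squared distances
    w = np.exp(-d2)                                            # assumed Gaussian weighting
    return w / w.sum(axis=1, keepdims=True)

def centering_loss(z, centers):
    """Weighted pull of latent codes toward their nearby group centers."""
    d2 = ((z[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return float((soft_weights(z, centers) * d2).sum() / len(z))

z = np.array([[0.1, 0.0], [0.0, 0.2], [4.9, 5.1]])   # toy latent codes
centers = np.array([[0.0, 0.0], [5.0, 5.0]])          # two group centers
loss = centering_loss(z, centers)                     # small: codes sit near their centers
```

Points close to a center get a weight near 1 for that center, so the loss mostly penalizes their distance to it, which is the "aggregate similar points near a common center" behavior described above.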


2021 ◽  
Author(s):  
Mohammadreza Sadeghi ◽  
Narges Armanfard

Deep clustering incorporates embedding into clustering to find a lower-dimensional space suitable for clustering. In this paper we propose a novel deep clustering framework with self-supervision using pairwise data similarities (DCSS). The proposed method consists of two successive phases. In the first phase we form hypersphere-like groups of similar data points, i.e. one hypersphere per cluster, employing an autoencoder trained with cluster-specific losses. The hyperspheres are formed in the autoencoder's latent space. In the second phase, we employ pairwise data similarities to create a K-dimensional space capable of accommodating more complex cluster distributions, hence providing more accurate clustering performance, where K is the number of clusters. The autoencoder's latent space obtained in the first phase is used as the input of the second phase. The effectiveness of both phases is demonstrated on seven benchmark datasets through a rigorous set of experiments.
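The phase-2 use of pairwise similarities can be illustrated with a toy loss that pushes the agreement of two samples' K-dimensional soft assignments toward 1 for similar pairs and toward 0 for dissimilar ones. The quadratic form below is an illustrative assumption, not the paper's exact loss:

```python
import numpy as np

def pairwise_loss(p, similar):
    """p: (n, K) soft cluster assignments; similar: (n, n) 0/1 pair labels."""
    s = p @ p.T                                  # pairwise agreement in the K-dim space
    return float((similar * (1 - s) ** 2 + (1 - similar) * s ** 2).mean())

# toy example: two clusters of two samples each
similar = np.array([[1, 1, 0, 0], [1, 1, 0, 0],
                    [0, 0, 1, 1], [0, 0, 1, 1]], dtype=float)
p_good = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
p_flat = np.full((4, 2), 0.5)                    # uninformative assignments
```

Confident assignments that match the pair labels drive the loss to zero, while uninformative assignments are penalized, which is the strengthen/weaken behavior the abstract describes.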


2021 ◽  
Vol 7 ◽  
pp. e450
Author(s):  
Wenna Huang ◽  
Yong Peng ◽  
Yuan Ge ◽  
Wanzeng Kong

K-means clustering and spectral clustering are two popular methods for grouping similar data points together according to their similarities. However, the performance of K-means clustering can be quite unstable due to the random initialization of the cluster centroids. Spectral clustering methods generally employ a two-step strategy of spectral embedding followed by a discretization postprocessing step to obtain the cluster assignment, which can deviate far from the true discrete solution during postprocessing. In this paper, based on the connection between K-means clustering and spectral clustering, we propose a new K-means formulation that jointly performs spectral embedding and spectral rotation, an effective postprocessing approach for the discretization, termed KMSR. Further, instead of directly using the dot-product data similarity measure, we generalize KMSR by incorporating more advanced data similarity measures and call this generalized model KMSR-G. An efficient optimization method is derived to solve the KMSR (KMSR-G) objective, and its complexity and convergence analyses are provided. We conduct experiments on extensive benchmark datasets to validate the performance of our proposed models, and the experimental results demonstrate that our models perform better than the related methods in most cases.
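A minimal sketch of the spectral-rotation discretization that KMSR builds on (the classical alternate-and-rotate scheme in the style of Yu and Shi, not the paper's joint model): snap the rotated embedding F @ R to a discrete indicator matrix Y, then refit the rotation R with an orthogonal Procrustes step.

```python
import numpy as np

def spectral_rotation(F, n_iter=30):
    """Discretize a spectral embedding F (n x K) into cluster labels."""
    n, K = F.shape
    R = np.zeros((K, K))                         # initialize R from spread-out rows of F
    R[:, 0] = F[0] / np.linalg.norm(F[0])
    c = np.zeros(n)
    for j in range(1, K):
        c += np.abs(F @ R[:, j - 1])
        R[:, j] = F[c.argmin()] / np.linalg.norm(F[c.argmin()])
    for _ in range(n_iter):
        Y = np.zeros((n, K))
        Y[np.arange(n), (F @ R).argmax(axis=1)] = 1.0   # discretize rows to indicators
        U, _, Vt = np.linalg.svd(F.T @ Y)               # best orthogonal rotation
        R = U @ Vt
    return Y.argmax(axis=1)

# toy similarity: two well-separated groups on a line
x = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])
W = np.exp(-(x[:, None] - x[None, :]) ** 2)
d = W.sum(axis=1)
M = W / np.sqrt(d[:, None] * d[None, :])         # normalized similarity D^-1/2 W D^-1/2
F = np.linalg.eigh(M)[1][:, -2:]                 # top-2 spectral embedding
labels = spectral_rotation(F)
```

KMSR's contribution is to optimize the embedding and this rotation jointly inside a K-means objective rather than running the two steps separately, as above.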


Author(s):  
Michael C. Thrun ◽  
Felix Pape ◽  
Alfred Ultsch

Clustering is an important task in knowledge discovery, with the goal of identifying structures of similar data points in a dataset. Here, the focus lies on methods that use a human-in-the-loop, i.e., incorporate user decisions into the clustering process through 2D and 3D displays of the structures in the data. Some of these interactive approaches fall into the category of visual analytics and emphasize the power of such displays to identify structures interactively in various types of datasets or to verify the results of clustering algorithms. This work presents a new method called interactive projection-based clustering (IPBC). IPBC is an open-source, parameter-free method that uses a human-in-the-loop for an interactive 2.5D display and identification of structures in data, based on the user's choice of a dimensionality reduction method. The IPBC approach is systematically compared with accessible visual analytics methods for the display and identification of cluster structures using twelve clustering benchmark datasets and one additional natural dataset. Qualitative comparison of 2D, 2.5D and 3D displays of structures and empirical evaluation of the identified cluster structures show that IPBC outperforms comparable methods. Additionally, IPBC assists in identifying structures previously unknown to domain experts in an application.


Author(s):  
Alireza Vafaei Sadr ◽  
Bruce A. Bassett ◽  
M. Kunz

Anomaly detection is challenging, especially for large datasets in high dimensions. Here, we explore a general anomaly detection framework based on dimensionality reduction and unsupervised clustering. DRAMA is released as a Python package that implements this framework with a wide range of built-in options. The approach identifies the primary prototypes in the data, with anomalies detected by their large distances from the prototypes, either in the latent space or in the original, high-dimensional space. DRAMA is tested on a wide variety of simulated and real datasets, in up to 3000 dimensions, and is found to be robust and highly competitive with commonly used anomaly detection algorithms, especially in high dimensions. The flexibility of the DRAMA framework allows for significant optimization once some examples of anomalies are available, making it ideal for online anomaly detection, active learning, and highly unbalanced datasets. In addition, DRAMA naturally provides clustering of the outliers for subsequent analysis.
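The prototype-and-distance idea can be sketched with plain k-means prototypes; the deterministic initialization and the toy data below are assumptions for illustration, not DRAMA's actual pipeline:

```python
import numpy as np

def anomaly_scores(X, k=2, n_iter=20):
    """Score each point by its distance to the nearest prototype (Lloyd's k-means)."""
    centers = X[:k].astype(float).copy()         # deterministic init for this sketch
    for _ in range(n_iter):
        d2 = ((X[:, None] - centers[None]) ** 2).sum(-1)
        labels = d2.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return np.sqrt(((X[:, None] - centers[None]) ** 2).sum(-1).min(axis=1))

X = np.array([[0.0, 0.0], [0.1, 0.0], [0.2, 0.0],
              [5.0, 5.0], [5.1, 5.0], [5.2, 5.0],
              [20.0, 20.0]])                     # last point is the anomaly
scores = anomaly_scores(X)
```

Normal points end up close to one of the prototypes and receive near-zero scores, while the isolated point is far from every prototype and stands out; DRAMA applies the same idea after an optional dimensionality reduction step.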


Author(s):  
Zequn Wang ◽  
Mingyang Li

Conventional uncertainty quantification methods usually lack the capability to deal with high-dimensional problems due to the curse of dimensionality. This paper presents a semi-supervised learning framework for dimension reduction and reliability analysis. An autoencoder is first adopted to map the high-dimensional space into a low-dimensional latent space that contains a distinguishable failure surface. Then a deep feedforward neural network (DFN) is utilized to learn the mapping relationship and reconstruct the latent space, while Gaussian process (GP) modeling is used to build a surrogate of the transformed limit state function. During training of the DFN, the discrepancy between the actual and reconstructed latent spaces is minimized through semi-supervised learning to ensure accuracy; both labeled and unlabeled samples are utilized in defining the DFN's loss function. An evolutionary algorithm is adopted to train the DFN, and the Monte Carlo simulation method is then used for uncertainty quantification and reliability analysis within the proposed framework. Its effectiveness is demonstrated through a mathematical example.
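The final reliability step can be sketched in isolation: once a surrogate of the (transformed) limit state is available, the failure probability is estimated as the Monte Carlo fraction of samples falling in the failure region. The limit state `g_hat` below is a toy stand-in, not the paper's GP surrogate:

```python
import numpy as np

def failure_probability(g_hat, sampler, n=200_000, seed=0):
    """Monte Carlo estimate of P(g_hat(z) < 0) under the given sampler."""
    z = sampler(np.random.default_rng(seed), n)
    return float((g_hat(z) < 0.0).mean())

g_hat = lambda z: 3.0 - z[:, 0]                  # toy limit state: fail when z0 > 3
sampler = lambda rng, n: rng.standard_normal((n, 2))
pf = failure_probability(g_hat, sampler)         # roughly 1 - Phi(3), about 1.35e-3
```

In the paper's framework the samples would first be mapped into the latent space and `g_hat` would be the GP surrogate of the transformed limit state; the estimator itself is unchanged.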


2016 ◽  
Vol 6 (6) ◽  
pp. 1235-1240
Author(s):  
H. Alizadeh ◽  
B. Minaei Bidgoli

The main aim of this study was to introduce a comprehensive model for evaluating bank customers' loyalty, based on the assessment and comparison of the performance of different clustering methods. The study also pursues the following specific objectives: a) using different clustering methods and comparing them for customer classification, b) finding the variables that are effective in determining customer loyalty, and c) using different ensemble classification methods to increase modeling accuracy and comparing the results with the basic methods. Since loyal customers generate more profit, this study introduces a two-step model for classifying customers and their loyalty. For this purpose, various clustering methods such as K-medoids, X-means and K-means were used, the last of which outperformed the other two according to the Davies-Bouldin index. Customers were clustered using K-means, and the members of the resulting four clusters were analyzed and labeled. Then, a predictive model based on the customers' demographic variables was run using various classification methods such as DT (Decision Tree), ANN (Artificial Neural Networks), NB (Naive Bayes), KNN (K-Nearest Neighbors) and SVM (Support Vector Machine), as well as their bagging and boosting variants, to predict the class of loyal customers. The results showed that bagging-ANN was the most accurate method for predicting loyal customers. This two-stage model can be used in banks and financial institutions with similar data to identify the type of future customers.
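The Davies-Bouldin comparison used to pick the clustering method can be sketched on toy data (lower index means tighter, better-separated clusters); this is a minimal reimplementation of the index, not the study's code:

```python
import numpy as np

def davies_bouldin(X, labels):
    """Davies-Bouldin index: within-cluster scatter vs between-centroid separation."""
    ks = np.unique(labels)
    cents = np.array([X[labels == k].mean(axis=0) for k in ks])
    scatter = np.array([np.linalg.norm(X[labels == k] - cents[i], axis=1).mean()
                        for i, k in enumerate(ks)])
    total = 0.0
    for i in range(len(ks)):                     # worst-case similarity per cluster
        total += max((scatter[i] + scatter[j]) / np.linalg.norm(cents[i] - cents[j])
                     for j in range(len(ks)) if j != i)
    return total / len(ks)

X = np.array([[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.2, 5.0]])
good = np.array([0, 0, 1, 1])                    # labels matching the two blobs
bad = np.array([0, 1, 0, 1])                     # labels mixing the blobs
```

A clustering that respects the blobs scores far lower than one that mixes them, which is the criterion by which K-means was preferred over K-medoids and X-means above.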


Author(s):  
Ping Deng ◽  
Qingkai Ma ◽  
Weili Wu

Clustering can be considered the most important unsupervised learning problem. It has been discussed thoroughly by both the statistics and database communities due to its numerous applications in problems such as classification, machine learning, and data mining. A summary of clustering techniques can be found in (Berkhin, 2002). Most well-known clustering algorithms, such as DBSCAN (Ester, Kriegel, Sander, & Xu, 1996) and CURE (Guha, Rastogi, & Shim, 1998), cluster data points based on the full set of dimensions. As the dimensionality grows, these algorithms lose their efficiency and accuracy because of the so-called “curse of dimensionality”. It is shown in (Beyer, Goldstein, Ramakrishnan, & Shaft, 1999) that computing distances based on full dimensions is not meaningful in high-dimensional space, since the distance of a point to its nearest neighbor approaches the distance to its farthest neighbor as dimensionality increases. In fact, natural clusters might exist in subspaces: data points in different clusters may be correlated with respect to different subsets of dimensions. To address this problem, feature selection (Kohavi & Sommerfield, 1995) and dimension reduction (Raymer, Punch, Goodman, Kuhn, & Jain, 2000) have been proposed to find the closely correlated dimensions for all the data and the clusters in those dimensions. Although both methods reduce the dimensionality of the space before clustering, they do not handle well the case where clusters exist in different subspaces of the full dimensions. Projected clustering has recently been proposed to deal effectively with high dimensionalities. Finding clusters and their relevant dimensions are the objectives of projected clustering algorithms. Instead of projecting the entire dataset onto the same subspace, projected clustering finds a specific projection for each cluster such that similarity is preserved as much as possible.
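The distance-concentration effect cited from Beyer et al. can be checked empirically: the ratio of the farthest to the nearest distance from a query point shrinks toward 1 as dimensionality grows. The uniform data and the function name below are illustrative choices:

```python
import numpy as np

def contrast(dim, n=200, seed=0):
    """Farthest-to-nearest distance ratio from one query point in `dim` dimensions."""
    X = np.random.default_rng(seed).random((n, dim))     # uniform points in the unit cube
    d = np.linalg.norm(X[1:] - X[0], axis=1)             # distances from the query X[0]
    return float(d.max() / d.min())

low = contrast(2)        # in 2-D, the farthest point is many times farther than the nearest
high = contrast(1000)    # in 1000-D, the ratio collapses toward 1
```

When this ratio is close to 1, "nearest neighbor" carries little information, which is exactly why full-dimensional clustering degrades and subspace/projected approaches become attractive.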


Author(s):  
Diana Mateus ◽  
Christian Wachinger ◽  
Selen Atasoy ◽  
Loren Schwarz ◽  
Nassir Navab

Computer-aided diagnosis is often confronted with processing and analyzing high-dimensional data. One way to deal with such data is dimensionality reduction. This chapter focuses on manifold learning methods for creating low-dimensional data representations adapted to a given application. From pairwise non-linear relations between neighboring data points, manifold learning algorithms first approximate the low-dimensional manifold where the data lives with a graph; then they find a non-linear map to embed this graph into a low-dimensional space. Since the explicit pairwise relations and the neighborhood system can be designed according to the application, manifold learning methods are very flexible and allow easy incorporation of domain knowledge. The authors describe different assumptions and design elements that are crucial to building successful low-dimensional data representations with manifold learning for a variety of applications. In particular, they discuss examples for visualization, clustering, classification, registration, and human-motion modeling.
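A minimal Laplacian-eigenmaps-style sketch of the graph-then-embed recipe described above; the heat-kernel affinities and the unnormalized Laplacian are one design choice among those the chapter discusses:

```python
import numpy as np

def laplacian_embedding(X, dim=1, sigma=1.0):
    """Embed X via the bottom non-trivial eigenvectors of a graph Laplacian."""
    d2 = ((X[:, None] - X[None]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma ** 2))           # heat-kernel pairwise affinities
    np.fill_diagonal(W, 0.0)
    L = np.diag(W.sum(axis=1)) - W               # unnormalized graph Laplacian
    _, vecs = np.linalg.eigh(L)                  # eigenvalues in ascending order
    return vecs[:, 1:1 + dim]                    # skip the constant eigenvector

t = np.linspace(0.0, 1.0, 10)
X = np.column_stack([t, 2 * t, 3 * t])           # points along a line in 3-D
y = laplacian_embedding(X)[:, 0]                 # 1-D embedding tracks position along it
```

Swapping the affinity design (k-nearest-neighbor graphs, application-specific similarity measures) is exactly where the domain knowledge mentioned above enters.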

