Clustering, factor discovery and optimal transport

Author(s):  
Hongkang Yang ◽  
Esteban G Tabak

The clustering problem, and more generally latent factor discovery or latent space inference, is formulated in terms of the Wasserstein barycenter problem from optimal transport. The objective proposed is the maximization of the variability attributable to class, further characterized as the minimization of the variance of the Wasserstein barycenter. Existing theory, which constrains the transport maps to rigid translations, is extended to affine transformations. The resulting non-parametric clustering algorithms include $k$-means as a special case and exhibit more robust performance. A continuous version of these algorithms discovers continuous latent variables and generalizes principal curves. The strength of these algorithms is demonstrated by tests on both artificial and real-world data sets.
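To make the $k$-means connection concrete, here is a minimal sketch (an illustration, not the authors' code): when the transport maps are restricted to rigid translations, the optimal translation carries each cluster mean onto the common barycenter, and minimizing the barycenter's variance reduces to Lloyd's k-means iteration.

```python
import numpy as np

def translation_barycenter_clustering(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assignment: each point joins the cluster whose translated copy
        # leaves it with the smallest residual, i.e. the nearest center.
        labels = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
        # Update: the optimal rigid translation maps each cluster mean onto
        # the common barycenter, so each center is just its cluster mean.
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers
```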

2020 ◽  
Vol 34 (04) ◽  
pp. 3211-3218
Author(s):  
Liang Bai ◽  
Jiye Liang

Due to the complex structure of real-world data, nonlinearly separable clustering is one of the most widely studied clustering problems. Various types of algorithms, such as kernel k-means, spectral clustering and density clustering, have been developed to solve it. However, these algorithms struggle to balance the efficiency and effectiveness of clustering, which limits their practical application. To address this deficiency, we propose a three-level optimization model for nonlinearly separable clustering which divides the clustering problem into three sub-problems: a linearly separable clustering on the object set, a nonlinearly separable clustering on the cluster set and an ensemble clustering on the partition set. An iterative algorithm is proposed to solve the optimization problem; it recognizes nonlinearly separable clusters effectively at low computational cost. The performance of this algorithm has been studied on synthetic and real data sets, and comparisons with other nonlinearly separable clustering algorithms illustrate its efficiency and effectiveness.
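A hedged sketch of how the first two levels could fit together (the ensemble level over repeated runs is omitted; the function name, parameters and the use of scikit-learn are illustrative assumptions, not the authors' implementation):

```python
from sklearn.cluster import KMeans, SpectralClustering

def three_level_sketch(X, n_fine=50, n_clusters=2, seed=0):
    # Level 1: linearly separable clustering on the object set,
    # producing many small, cheap clusters.
    fine = KMeans(n_clusters=n_fine, n_init=10, random_state=seed).fit(X)
    # Level 2: nonlinearly separable clustering on the much smaller
    # cluster set (here, spectral clustering on the fine centers).
    coarse = SpectralClustering(n_clusters=n_clusters, random_state=seed)
    coarse_labels = coarse.fit_predict(fine.cluster_centers_)
    # Carry each object's fine label through to its cluster's coarse label.
    return coarse_labels[fine.labels_]
```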


2019 ◽  
Vol 48 (4) ◽  
pp. 682-693
Author(s):  
Bo Zheng ◽  
Jinsong Hu

Matrix Factorization (MF) is one of the most intuitive and effective methods in the recommendation system domain. It projects sparse (user, item) interactions into dense feature products, which gives the MF model strong generality. To enrich these interactions, recent works use auxiliary information about users and items. Despite their effectiveness, these methods share a weakness: almost all of them simply add the auxiliary-information feature in the dense latent space to the feature of the user or item. In this work, we propose a novel model named AANMF, short for Attribute-aware Attentional Neural Matrix Factorization. AANMF combines two main parts: a neural-network-based factorization architecture for modeling the inner product, and an attention-mechanism-based attribute processing cell for attribute handling. Extensive experiments on two real-world data sets demonstrate the robust, stronger performance of our model. Notably, we show that our model handles the attributes of users and items more reasonably. Our implementation of AANMF is publicly available at https://github.com/Holy-Shine/AANMF.
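A minimal PyTorch sketch of the two ingredients the abstract names; the layer sizes, pooling rule and all identifiers are illustrative assumptions, not the published AANMF architecture (see the repository above for that):

```python
import torch
import torch.nn as nn

class AttrAttentionMF(nn.Module):
    def __init__(self, n_users, n_items, n_attrs, dim=32):
        super().__init__()
        self.user = nn.Embedding(n_users, dim)
        self.item = nn.Embedding(n_items, dim)
        self.attr = nn.Embedding(n_attrs, dim)
        self.att = nn.Linear(dim, 1)   # scores each attribute embedding
        self.mlp = nn.Sequential(      # neural stand-in for the inner product
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, u, i, attrs):                  # attrs: (batch, n_item_attrs)
        a = self.attr(attrs)                         # (batch, n_item_attrs, dim)
        w = torch.softmax(self.att(a), dim=1)        # attention weights over attributes
        item_vec = self.item(i) + (w * a).sum(dim=1) # attribute-aware item feature
        return self.mlp(torch.cat([self.user(u), item_vec], dim=-1)).squeeze(-1)
```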


Author(s):  
Deepali Virmani ◽  
Nikita Jain ◽  
Ketan Parikh ◽  
Shefali Upadhyaya ◽  
Abhishek Srivastav

This article considers how data become useful when they can be organized, linked with other data and grouped into clusters. Clustering is the process of organizing a given set of objects into a set of disjoint groups called clusters. There are a number of clustering algorithms, such as k-means, k-medoids and normalized k-means, so the focus remains on the efficiency and accuracy of these algorithms, on the time clustering takes, and on reducing overlap between clusters. K-means is one of the simplest unsupervised learning algorithms for the well-known clustering problem. It partitions data into K clusters around randomly chosen initial centroids, and its reliance on numeric values prohibits its use on real-world data containing categorical values; poor selection of the initial centroids can also result in poor clustering. This article proposes a variant of k-means that achieves better clustering, reduced overlap and shorter clustering time by selecting the initial centres deliberately and normalizing the data.
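One way the two modifications could look in practice; this is an illustrative sketch, and the farthest-point selection rule is an assumption rather than the article's exact procedure:

```python
import numpy as np

def normalized_kmeans_init(X, k):
    # Min-max normalize each feature so no attribute dominates the distance.
    Xn = (X - X.min(axis=0)) / (np.ptp(X, axis=0) + 1e-12)
    # Farthest-point heuristic: start near the data mean, then repeatedly
    # add the point farthest from the centres chosen so far.
    centers = [Xn[((Xn - Xn.mean(axis=0)) ** 2).sum(axis=1).argmin()]]
    for _ in range(k - 1):
        d = np.min([((Xn - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(Xn[d.argmax()])
    return Xn, np.array(centers)
```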


2020 ◽  
Vol 2020 ◽  
pp. 1-15
Author(s):  
Peng Zhang ◽  
Kun She

The target of clustering analysis is to group a set of data points into several clusters based on similarity or distance. In traditional clustering algorithms, the similarity or distance is usually a scalar. Nevertheless, a vector, such as data gravitational force, contains more information than a scalar and can be applied in clustering analysis to improve performance. Therefore, this paper proposes a three-stage hierarchical clustering approach called GHC, which takes advantage of the vector character of data gravitational force, inspired by the law of universal gravitation. In the first stage, a sparse gravitational graph is constructed from the top k data gravitations between each data point and its neighbors in the local region. The sparse graph is then partitioned into many subgraphs by the gravitational influence coefficient. In the last stage, a satisfactory clustering result is obtained by merging these subgraphs iteratively using a new linkage criterion. To demonstrate the performance of the GHC algorithm, experiments are conducted on synthetic and real-world data sets; the results show that GHC achieves better performance than other existing clustering algorithms.
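A hedged sketch of the first stage only, keeping the force as a vector; the unit masses, inverse-square form, and all names and parameters here are illustrative, not GHC's exact construction:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def gravitational_edges(X, k=5, G=1.0):
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dist, idx = nn.kneighbors(X)              # column 0 is the point itself
    edges = []
    for i in range(len(X)):
        for d, j in zip(dist[i, 1:], idx[i, 1:]):
            d = max(d, 1e-12)                 # guard against duplicate points
            direction = (X[j] - X[i]) / d     # unit vector toward neighbor j
            force = G * direction / d ** 2    # inverse-square law, unit masses
            edges.append((i, j, force))       # edge of the sparse gravitational graph
    return edges
```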


Author(s):  
Yuguang Yan ◽  
Mingkui Tan ◽  
Yanwu Xu ◽  
Jiezhang Cao ◽  
Michael Ng ◽  
...  

The issue of data imbalance occurs in many real-world applications, especially in medical diagnosis, where normal cases usually far outnumber abnormal cases. One of the most important approaches to alleviating this issue is oversampling, which synthesizes minority-class samples to balance the class sizes. However, existing methods barely consider the global geometric information in the distribution of minority-class samples, and thus may incur a distribution mismatch between real and synthetic samples. In this paper, relying on optimal transport (Villani 2008), we propose an oversampling method that exploits the global geometric information of the data to make synthetic samples follow a distribution similar to that of the minority class. Moreover, we introduce a novel regularization based on synthetic samples and shift the distribution of minority-class samples according to loss information. Experiments on toy and real-world data sets demonstrate the efficacy of our proposed method in terms of multiple metrics.
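A hedged sketch of the core idea, not the authors' algorithm: rough synthetic seeds are pulled toward the minority-class distribution by the barycentric projection of an optimal-transport plan (computed with the POT library; the noise scale and uniform weights are assumptions):

```python
import numpy as np
import ot  # POT, the Python Optimal Transport library

def ot_oversample(X_min, n_new, noise=0.1, seed=0):
    rng = np.random.default_rng(seed)
    # Naive seeds: Gaussian jitter around random minority samples.
    seeds = X_min[rng.choice(len(X_min), n_new)]
    seeds = seeds + noise * rng.standard_normal(seeds.shape)
    # Optimal-transport plan between the seeds and the real minority samples.
    M = ot.dist(seeds, X_min)                  # squared Euclidean costs
    a = np.full(n_new, 1.0 / n_new)
    b = np.full(len(X_min), 1.0 / len(X_min))
    plan = ot.emd(a, b, M)
    # Barycentric projection: move each seed to the plan-weighted average
    # of the real samples it is coupled with.
    return (plan @ X_min) / plan.sum(axis=1, keepdims=True)
```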


Author(s):  
Antonino Staiano ◽  
Lara De Vinco ◽  
Giuseppe Longo ◽  
Roberto Tagliaferri

Probabilistic Principal Surfaces (PPS) is a nonlinear latent variable model with very powerful visualization and classification capabilities, which seems able to overcome most of the shortcomings of other neural tools. PPS builds a probability density function of a given set of patterns lying in a high-dimensional space, expressed in terms of a fixed number of latent variables lying in a Q-dimensional latent space. Usually, the latent space is two- or three-dimensional, so the density function can be used to visualize the data within it. The case Q = 3 allows the patterns to be projected onto a spherical manifold, which turns out to be optimal when dealing with sparse data. PPS may also be arranged in ensembles to tackle complex classification tasks. As template cases we discuss the application of PPS to two real-world data sets from astronomy and genetics.
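A hedged sketch of the Q = 3 visualization step only: latent nodes are placed on a unit sphere and each pattern is projected through the responsibilities of a simple Gaussian mixture. The full PPS model fits such a mixture by an EM procedure not shown here, and every name and parameter below is illustrative:

```python
import numpy as np

def sphere_nodes(n):
    # Fibonacci lattice: roughly uniform latent nodes on the unit sphere.
    i = np.arange(n) + 0.5
    phi = np.arccos(1 - 2 * i / n)
    theta = np.pi * (1 + 5 ** 0.5) * i
    return np.stack([np.sin(phi) * np.cos(theta),
                     np.sin(phi) * np.sin(theta),
                     np.cos(phi)], axis=1)

def project_to_sphere(X, data_space_centers, nodes, sigma=1.0):
    # Responsibility of each node's Gaussian (centered at its image in
    # data space, here assumed already fitted) for each pattern.
    d2 = ((X[:, None, :] - data_space_centers[None, :, :]) ** 2).sum(-1)
    r = np.exp(-d2 / (2 * sigma ** 2))
    r /= r.sum(axis=1, keepdims=True)
    return r @ nodes   # posterior-mean latent position for visualization
```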


2008 ◽  
Vol 18 (03) ◽  
pp. 185-194 ◽  
Author(s):  
Wesam Barbakh ◽  
Colin Fyfe

We introduce a set of clustering algorithms whose performance function is designed to overcome one of the weaknesses of K-means: its sensitivity to initial conditions, which leads it to converge to a local optimum rather than the global optimum. We derive online learning algorithms and illustrate their convergence to optimal solutions that K-means fails to find. We then extend the algorithm by underpinning it with a latent space, which enables a topology preserving mapping to be found. We show visualisation results on some standard data sets.
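An illustration of the flavour of such performance functions, not the authors' exact ones: an online update in which every centre, not just the winner, moves toward each sample with a weight that decays with distance, so poorly initialized centres are still drawn toward the data:

```python
import numpy as np

def online_soft_kmeans(X, k, lr=0.05, beta=2.0, n_epochs=10, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)].astype(float)
    for _ in range(n_epochs):
        for x in X[rng.permutation(len(X))]:
            d = np.linalg.norm(centers - x, axis=1) + 1e-12
            w = d ** -beta
            w /= w.sum()                          # distance-decaying weights
            centers += lr * w[:, None] * (x - centers)
    return centers
```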


Author(s):  
D T Pham ◽  
A A Afify

Clustering is an important data exploration technique with many applications in different areas of engineering, including engineering design, manufacturing system design, quality assurance, production planning and process planning, modelling, monitoring, and control. The clustering problem has been addressed by researchers from many disciplines. However, efforts to perform effective and efficient clustering on large data sets only started in recent years with the emergence of data mining. The current paper presents an overview of clustering algorithms from a data mining perspective. Attention is paid to techniques of scaling up these algorithms to handle large data sets. The paper also describes a number of engineering applications to illustrate the potential of clustering algorithms as a tool for handling complex real-world problems.


2013 ◽  
Vol 2013 ◽  
pp. 1-12 ◽  
Author(s):  
Singh Vijendra ◽  
Sahoo Laxman

Clustering high-dimensional data has been a major challenge due to the inherent sparsity of the points. Most existing clustering algorithms become substantially inefficient if the required similarity measure is computed between data points in the full-dimensional space. In this paper, we present a robust multi-objective subspace clustering (MOSCL) algorithm for the challenging problem of high-dimensional clustering. The first phase of MOSCL performs subspace relevance analysis by detecting dense and sparse regions and their locations in the data set; after detecting the dense regions, it eliminates outliers. MOSCL then discovers subspaces in the dense regions and produces subspace clusters. In thorough experiments on synthetic and real-world data sets, we demonstrate that MOSCL is superior to the PROCLUS clustering algorithm. Additionally, we investigate the effect of the first phase's dense-region detection on the results of subspace clustering; our results indicate that removing outliers improves the accuracy of subspace clustering. The clustering results are validated by the clustering error (CE) distance on various data sets. MOSCL can discover clusters in all subspaces with high quality, and it is also more efficient than PROCLUS.
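A hedged sketch of the flavour of the first phase: each dimension is scored by how concentrated the data are along it, via a simple histogram density. The bin count and threshold are illustrative assumptions, not MOSCL's actual relevance test:

```python
import numpy as np

def relevant_dimensions(X, n_bins=10, factor=2.0):
    relevant = []
    for j in range(X.shape[1]):
        counts, _ = np.histogram(X[:, j], bins=n_bins)
        # A dimension hosts a dense region if some bin holds far more
        # points than a uniform spread would (len(X) / n_bins per bin).
        if counts.max() > factor * len(X) / n_bins:
            relevant.append(j)
    return relevant
```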


2018 ◽  
Vol 30 (6) ◽  
pp. 1624-1646 ◽  
Author(s):  
Qidong Liu ◽  
Ruisheng Zhang ◽  
Zhili Zhao ◽  
Zhenghai Wang ◽  
Mengyao Jiao ◽  
...  

Minimax similarity stresses the connectedness of points via mediating elements rather than favoring high mutual similarity. This grouping principle yields superior clustering results when mining arbitrarily shaped clusters in data. However, it is not robust against noise and outliers, for two main reasons: first, a single object that is far away from all other objects defines a separate cluster, and second, two connected clusters would be regarded as parts of one cluster. To solve these problems, we propose a robust minimum spanning tree (MST)-based clustering algorithm in this letter. First, we separate the connected objects by applying a density-based coarsening phase, resulting in a low-rank matrix whose elements denote supernodes, each combining a set of nodes. A greedy method is then presented to partition those supernodes by working on the low-rank matrix. Instead of removing the longest edges from the MST, our algorithm groups the data set based on minimax similarity. Finally, all data points are assigned through their corresponding supernodes. Experimental results on many synthetic and real-world data sets show that our algorithm consistently outperforms the compared clustering algorithms.
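A hedged sketch of the minimax similarity itself, separate from the letter's coarsening and partitioning steps: the minimax distance between two points is the smallest achievable bottleneck (largest edge) over all connecting paths, and it can be computed from the MST:

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def minimax_distances(X):
    # Build the MST of the complete Euclidean graph (assumes distinct points).
    mst = minimum_spanning_tree(squareform(pdist(X))).toarray()
    mst = np.maximum(mst, mst.T)                   # symmetrize the tree
    D = np.where(mst > 0, mst, np.inf)
    np.fill_diagonal(D, 0.0)
    # Floyd-Warshall on the (min, max) semiring: a path's "length" is its
    # largest edge, and we minimize that over all paths through the tree.
    for k in range(len(X)):
        D = np.minimum(D, np.maximum(D[:, k:k + 1], D[k:k + 1, :]))
    return D
```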

