A parameter-less algorithm for tensor co-clustering

Machine Learning ◽

10.1007/s10994-021-06002-w ◽

2021 ◽

Author(s):

Elena Battaglia ◽

Ruggero G. Pensa

Keyword(s):

Clustering Algorithm ◽

Tensor Factorization ◽

Clustering Methods ◽

Convergence Properties ◽

Physical Systems ◽

Discrete Random Variables ◽

Real World Datasets ◽

Measure Of Association ◽

Block Models ◽

Optimization Schemes

The majority of the data produced by human activities and modern cyber-physical systems involve complex relations among their features. Such relations can often be represented by means of tensors, which can be viewed as a generalization of matrices and, as such, can be analyzed by using higher-order extensions of existing machine learning methods, such as clustering and co-clustering. Tensor co-clustering, in particular, has proven useful in many applications, due to its ability to cope with n-modal data and sparsity. However, setting up a co-clustering algorithm properly requires the specification of the desired number of clusters for each mode as input parameters. This choice is already difficult in relatively easy settings, like flat clustering on data matrices, but on tensors it can be even more frustrating. To address this issue, we propose a new tensor co-clustering algorithm that does not require the number of desired co-clusters as input, as it optimizes an objective function based on a measure of association across discrete random variables (called Goodman and Kruskal’s $\tau$) that is not affected by their cardinality. We introduce different optimization schemes and show their theoretical and empirical convergence properties. Additionally, we show the effectiveness of our algorithm on both synthetic and real-world datasets, also in comparison with state-of-the-art co-clustering methods based on tensor factorization and latent block models.
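As a rough illustration of the association measure named in the abstract, the sketch below computes Goodman and Kruskal's τ for a plain two-way contingency table, i.e. the proportional reduction in error when predicting the column variable from the row variable. The function name `goodman_kruskal_tau` and the list-of-lists input are assumptions made for this example; the abstract does not give the paper's tensor objective, so this only covers the textbook two-variable case.

```python
def goodman_kruskal_tau(table):
    """Goodman-Kruskal tau for predicting the column variable Y
    from the row variable X, given a table of counts."""
    total = float(sum(sum(r) for r in table))
    p = [[c / total for c in r] for r in table]   # joint probabilities
    row = [sum(r) for r in p]                     # marginal of X
    col = [sum(c) for c in zip(*p)]               # marginal of Y

    # Expected error when guessing Y from its marginal alone.
    err_y = 1.0 - sum(q * q for q in col)
    if err_y == 0.0:
        return 0.0  # Y is constant, so there is nothing to predict

    # Expected error when guessing Y from its distribution within each row.
    err_y_given_x = 1.0 - sum(
        p_ij * p_ij / row[i]
        for i, r in enumerate(p) if row[i] > 0
        for p_ij in r
    )
    return (err_y - err_y_given_x) / err_y


perfect = goodman_kruskal_tau([[5, 0], [0, 5]])      # X determines Y
independent = goodman_kruskal_tau([[1, 1], [1, 1]])  # X tells nothing about Y
```

The value lies between 0 (independence) and 1 (the row fully determines the column), which is the normalization that the abstract relies on when it says the measure is not affected by the variables' cardinality.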

Doubly Aligned Incomplete Multi-view Clustering

Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2018/313 ◽

2018 ◽

Cited By ~ 12

Author(s):

Menglei Hu ◽

Songcan Chen

Keyword(s):

Clustering Algorithm ◽

Nonnegative Matrix Factorization ◽

Nonnegative Matrix ◽

Regularized Regression ◽

Clustering Methods ◽

Basis Matrix ◽

Real World Datasets ◽

The One ◽

Almost All ◽

The Given

Multi-view clustering has attracted increasing attention. To date, almost all previous studies assume that views are complete. In reality, however, each view may contain missing instances. Such incompleteness makes it impossible to directly use traditional multi-view clustering methods. In this paper, we propose a Doubly Aligned Incomplete Multi-view Clustering algorithm (DAIMC) based on weighted semi-nonnegative matrix factorization (semi-NMF). Specifically, on the one hand, DAIMC utilizes the given instance alignment information to learn a common latent feature matrix for all the views. On the other hand, DAIMC establishes a consensus basis matrix with the help of L2,1-norm regularized regression to reduce the influence of missing instances. Consequently, compared with existing methods, besides inheriting the strength of semi-NMF in handling negative entries, DAIMC has two unique advantages: 1) it solves the incomplete view problem by introducing a separate weight matrix for each view, so that it adapts easily to the case with more than two views; 2) it reduces the influence of view incompleteness on clustering by enforcing the alignment of the basis matrices of individual views with the help of regression. Experiments on four real-world datasets demonstrate its advantages.

An Affinity Propagation Clustering Algorithm for Mixed Numeric and Categorical Datasets

Mathematical Problems in Engineering ◽

10.1155/2014/486075 ◽

2014 ◽

Vol 2014 ◽

pp. 1-8 ◽

Cited By ~ 7

Author(s):

Kang Zhang ◽

Xingsheng Gu

Keyword(s):

Real World ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Affinity Propagation ◽

Mixed Data ◽

Clustering Methods ◽

Affinity Propagation Clustering ◽

Real World Datasets ◽

Data Objects ◽

Clustering Problems

Clustering has been widely used in different fields of science, technology, social science, and so forth. In the real world, numeric as well as categorical features are usually used to describe the data objects. Accordingly, many clustering methods can process datasets that are either numeric or categorical. Recently, algorithms that can handle mixed data clustering problems have been developed. The affinity propagation (AP) algorithm is an exemplar-based clustering method which has demonstrated good performance on a wide variety of datasets. However, it has limitations in processing mixed datasets. In this paper, we propose a novel similarity measure for mixed-type datasets and an adaptive AP clustering algorithm to cluster the mixed datasets. Several real-world datasets are studied to evaluate the performance of the proposed algorithm. Comparisons with other clustering algorithms demonstrate that the proposed method works well not only on mixed datasets but also on pure numeric and categorical datasets.

A Density Peak Clustering Algorithm Based on the K-Nearest Shannon Entropy and Tissue-Like P System

Mathematical Problems in Engineering ◽

10.1155/2019/1713801 ◽

2019 ◽

Vol 2019 ◽

pp. 1-13 ◽

Cited By ~ 3

Author(s):

Zhenni Jiang ◽

Xiyu Liu ◽

Minghe Sun

Keyword(s):

Shannon Entropy ◽

Clustering Algorithm ◽

P Systems ◽

P System ◽

Clustering Methods ◽

K Nearest Neighbors ◽

Density Peak ◽

New Variant ◽

Real World Datasets ◽

Density Peak Clustering

This study proposes a novel method to calculate the density of the data points based on K-nearest neighbors and Shannon entropy. A variant of tissue-like P systems with active membranes is introduced to realize the clustering process. The new variant of tissue-like P systems can improve the efficiency of the algorithm and reduce the computational complexity. Finally, experimental results on synthetic and real-world datasets show that the new method is more effective than other state-of-the-art clustering methods.

Designing a parallel Feel-the-Way clustering algorithm on HPC systems

The International Journal of High Performance Computing Applications ◽

10.1177/1094342020975194 ◽

2020 ◽

pp. 109434202097519

Author(s):

Weijian Zheng ◽

Dali Wang ◽

Fengguang Song

Keyword(s):

Convergence Rate ◽

High Performance ◽

Clustering Algorithm ◽

Algorithm Design ◽

Computing System ◽

Clustering Methods ◽

Clustering Method ◽

Real World Datasets ◽

Number Of Iterations ◽

The Way

This paper introduces a new parallel clustering algorithm, named the Feel-the-Way clustering algorithm, that provides a better or equivalent convergence rate compared with traditional clustering methods by optimizing the synchronization and communication costs. Our algorithm design centers on how to optimize three factors simultaneously: reduced synchronizations, improved convergence rate, and the same or comparable optimization cost. To compare the optimization cost, we use the Sum of Square Error (SSE) cost as the metric, which is the sum of the squared distances between each data point and its assigned cluster. Compared with the traditional MPI k-means algorithm, the new Feel-the-Way algorithm requires fewer communications among participating processes. As for the convergence rate, the new algorithm requires fewer iterations to converge. As for the optimization cost, it obtains SSE costs that are close to those of the k-means algorithm. In the paper, we first design the full-step Feel-the-Way k-means clustering algorithm, which can significantly reduce the number of iterations required by the original k-means clustering method. Next, we improve the performance of the full-step algorithm by adopting an optimized sampling-based approach, named reassignment-history-aware sampling. Our experimental results show that the optimized sampling-based Feel-the-Way method is significantly faster than the widely used k-means clustering method, and can provide comparable optimization costs. More extensive experiments with several synthetic datasets and real-world datasets (e.g., MNIST, CIFAR-10, ENRON, and PLACES-2) show that the new parallel algorithm can outperform the open source MPI k-means library by up to 110% on a high-performance computing system using 4,096 CPU cores. In addition, the new algorithm can take up to 51% fewer iterations to converge than the k-means clustering algorithm.
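The SSE metric described in this abstract can be sketched in a few lines. The function below is a minimal illustration, assuming each cluster is represented by a centroid and each point carries the index of its assigned cluster; the names `sse`, `points`, `centroids`, and `labels` are chosen for this example and do not come from the paper.

```python
def sse(points, centroids, labels):
    """Sum of squared Euclidean distances between each point
    and the centroid of the cluster it is assigned to."""
    total = 0.0
    for x, k in zip(points, labels):
        c = centroids[k]
        total += sum((xi - ci) ** 2 for xi, ci in zip(x, c))
    return total


points = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
centroids = [(1.0, 0.0), (10.0, 0.0)]
labels = [0, 0, 1]
cost = sse(points, centroids, labels)  # 1 + 1 + 0
```

A lower value means tighter clusters, which is why it serves as a common yardstick when comparing a new method against standard k-means.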

Multiway p-spectral graph cuts on Grassmann manifolds

Machine Learning ◽

10.1007/s10994-021-06108-1 ◽

2021 ◽

Author(s):

Dimosthenis Pasadakis ◽

Christie Louis Alappat ◽

Olaf Schenk ◽

Gerhard Wellein

Keyword(s):

Spectral Clustering ◽

Clustering Algorithm ◽

Unconstrained Minimization ◽

Graph Cuts ◽

Graph Laplacian ◽

Clustering Methods ◽

Real World Datasets ◽

Comparative Results ◽

Spectral Clustering Algorithm

Nonlinear reformulations of the spectral clustering method have gained a lot of recent attention due to their increased numerical benefits and their solid mathematical background. We present a novel direct multiway spectral clustering algorithm in the p-norm, for $p \in (1,2]$. The problem of computing multiple eigenvectors of the graph p-Laplacian, a nonlinear generalization of the standard graph Laplacian, is recast as an unconstrained minimization problem on a Grassmann manifold. The value of p is reduced in a pseudocontinuous manner, promoting sparser solution vectors that correspond to optimal graph cuts as p approaches one. Monitoring the monotonic decrease of the balanced graph cuts guarantees that we obtain the best available solution from the p-levels considered. We demonstrate the effectiveness and accuracy of our algorithm in various artificial test-cases. Our numerical examples and comparative results with various state-of-the-art clustering methods indicate that the proposed method obtains high quality clusters both in terms of balanced graph cut metrics and in terms of the accuracy of the labelling assignment. Furthermore, we conduct studies for the classification of facial images and handwritten characters to demonstrate the applicability in real-world datasets.

Factor-Bounded Nonnegative Matrix Factorization

ACM Transactions on Knowledge Discovery from Data ◽

10.1145/3451395 ◽

2021 ◽

Vol 15 (6) ◽

pp. 1-18

Author(s):

Kai Liu ◽

Xiangyu Li ◽

Zhihui Zhu ◽

Lodewijk Brand ◽

Hua Wang

Keyword(s):

Matrix Factorization ◽

Clustering Algorithm ◽

Nonnegative Matrix Factorization ◽

Nonnegative Matrix ◽

Optimization Methods ◽

Auxiliary Function ◽

Image Clustering ◽

Real World Datasets ◽

The Relationship ◽

Matrix Factors

Nonnegative Matrix Factorization (NMF) is broadly used to determine class membership in a variety of clustering applications. From movie recommendations and image clustering to visual feature extraction, NMF has been applied to a large number of knowledge discovery and data mining problems. Traditional optimization methods, such as the Multiplicative Updating Algorithm (MUA), solve the NMF problem by utilizing an auxiliary function to ensure that the objective monotonically decreases. Although the objective in MUA converges, there exists no proof to show that the learned matrix factors converge as well. Without this rigorous analysis, the clustering performance and stability of the NMF algorithms cannot be guaranteed. To address this knowledge gap, in this article, we study the factor-bounded NMF problem and provide a solution algorithm with proven convergence by rigorous mathematical analysis, which ensures that both the objective and matrix factors converge. In addition, we show the relationship between MUA and our solution followed by an analysis of the convergence of MUA. Experiments on both toy data and real-world datasets validate the correctness of our proposed method and its utility as an effective clustering algorithm.
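To make the baseline concrete, here is a minimal sketch of the classical multiplicative update rule for the squared Frobenius objective, written with plain Python lists. It is the standard Lee and Seung style update, not the factor-bounded algorithm proposed in the article. The helper names, the small `eps` added to the denominators, and the random initialization are all assumptions for this example. The recorded `errors` list shows the behaviour the abstract refers to: the objective does not increase from one iteration to the next.

```python
import random


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def frobenius_error(v, w, h):
    """Squared Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((x - y) ** 2 for rv, rw in zip(v, wh) for x, y in zip(rv, rw))


def nmf_multiplicative(v, rank, iters=200, eps=1e-12, seed=0):
    """Approximate nonnegative V (n x m) as W (n x rank) times H (rank x m)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    errors = [frobenius_error(v, w, h)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(n)]
        errors.append(frobenius_error(v, w, h))
    return w, h, errors


v = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 6.0, 9.0]]
w, h, errors = nmf_multiplicative(v, rank=1)
```

For clustering, a common convention is to assign each row of V to the column of W with the largest entry; whether the article uses that convention is not stated in the abstract.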

CLUSTERING USING AN IMPROVED HYBRID GENETIC ALGORITHM

International Journal of Artificial Intelligence Tools ◽

10.1142/s021821300700362x ◽

2007 ◽

Vol 16 (06) ◽

pp. 919-934

Author(s):

YONGGUO LIU ◽

XIAORONG PU ◽

YIDONG SHEN ◽

ZHANG YI ◽

XIAOFENG LIAO

Keyword(s):

Genetic Algorithm ◽

Clustering Algorithm ◽

Hybrid Genetic Algorithm ◽

Sum Of Squares ◽

Clustering Methods ◽

Clustering Problem ◽

Mutation Operation ◽

Iteration Methods ◽

Genetic Clustering ◽

The Individual

In this article, a new genetic clustering algorithm called the Improved Hybrid Genetic Clustering Algorithm (IHGCA) is proposed to deal with the clustering problem under the criterion of minimum sum of squares clustering. In IHGCA, the improvement operation including five local iteration methods is developed to tune the individual and accelerate the convergence speed of the clustering algorithm, and the partition-absorption mutation operation is designed to reassign objects among different clusters. By experimental simulations, its superiority over some known genetic clustering methods is demonstrated.

Empirical Evaluation of Genetic Clustering Methods Using Multilocus Genotypes From 20 Chicken Breeds

Genetics ◽

10.1093/genetics/159.2.699 ◽

2001 ◽

Vol 159 (2) ◽

pp. 699-713

Author(s):

Noah A Rosenberg ◽

Terry Burke ◽

Kari Elo ◽

Marcus W Feldman ◽

Paul J Freidlin ◽

...

Keyword(s):

Cluster Analysis ◽

Population Structure ◽

Clustering Algorithm ◽

Empirical Evaluation ◽

Unknown Origin ◽

Clustering Methods ◽

Genetic Cluster ◽

Data Set ◽

Multilocus Genotypes ◽

Chicken Breeds

We tested the utility of genetic cluster analysis in ascertaining population structure of a large data set for which population structure was previously known. Each of 600 individuals representing 20 distinct chicken breeds was genotyped for 27 microsatellite loci, and individual multilocus genotypes were used to infer genetic clusters. Individuals from each breed were inferred to belong mostly to the same cluster. The clustering success rate, measuring the fraction of individuals that were properly inferred to belong to their correct breeds, was consistently ~98%. When markers of highest expected heterozygosity were used, genotypes that included at least 8–10 highly variable markers from among the 27 markers genotyped also achieved >95% clustering success. When 12–15 highly variable markers and only 15–20 of the 30 individuals per breed were used, clustering success was at least 90%. We suggest that in species for which population structure is of interest, databases of multilocus genotypes at highly variable markers should be compiled. These genotypes could then be used as training samples for genetic cluster analysis and to facilitate assignments of individuals of unknown origin to populations. The clustering algorithm has potential applications in defining the within-species genetic units that are useful in problems of conservation.

An Improved Version of K-medoid Algorithm using CRO

Modern Applied Science ◽

10.5539/mas.v12n2p116 ◽

2018 ◽

Vol 12 (2) ◽

pp. 116 ◽

Cited By ~ 2

Author(s):

Amjad Hudaib ◽

Mohammad Khanafseh ◽

Ola Surakhi

Keyword(s):

Breast Cancer ◽

Lung Cancer ◽

Hybrid Algorithm ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Data Repository ◽

The Mean ◽

Real World Datasets ◽

Actual Point ◽

Learning Data

Clustering is the process of grouping a set of patterns into disjoint clusters, where each cluster contains similar patterns. Many clustering algorithms have been proposed. K-medoid is a variant of k-means that uses an actual point in the cluster to represent it, instead of the mean used in the k-means algorithm, in order to handle outliers and reduce noise in the cluster. To enhance the performance of the k-medoid algorithm and obtain more accurate clusters, a hybrid algorithm is proposed which uses the CRO algorithm along with k-medoid. In this method, CRO is used to expand the search for the optimal medoid and enhance clustering by producing more precise results. The performance of the new algorithm is evaluated by comparing its results with five clustering algorithms (k-means, k-medoid, DB/rand/1/bin, a CRO-based clustering algorithm, and hybrid CRO-k-means) on four real-world datasets: Lung cancer, Iris, Breast cancer Wisconsin, and Haberman’s survival from the UCI machine learning data repository. The results were compared based on different metrics and show that the proposed algorithm enhances the clustering technique by giving more accurate results.
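The difference between representing a cluster by its mean and by a medoid can be seen on a tiny example. The sketch below is only an illustration of that one idea, assuming Euclidean distance and a single cluster; the function names are made up for this example, and the CRO search described in the abstract is not reproduced here.

```python
def mean_point(points):
    """Coordinate-wise mean; usually not one of the input points."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))


def medoid(points):
    """The input point with the smallest total distance to all others."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return min(points, key=lambda c: sum(dist(c, p) for p in points))


# Four nearby points and one far outlier.
cluster = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (100.0, 0.0)]
center_mean = mean_point(cluster)   # dragged toward the outlier
center_medoid = medoid(cluster)     # stays among the nearby points
```

The mean is pulled far away from the bulk of the data by the single outlier, while the medoid remains an actual member of the dense group, which is the robustness property the abstract attributes to k-medoid.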

A Novel High-Dimensional Trajectories Construction Network based on Multi-Clustering Algorithm

10.21203/rs.3.rs-1060086/v1 ◽

2021 ◽

Author(s):

Feiyang Ren ◽

Yi Han ◽

Shaohan Wang ◽

He Jiang

Keyword(s):

Economic Analysis ◽

Clustering Algorithm ◽

Transportation Network ◽

High Dimensional ◽

Clustering Methods ◽

Marine Transportation ◽

Network Construction ◽

National Economic ◽

Multi Level ◽

State Of Art

A novel marine transportation network based on high-dimensional AIS data with a multi-level clustering algorithm is proposed to discover important waypoints in trajectories based on selected navigation features. This network contains two parts: the calculation of major nodes with the CLIQUE and BIRCH clustering methods, and navigation network construction with edge construction theory. Unlike state-of-the-art work on navigation clustering that uses only ship coordinates, the proposed method includes additional high-dimensional features such as drafting, weather, and fuel consumption. By comparing the historical AIS data, more than 220,133 lines of data in 30 days were used to extract 440 major nodal points in less than 4 minutes on an ordinary PC (i5 processor). The proposed method can be applied to higher-dimensional data for better ship path planning or even national economic analysis. The current work has shown good performance in distinguishing complex ship trajectories and great potential for future analytical predictions of the shipping transportation market.
