Method for Determining the Optimal Number of Clusters Based on Agglomerative Hierarchical Clustering

The Bag-of-Words (BoW) model is a well-known image categorization technique. However, in conventional BoW, neither the vocabulary size nor the visual words can be determined automatically. To overcome these problems, a hybrid clustering approach that combines improved hierarchical clustering with the K-means algorithm is proposed. We present a cluster validity index that lets the hierarchical clustering algorithm adaptively determine both when to terminate and the optimal number of clusters. Furthermore, we improve the max-min distance method to optimize the initial cluster centers. The optimal number of clusters and the initial cluster centers are then fed into K-means, which yields the vocabulary size and the visual words. The proposed approach is extensively evaluated on two visual datasets. The experimental results show that it outperforms the conventional BoW model in categorization, demonstrating the feasibility and effectiveness of the approach.
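The initialization step mentioned in the abstract can be illustrated with a minimal sketch of the classic max-min distance (farthest-first) rule. This is not the improved variant the paper proposes, which the abstract does not specify. The function name `max_min_centers`, the choice of the first center (the point farthest from the data centroid), and the toy points are all assumptions made for illustration.

```python
import math

def max_min_centers(points, k):
    """Pick k initial centers with the classic max-min (farthest-first) rule."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    # Assumption: the first center is the point farthest from the data centroid.
    dim = len(points[0])
    centroid = [sum(p[d] for p in points) / len(points) for d in range(dim)]
    first = max(range(len(points)), key=lambda i: math.dist(points[i], centroid))
    chosen = [first]
    # nearest[i] = distance from point i to its closest center chosen so far.
    nearest = [math.dist(p, points[first]) for p in points]
    while len(chosen) < k:
        # The next center is the point whose closest center is farthest away.
        nxt = max(range(len(points)), key=lambda i: nearest[i])
        chosen.append(nxt)
        nearest = [min(nearest[i], math.dist(p, points[nxt]))
                   for i, p in enumerate(points)]
    return [points[i] for i in chosen]

# Toy data: three loose groups of two points each (hypothetical values).
points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9), (10.0, 0.0), (9.8, 0.3)]
centers = max_min_centers(points, 3)
```

On the toy data the three returned centers come from three different groups, which is the property that makes such seeds a reasonable starting point for K-means once the number of clusters is known.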


Cluster Validity Measures Based Agglomerative Hierarchical Clustering for Network Data

Journal of Advanced Computational Intelligence and Intelligent Informatics, 2019, Vol 23 (3), pp. 577-583. DOI: 10.20965/jaciii.2019.p0577

Author(s): Yukihiro Hamasuna, Shusuke Nakano, Ryo Ozaki, Yasunori Endo, ...

Keyword(s): Hierarchical Clustering, Optimal Number, Network Data, Network Clustering, Agglomerative Hierarchical Clustering, Cluster Validity, Validity Measures, Evaluation Measure, Network Partitions, Optimal Number Of Clusters

The Louvain method is an agglomerative hierarchical clustering (AHC) method that uses modularity as the merging criterion. Modularity is an evaluation measure for network partitions. Cluster validity measures are likewise used to evaluate cluster partitions and to determine the optimal number of clusters, and several of them are constructed from the geometric features of clusters. From the viewpoint of evaluating cluster partitions, these measures and modularity can be regarded as the same concept. In this paper, cluster-validity-measure-based agglomerative hierarchical clustering (CVAHC) is proposed as a novel clustering method for network data. In the proposed method, the cluster validity measures serve both as the merging criterion and as the evaluation measure for network data. Numerical experiments show that Dunn’s and Xie-Beni’s indices for network partitions are useful for network clustering.
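As a point of reference for the modularity measure discussed in the abstract, here is a minimal sketch of the standard Newman-Girvan modularity for an undirected, unweighted graph, Q = sum over communities of (internal edges / m) - (total degree / 2m)^2. It does not implement CVAHC or the paper's network versions of Dunn's and Xie-Beni's indices. The function name `modularity`, the edge-list input format, and the toy graph are assumptions.

```python
from collections import defaultdict

def modularity(edges, community_of):
    """Newman-Girvan modularity of a partition.

    Assumes an undirected, unweighted graph with no self-loops,
    where each edge appears exactly once in `edges`.
    """
    m = len(edges)
    if m == 0:
        raise ValueError("graph has no edges")
    internal = defaultdict(int)    # edges with both endpoints in the community
    degree_sum = defaultdict(int)  # summed degree of the community's nodes
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu == cv:
            internal[cu] += 1
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
               for c in degree_sum)

# Toy graph: two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
split = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
single = {node: "all" for node in range(6)}

q_split = modularity(edges, split)
q_single = modularity(edges, single)
```

For this toy graph the two-triangle partition scores 5/14, while placing every node in one community scores 0, so a modularity-driven merge procedure would stop before joining the two triangles.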


Internal evaluation criteria for categorical data in hierarchical clustering

Advances in Methodology and Statistics, 2018, Vol 15 (2). DOI: 10.51936/lxut1974

Author(s): Zdeněk Šulc, Jana Cibulková, Jiří Procházka, Hana Řezanková

Keyword(s): Hierarchical Clustering, Categorical Data, Likelihood Function, Evaluation Criteria, Optimal Number, Number Of Clusters, Internal Evaluation, Correct Number, Cluster Quality, Optimal Number Of Clusters

The paper compares 11 internal evaluation criteria for hierarchical clustering of categorical data with respect to how well they determine the correct number of clusters. The criteria are divided into three groups according to how they treat cluster quality: the variability-based criteria use the within-cluster variability, the likelihood-based criteria maximize the likelihood function, and the distance-based criteria use distances within and between clusters. The aim is to determine which evaluation criteria perform well and under what conditions. Different analysis settings, such as the hierarchical clustering method used, and various dataset properties, such as the number of variables or the minimal between-cluster distances, are examined. The experiment is conducted on 810 generated datasets, on which the evaluation criteria are assessed by how often they determine the optimal number of clusters and by their mean absolute errors. The results indicate that the likelihood-based BIC1 and variability-based BK criteria perform relatively well in determining the optimal number of clusters and that some criteria, usually the distance-based ones, should be avoided.
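To make the idea of within-cluster variability for categorical data concrete, the following sketch computes a size-weighted within-cluster entropy. It is a generic illustration only: it is not the BK or BIC1 criterion from the paper, whose formulas the abstract does not give. The function name `within_cluster_entropy` and the toy rows are assumptions.

```python
import math
from collections import Counter

def within_cluster_entropy(rows, labels):
    """Size-weighted within-cluster entropy (nats), summed over variables.

    Lower values mean the categories inside each cluster are more homogeneous.
    """
    n = len(rows)
    clusters = {}
    for row, label in zip(rows, labels):
        clusters.setdefault(label, []).append(row)
    total = 0.0
    for members in clusters.values():
        size = len(members)
        # zip(*members) iterates over the cluster column by column.
        for column in zip(*members):
            counts = Counter(column)
            entropy = -sum((c / size) * math.log(c / size)
                           for c in counts.values())
            total += (size / n) * entropy
    return total

# Toy categorical data with two variables (hypothetical values).
rows = [("red", "small"), ("red", "small"), ("blue", "large"), ("blue", "large")]
pure = within_cluster_entropy(rows, [0, 0, 1, 1])
mixed = within_cluster_entropy(rows, [0, 1, 0, 1])
```

A variability-based criterion would typically track how a quantity like this changes as the number of clusters grows; the partition that groups identical rows together has zero variability, while the mixed partition does not.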


Indexes to Find the Optimal Number of Clusters in a Hierarchical Clustering

Advances in Intelligent Systems and Computing - 14th International Conference on Soft Computing Models in Industrial and Environmental Applications (SOCO 2019), 2019, pp. 3-13. DOI: 10.1007/978-3-030-20055-8_1

Author(s): José David Martín-Fernández, José María Luna-Romera, Beatriz Pontes, José C. Riquelme-Santos

Keyword(s): Hierarchical Clustering, Optimal Number, Number Of Clusters, Optimal Number Of Clusters


Method for determining optimal number of clusters in K-means clustering algorithm

Journal of Computer Applications, 2010, Vol 30 (8), pp. 1995-1998. DOI: 10.3724/sp.j.1087.2010.01995

Cited By ~ 18

Author(s): Shi-bing ZHOU, Zhen-yuan XU, Xu-qing TANG

Keyword(s): Clustering Algorithm, Optimal Number, Number Of Clusters, Optimal Number Of Clusters


Clustering Count-based RNA Methylation Data Using a Nonparametric Generative Model

Current Bioinformatics, 2018, Vol 14 (1), pp. 11-23. DOI: 10.2174/1574893613666180601080008

Cited By ~ 3

Author(s): Lin Zhang, Yanling He, Huaizhi Wang, Hui Liu, Yufei Huang, ...

Keyword(s): Clustering Analysis, Methylation Level, Optimal Number, Generative Model, Methylation Data, Sequencing Data, Number Of Clusters, Rna Methylation, Clustering Effect, Optimal Number Of Clusters

Background: The RNA methylome has been identified as an important layer of gene regulation and can be profiled directly with count-based measurements from high-throughput sequencing data. Although the detailed regulatory circuit of the epitranscriptome remains uncharted, a clustering effect in methylation status among different RNA methylation sites can be identified from transcriptome-wide RNA methylation profiles and may reflect epitranscriptomic regulation. Count-based RNA methylation sequencing data has unique features, such as low read coverage, which call for novel clustering approaches. Objective: In addition to handling low read coverage, a clustering analysis of count-based RNA methylation sequencing data must preserve the integer nature of the measurements. Method: We proposed a nonparametric generative model together with its Gibbs sampling solution for clustering analysis. The proposed approach implements a beta-binomial mixture model to capture the clustering effect in methylation level with the original count-based measurements rather than an estimated continuous methylation level. In addition, it adopts a nonparametric Dirichlet process to automatically determine an optimal number of clusters, avoiding the common model selection problem in clustering analysis. Results: When tested on the simulated system, the method demonstrated improved clustering performance over hierarchical clustering, K-means, MClust, NMF and EMclust. On a real dataset, it also revealed two novel RNA N6-methyladenosine (m6A) co-methylation patterns that may be induced directly by METTL14 and WTAP, two known regulatory components of the RNA m6A methyltransferase complex. Conclusion: Our proposed DPBBM method not only properly handles the count-based measurements of RNA methylation data from sites with very low read coverage, but also learns an optimal number of clusters adaptively from the data analyzed.
Availability: The source code and documents of DPBBM R package are freely available through the Comprehensive R Archive Network (CRAN): https://cran.r-project.org/web/packages/DPBBM/.
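The beta-binomial component named in the abstract can be sketched as a log-likelihood on raw counts. This shows only the standard beta-binomial probability mass function, not the DPBBM package's API, its Dirichlet process prior, or its Gibbs sampler. The function names, the parameter names `alpha` and `beta`, and the example counts are assumptions.

```python
import math

def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def beta_binomial_logpmf(k, n, alpha, beta):
    """Log-probability of k methylated reads out of n total reads.

    Standard beta-binomial pmf: C(n, k) * B(k + alpha, n - k + beta) / B(alpha, beta).
    """
    if not 0 <= k <= n:
        raise ValueError("k must satisfy 0 <= k <= n")
    if alpha <= 0 or beta <= 0:
        raise ValueError("alpha and beta must be positive")
    log_choose = (math.lgamma(n + 1) - math.lgamma(k + 1)
                  - math.lgamma(n - k + 1))
    return (log_choose + log_beta(k + alpha, n - k + beta)
            - log_beta(alpha, beta))

# Hypothetical low-coverage site: 1 methylated read out of 3.
# Compare two hypothetical cluster parameter settings.
low_cluster = beta_binomial_logpmf(1, 3, alpha=1.0, beta=4.0)
high_cluster = beta_binomial_logpmf(1, 3, alpha=4.0, beta=1.0)
```

Because the likelihood is evaluated on the integer counts themselves, a site with only three reads contributes less evidence than a site with thirty reads at the same methylation ratio, which a continuous ratio estimate would not capture.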
