Method for Determining the Optimal Number of Clusters Based on Agglomerative Hierarchical Clustering

2017 ◽  
Vol 28 (12) ◽  
pp. 3007-3017 ◽  
Author(s):  
Shibing Zhou ◽  
Zhenyuan Xu ◽  
Fei Liu


2019 ◽  
Vol 2019 ◽  
pp. 1-11 ◽  
Author(s):  
Hui Huang ◽  
Yan Ma

The Bag-of-Words (BoW) model is a well-known image categorization technique. However, in conventional BoW, neither the vocabulary size nor the visual words can be determined automatically. To overcome these problems, a hybrid clustering approach that combines improved hierarchical clustering with the K-means algorithm is proposed. We present a cluster validity index for the hierarchical clustering algorithm that adaptively determines both when the algorithm should terminate and the optimal number of clusters. Furthermore, we improve the max-min distance method to optimize the initial cluster centers. The optimal number of clusters and the initial cluster centers are fed into K-means, which finally yields the vocabulary size and the visual words. The proposed approach is extensively evaluated on two visual datasets. The experimental results show that the proposed method outperforms the conventional BoW model in terms of categorization, demonstrating the feasibility and effectiveness of our approach.
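
The seeding step mentioned in the abstract above can be illustrated with the classic max-min (farthest-first) distance method. This is a minimal sketch only: it is not the authors' improved variant, the function name `max_min_centers` is hypothetical, and starting from the first point is an assumption made for simplicity.

```python
import math


def max_min_centers(points, k):
    """Pick k initial centers with the classic max-min distance rule.

    Assumption: the first point is used as the first center. The paper's
    improved variant is not reproduced here.
    """
    centers = [points[0]]
    while len(centers) < k:
        # Distance from each point to its nearest already-chosen center.
        nearest = [min(math.dist(p, c) for c in centers) for p in points]
        # The next center is the point farthest from all chosen centers.
        centers.append(points[nearest.index(max(nearest))])
    return centers


# Toy 2-D data with three visually separate groups.
points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.1, 4.9), (10.0, 0.0)]
centers = max_min_centers(points, 3)
print(centers)
```

Centers chosen this way are spread across the data, which is why such a rule is commonly used to initialize K-means once the number of clusters is known.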


Author(s):  
Yukihiro Hamasuna ◽  
Shusuke Nakano ◽  
Ryo Ozaki ◽  
Yasunori Endo ◽  
...  

The Louvain method is an agglomerative hierarchical clustering (AHC) method that uses modularity as the merging criterion. Modularity is an evaluation measure for network partitions. Cluster validity measures are likewise used to evaluate cluster partitions and to determine the optimal number of clusters, and several of them are constructed from the geometric features of clusters. From the viewpoint of evaluating cluster partitions, these measures and modularity can be regarded as the same concept. In this paper, cluster validity measures-based agglomerative hierarchical clustering (CVAHC) is proposed as a novel clustering method for network data. In the proposed method, cluster validity measures serve as both the merging criterion and the evaluation measure for network data. Numerical experiments show that Dunn’s and Xie-Beni’s indices for network partitions are useful for network clustering.
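
To make the role of modularity as an evaluation measure concrete, the sketch below computes the standard Newman modularity of a partition of an undirected, unweighted graph, Q = sum over communities of (L_c / m - (d_c / 2m)^2), where m is the number of edges, L_c the number of edges inside community c, and d_c the total degree of its nodes. The function name `modularity` and the toy graph are assumptions for illustration; the CVAHC method itself is not implemented here.

```python
def modularity(edges, community):
    """Newman modularity of a partition of an undirected, unweighted graph.

    edges: list of (u, v) pairs, each undirected edge listed once.
    community: dict mapping every node to its community label.
    """
    m = len(edges)
    internal = {}  # L_c: edges with both endpoints in community c
    degree = {}    # d_c: sum of node degrees in community c
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(internal.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree.items())


# Two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_group = {node: "a" for node in range(6)}

q_two = modularity(edges, two_groups)
q_one = modularity(edges, one_group)
print(q_two, q_one)
```

The two-triangle partition scores higher than the trivial single-community partition, which is the kind of comparison a merging criterion relies on.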


2018 ◽  
Vol 15 (2) ◽  
Author(s):  
Zdeněk Šulc ◽  
Jana Cibulková ◽  
Jiří Procházka ◽  
Hana Řezanková

The paper compares 11 internal evaluation criteria for hierarchical clustering of categorical data with respect to determining the correct number of clusters. The criteria are divided into three groups according to how they treat cluster quality: the variability-based criteria use the within-cluster variability, the likelihood-based criteria maximize the likelihood function, and the distance-based criteria use distances within and between clusters. The aim is to determine which evaluation criteria perform well and under what conditions. Different analysis settings, such as the hierarchical clustering method used, and various dataset properties, such as the number of variables or the minimal between-cluster distances, are examined. The experiment is conducted on 810 generated datasets, on which the evaluation criteria are assessed by how well they determine the optimal number of clusters and by their mean absolute errors. The results indicate that the likelihood-based BIC1 and variability-based BK criteria perform relatively well in determining the optimal number of clusters and that some criteria, usually the distance-based ones, should be avoided.
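
As a rough illustration of what "within-cluster variability" can mean for categorical data, the sketch below computes a size-weighted within-cluster Shannon entropy averaged over variables. This is an assumed, generic measure chosen for illustration; it is not one of the 11 criteria compared in the paper (in particular, it is neither BIC1 nor BK), and the function name `within_cluster_entropy` is hypothetical.

```python
import math
from collections import Counter


def within_cluster_entropy(rows, labels):
    """Size-weighted within-cluster Shannon entropy, averaged over variables.

    rows: list of tuples of categorical values (one tuple per object).
    labels: cluster label of each object, in the same order as rows.
    Lower values mean more homogeneous clusters.
    """
    n = len(rows)
    n_vars = len(rows[0])
    total = 0.0
    for cluster in set(labels):
        members = [r for r, lab in zip(rows, labels) if lab == cluster]
        size = len(members)
        for j in range(n_vars):
            counts = Counter(r[j] for r in members)
            entropy = -sum((c / size) * math.log(c / size)
                           for c in counts.values())
            total += (size / n) * entropy
    return total / n_vars


rows = [("a", "x"), ("a", "x"), ("b", "y"), ("b", "y")]
good = within_cluster_entropy(rows, [0, 0, 1, 1])  # homogeneous clusters
bad = within_cluster_entropy(rows, [0, 1, 0, 1])   # mixed clusters
print(good, bad)
```

A variability-based criterion would typically combine a quantity like this with a penalty or a ratio across successive numbers of clusters, since variability alone always decreases as clusters are split.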


2018 ◽  
Vol 14 (1) ◽  
pp. 11-23 ◽  
Author(s):  
Lin Zhang ◽  
Yanling He ◽  
Huaizhi Wang ◽  
Hui Liu ◽  
Yufei Huang ◽  
...  

Background: The RNA methylome has been discovered as an important layer of gene regulation and can be profiled directly with count-based measurements from high-throughput sequencing data. Although the detailed regulatory circuit of the epitranscriptome remains uncharted, a clustering effect in methylation status among different RNA methylation sites can be identified from transcriptome-wide RNA methylation profiles and may reflect epitranscriptomic regulation. Count-based RNA methylation sequencing data have unique features, such as low read coverage, which call for novel clustering approaches.

Objective: Besides handling the low read coverage, clustering analysis of count-based RNA methylation sequencing data must also preserve the integer property of the measurements.

Method: We proposed a nonparametric generative model together with its Gibbs sampling solution for clustering analysis. The proposed approach implements a beta-binomial mixture model to capture the clustering effect in methylation level with the original count-based measurements rather than an estimated continuous methylation level. In addition, it adopts a nonparametric Dirichlet process to automatically determine an optimal number of clusters, avoiding the common model selection problem in clustering analysis.

Results: When tested on the simulated system, the method demonstrated improved clustering performance over hierarchical clustering, K-means, MClust, NMF and EMclust. On a real dataset, it also revealed two novel RNA N6-methyladenosine (m6A) co-methylation patterns that may be induced directly by METTL14 and WTAP, two known regulatory components of the RNA m6A methyltransferase complex.

Conclusion: Our proposed DPBBM method not only properly handles the count-based measurements of RNA methylation data from sites of very low read coverage, but also learns an optimal number of clusters adaptively from the data analyzed.

Availability: The source code and documents of the DPBBM R package are freely available through the Comprehensive R Archive Network (CRAN): https://cran.r-project.org/web/packages/DPBBM/.
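
The building block of a beta-binomial mixture is the beta-binomial likelihood of observing k methylated reads out of n total reads at a site. The sketch below evaluates its log-probability with the standard closed form C(n, k) B(k + alpha, n - k + beta) / B(alpha, beta). It is written in Python for illustration only and does not reproduce the DPBBM R package, its Gibbs sampler, or the Dirichlet process prior; the function names and the example counts are assumptions.

```python
import math


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def beta_binomial_logpmf(k, n, alpha, beta):
    """Log-probability of k successes in n trials under a beta-binomial.

    Here k would be the methylated read count and n the total read count
    at one site (an interpretation assumed for this sketch).
    """
    log_choose = (math.lgamma(n + 1) - math.lgamma(k + 1)
                  - math.lgamma(n - k + 1))
    return (log_choose + log_beta(k + alpha, n - k + beta)
            - log_beta(alpha, beta))


# With alpha = beta = 1 the distribution is uniform over 0..n.
p_uniform = math.exp(beta_binomial_logpmf(3, 10, 1.0, 1.0))
# Probabilities over all possible counts sum to one.
p_total = sum(math.exp(beta_binomial_logpmf(k, 10, 2.0, 5.0))
              for k in range(11))
print(p_uniform, p_total)
```

Working with the integer counts directly, as here, keeps sites with very few reads from being treated as if their estimated methylation level were precise.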

