Selection of Optimal Number of Clusters and Centroids for K-means and Fuzzy C-means Clustering: A Review

The fuzzy C-means (FCM) clustering algorithm partitions a data set according to a given number of clusters, a set of initial cluster centers, a fuzzy factor, an iteration limit, and a convergence threshold. This makes FCM sensitive to how the cluster prototypes are initialized. First, the article combines the maximum-minimum (max-min) distance algorithm with the K-means algorithm to determine the number of clusters and the initial cluster centers. Second, it determines the optimal number of clusters with the Silhouette index. Finally, it improves the convergence rate of FCM by repeatedly revising the memberships. According to the authors, the improved FCM clusters more efficiently and effectively than the traditional FCM method: it yields tighter clusters, better separation between clusters, greater cluster stability, and faster convergence.
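The initialization idea can be illustrated with a minimal sketch. It is not the article's implementation: it assumes a farthest-point reading of the max-min distance rule, skips the intermediate K-means step, and uses the textbook FCM update equations. The names `maximin_centers` and `fcm`, the toy `points`, and the defaults `m=2.0` and `tol=1e-5` are illustrative assumptions.

```python
import math

def dist(a, b):
    """Euclidean distance between two points given as sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def maximin_centers(points, k):
    """Max-min seeding (assumed variant): start from the first point, then
    repeatedly add the point farthest from its nearest chosen seed."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    return centers

def fcm(points, centers, m=2.0, max_iter=100, tol=1e-5):
    """Textbook fuzzy c-means iterations starting from the given centers."""
    centers = [list(c) for c in centers]
    c, n, dim = len(centers), len(points), len(points[0])
    for _ in range(max_iter):
        # Membership update: u_ij = 1 / sum_k (d_ij / d_ik)^(2 / (m - 1))
        u = []
        for p in points:
            d = [max(dist(p, ctr), 1e-12) for ctr in centers]  # avoid division by zero
            u.append([1.0 / sum((d[j] / d[k]) ** (2.0 / (m - 1.0)) for k in range(c))
                      for j in range(c)])
        # Center update: mean of the points weighted by u_ij^m
        new_centers = []
        for j in range(c):
            w = [u[i][j] ** m for i in range(n)]
            total = sum(w)
            new_centers.append([sum(w[i] * points[i][t] for i in range(n)) / total
                                for t in range(dim)])
        shift = max(dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:  # stop once the centers barely move
            break
    return centers, u  # u holds the memberships from the last update step

# Toy data (hypothetical): two well-separated groups of three points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, u = fcm(points, maximin_centers(points, 2))
```

Seeding from mutually distant points tends to place one seed in each well-separated group, which is the motivation for replacing random initialization.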

Investigating cluster validation metrics for optimal number of clusters determination

Intelligent Decision Technologies ◽

10.3233/idt-210187 ◽

2021 ◽

pp. 1-16

Author(s):

Aikaterini Karanikola ◽

Charalampos M. Liapis ◽

Sotiris Kotsiantis

Keyword(s):

Real World ◽

Optimal Number ◽

Cluster Validation ◽

Clustering Methods ◽

Number Of Clusters ◽

Validity Indices ◽

Selection Of ◽

Specific Distance ◽

Optimal Number Of Clusters

In short, clustering is the process of partitioning a given set of objects into groups containing highly related instances. This relation is determined by a specific distance metric with which the intra-cluster similarity is estimated. Finding an optimal number of such partitions is usually the key step in the entire process, yet a rather difficult one. Selecting an unsuitable number of clusters might lead to incorrect conclusions and, consequently, to wrong decisions, and the term “optimal” is itself quite ambiguous. Furthermore, inherent characteristics of the datasets, such as overlapping clusters or clusters containing subclusters, often make the task harder. Thus, the methods used to detect similarities and the parameter selection of the partition algorithm have a major impact on the quality of the groups and the identification of their optimal number. Because each dataset constitutes a rather distinct case, validity indices were introduced as indicators for selecting such an optimal number of clusters. In this work, an extensive set of well-known validity indices, based on the so-called relative criteria approach, is examined comparatively. A total of 26 cluster validation measures were investigated in two distinct case studies: one on real-world data and one on artificially generated data. To ensure a certain degree of difficulty, both the real-world and the generated data were selected to exhibit variations and inhomogeneity. Each index is applied under 9 different clustering methods, which incorporate 5 different distance metrics. All results are presented in various explanatory forms.
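As a rough illustration of the relative-criteria approach, the sketch below scores several candidate partitions of the same data with one validity index and keeps the best-scoring number of clusters. It uses the silhouette index with Euclidean distance as an example; the abstract does not list which 26 indices were studied. The toy `points` and the hand-written `candidates` labelings are hypothetical stand-ins for the output of a clustering method run with different settings.

```python
import math

def dist(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def silhouette(points, labels):
    """Mean silhouette width of a crisp partition (needs at least 2 clusters)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(sum(dist(p, q) for q in other) / len(other)
                for key, other in clusters.items() if key != lab)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

# Toy data (hypothetical): three compact groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (20, 0), (20, 1), (21, 0)]

# Candidate partitions for k = 2, 3, 4, written by hand for illustration.
candidates = {
    2: [0, 0, 0, 1, 1, 1, 1, 1, 1],
    3: [0, 0, 0, 1, 1, 1, 2, 2, 2],
    4: [0, 0, 3, 1, 1, 1, 2, 2, 2],
}

scores = {k: silhouette(points, labels) for k, labels in candidates.items()}
best_k = max(scores, key=scores.get)  # the silhouette index is maximized
```

Other indices follow the same pattern but differ in whether the best partition is the one that maximizes or minimizes the index.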

Enhanced cluster validity index for the evaluation of optimal number of clusters for Fuzzy C-Means algorithm

2014 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE) ◽

10.1109/fuzz-ieee.2014.6891591 ◽

2014 ◽

Cited By ~ 10

Author(s):

Neha Bharill ◽

Aruna Tiwari

Keyword(s):

Optimal Number ◽

Cluster Validity ◽

Cluster Validity Index ◽

Validity Index ◽

Number Of Clusters ◽

Fuzzy C Means ◽

Fuzzy C Means Algorithm ◽

Optimal Number Of Clusters

Fault Diagnosis Based on Fuzzy C-means Algorithm of the Optimal Number of Clusters and Probabilistic Neural Network

International Journal of Intelligent Engineering and Systems ◽

10.22266/ijies2011.0630.06 ◽

2011 ◽

Vol 4 (2) ◽

pp. 51-59 ◽

Cited By ~ 5

Author(s):

Qing Yang ◽

Jingran Guo ◽

Dongxu Zhang ◽

Chang Liu ◽

...

Keyword(s):

Neural Network ◽

Fault Diagnosis ◽

Probabilistic Neural Network ◽

Optimal Number ◽

Number Of Clusters ◽

Fuzzy C Means ◽

Fuzzy C Means Algorithm ◽

Optimal Number Of Clusters

Delineation of homogeneous regions for streamflow via fuzzy c-means in the Amazon

Water Practice & Technology ◽

10.2166/wpt.2018.035 ◽

2018 ◽

Vol 13 (1) ◽

pp. 210-218 ◽

Cited By ~ 2

Author(s):

Francisco Carlos Lira Pessoa ◽

Claudio José Cavalcante Blanco ◽

Evanice Pinheiro Gomes

Keyword(s):

Cluster Analysis ◽

Optimal Number ◽

Amazon Region ◽

Number Of Clusters ◽

Fuzzy C Means ◽

Important Stage ◽

Streamflow Data ◽

Homogeneous Regions ◽

Climatic Characteristics ◽

Optimal Number Of Clusters

Abstract: Lack of streamflow data is one of the main limitations in hydrologic studies. One way to address this problem is streamflow regionalization, whose main and most important stage is the identification of hydrologically homogeneous regions. In this study, homogeneous flow regions are identified by fuzzy c-means (FCM) cluster analysis based on morpho-climatic characteristics from streamflow at 208 stream gauges in the Amazon region. The optimal number of clusters was identified by applying the PBM validation index, which was maximized for ten clusters with a fuzzification parameter of 1.6. The resulting 10 groups were well defined and demonstrated the Amazon's hydrologic similarity.
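For readers unfamiliar with the PBM index, the following sketch shows how such an index can be evaluated for competing partitions. It assumes the commonly cited form PBM(K) = ((1/K) * (E_1 / E_K) * D_K)^2, where E_1 is the summed distance of all points to the overall centroid, E_K is the membership-weighted summed distance to the cluster centers, and D_K is the largest distance between two centers. The exponent `m` on the memberships follows the fuzzy variant. The study's own formulation may differ, and the helper names and toy data are hypothetical.

```python
import math

def dist(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pbm_index(points, centers, memberships, m=1.0):
    """PBM index in its commonly cited form; larger values are better.
    memberships[i][j] is the degree to which point i belongs to cluster j."""
    k, n, dim = len(centers), len(points), len(points[0])
    grand = [sum(p[t] for p in points) / n for t in range(dim)]
    e1 = sum(dist(p, grand) for p in points)            # one-cluster scatter
    ek = sum((memberships[i][j] ** m) * dist(points[i], centers[j])
             for i in range(n) for j in range(k))       # within-cluster scatter
    dk = max(dist(centers[i], centers[j])
             for i in range(k) for j in range(i + 1, k))  # widest center gap
    return ((1.0 / k) * (e1 / ek) * dk) ** 2

def crisp_partition(points, labels):
    """Hypothetical helper: centroids and 0/1 memberships from hard labels."""
    ids = sorted(set(labels))
    dim = len(points[0])
    centers = []
    for c in ids:
        members = [p for p, lab in zip(points, labels) if lab == c]
        centers.append([sum(p[t] for p in members) / len(members) for t in range(dim)])
    memberships = [[1.0 if lab == c else 0.0 for c in ids] for lab in labels]
    return centers, memberships

# Toy data (hypothetical), unrelated to the streamflow dataset.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (20, 0), (20, 1), (21, 0)]
pbm_k2 = pbm_index(points, *crisp_partition(points, [0, 0, 0, 1, 1, 1, 1, 1, 1]))
pbm_k3 = pbm_index(points, *crisp_partition(points, [0, 0, 0, 1, 1, 1, 2, 2, 2]))
```

With FCM output, the same function can be called with the fuzzy membership matrix and `m` set to the fuzzification parameter; the number of clusters with the largest index value is then selected.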

Fuzzy C-Means Algorithm Automatically Determining Optimal Number of Clusters

Computers Materials & Continua ◽

10.32604/cmc.2019.04500 ◽

2019 ◽

Vol 60 (2) ◽

pp. 767-780

Author(s):

Ruikang Xing ◽

Chenghai Li

Keyword(s):

Optimal Number ◽

Number Of Clusters ◽

Fuzzy C Means ◽

Fuzzy C Means Algorithm ◽

Optimal Number Of Clusters

Method for determining optimal number of clusters in K-means clustering algorithm

Journal of Computer Applications ◽

10.3724/sp.j.1087.2010.01995 ◽

2010 ◽

Vol 30 (8) ◽

pp. 1995-1998 ◽

Cited By ~ 18

Author(s):

Shi-bing ZHOU ◽

Zhen-yuan XU ◽

Xu-qing TANG

Keyword(s):

Clustering Algorithm ◽

Optimal Number ◽

Number Of Clusters ◽

Optimal Number Of Clusters

Clustering Count-based RNA Methylation Data Using a Nonparametric Generative Model

Current Bioinformatics ◽

10.2174/1574893613666180601080008 ◽

2018 ◽

Vol 14 (1) ◽

pp. 11-23 ◽

Cited By ~ 3

Author(s):

Lin Zhang ◽

Yanling He ◽

Huaizhi Wang ◽

Hui Liu ◽

Yufei Huang ◽

...

Keyword(s):

Clustering Analysis ◽

Methylation Level ◽

Optimal Number ◽

Generative Model ◽

Methylation Data ◽

Sequencing Data ◽

Number Of Clusters ◽

RNA Methylation ◽

Clustering Effect ◽

Optimal Number Of Clusters

Background: The RNA methylome has been discovered as an important layer of gene regulation and can be profiled directly with count-based measurements from high-throughput sequencing data. Although the detailed regulatory circuit of the epitranscriptome remains uncharted, a clustering effect in methylation status among different RNA methylation sites can be identified from transcriptome-wide RNA methylation profiles and may reflect epitranscriptomic regulation. Count-based RNA methylation sequencing data has unique features, such as low read coverage, which call for novel clustering approaches. Objective: Besides handling low read coverage, clustering analysis of count-based RNA methylation sequencing data also needs to preserve the integer nature of the counts. Method: We proposed a nonparametric generative model together with its Gibbs sampling solution for clustering analysis. The proposed approach implements a beta-binomial mixture model to capture the clustering effect in methylation level with the original count-based measurements rather than an estimated continuous methylation level. In addition, it adopts a nonparametric Dirichlet process to automatically determine an optimal number of clusters, avoiding the common model selection problem in clustering analysis. Results: When tested on the simulated system, the method demonstrated improved clustering performance over hierarchical clustering, K-means, MClust, NMF and EMclust. On a real dataset, it also revealed two novel RNA N6-methyladenosine (m6A) co-methylation patterns that may be induced directly by METTL14 and WTAP, two known regulatory components of the RNA m6A methyltransferase complex. Conclusion: Our proposed DPBBM method not only properly handles the count-based measurements of RNA methylation data from sites with very low read coverage, but also learns an optimal number of clusters adaptively from the data analyzed.
Availability: The source code and documents of the DPBBM R package are freely available through the Comprehensive R Archive Network (CRAN): https://cran.r-project.org/web/packages/DPBBM/.
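The two building blocks named in the Method section can be sketched in a few lines. This is not the DPBBM package's API and it omits the Gibbs sampler entirely. It only shows the standard beta-binomial probability of observing `k` methylated reads out of `n` total reads, and the standard Chinese restaurant process view of a Dirichlet process prior, in which a new cluster can always be opened. All function and parameter names are illustrative assumptions.

```python
import math

def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def beta_binomial_pmf(k, n, alpha, beta):
    """P(k successes in n trials) when the success rate is Beta(alpha, beta)
    distributed: C(n, k) * B(k + alpha, n - k + beta) / B(alpha, beta)."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return math.exp(log_choose + log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta))

def crp_probabilities(cluster_sizes, concentration):
    """Prior probabilities that the next item joins each existing cluster,
    followed by the probability that it opens a new cluster."""
    denom = sum(cluster_sizes) + concentration
    return [size / denom for size in cluster_sizes] + [concentration / denom]

# A low-coverage site (hypothetical numbers): 2 methylated reads out of 5.
p_site = beta_binomial_pmf(2, 5, alpha=1.0, beta=1.0)

# Two existing clusters with 3 and 1 sites, concentration parameter 1.0.
prior = crp_probabilities([3, 1], concentration=1.0)
```

Because the counts enter the likelihood directly, a site with 2 of 5 reads is treated differently from one with 200 of 500, which a continuous methylation level of 0.4 would not distinguish. The always-available "new cluster" option is what lets the number of clusters be inferred rather than fixed in advance.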
