Estimating the Optimal Number of Clusters in Categorical Data Clustering by Silhouette Coefficient

Author(s):  
Duy-Tai Dinh ◽  
Tsutomu Fujinami ◽  
Van-Nam Huynh
2012 ◽  
Vol 3 (1) ◽  
pp. 1-20
Author(s):  
Amit Banerjee

In this paper, a multi-objective genetic algorithm for data clustering based on the robust fuzzy least trimmed squares estimator is presented. The proposed clustering methodology addresses two critical issues in unsupervised data clustering: the ability to produce meaningful partitions in noisy data, and the requirement that the number of clusters be known a priori. The multi-objective genetic algorithm-driven clustering technique optimizes the number of clusters as well as the cluster assignment and the cluster prototypes. A two-parameter, mapped, fixed point coding scheme is used to represent the assignment of data into the true retained set and the noisy trimmed set, and the optimal number of clusters in the retained set. A three-objective criterion is used as the minimization functional for the multi-objective genetic algorithm. Results on well-known data sets from the literature suggest that the proposed methodology is superior to conventional fuzzy clustering algorithms that assume a known value for the optimal number of clusters.


Algorithms ◽  
2018 ◽  
Vol 11 (11) ◽  
pp. 177 ◽  
Author(s):  
Xuedong Gao ◽  
Minghan Yang

Clustering is one of the main tasks of machine learning. Internal clustering validation indexes (CVIs) are used to measure the quality of several clustered partitions to determine the locally optimal clustering results in an unsupervised manner, and can act as the objective function of clustering algorithms. In this paper, we first studied several well-known internal CVIs for categorical data clustering, and proved that evaluating partitions with different numbers of clusters is ineffective without any inter-cluster separation measures or assumptions; the accuracy of the separation measure, along with its coordination with the intra-cluster compactness measures, can notably affect performance. Then, aiming to enhance the internal clustering validation measurement, we proposed a new internal CVI, clustering utility based on the averaged information gain of isolating each cluster (CUBAGE), which measures both the compactness and the separation of the partition. The experimental results supported our findings with regard to the existing internal CVIs, and showed that the proposed CUBAGE outperforms other internal CVIs with or without a pre-known number of clusters.


2021 ◽  
Vol 10 (3) ◽  
pp. 359-366
Author(s):  
Hanik Malikhatin ◽  
Agus Rusgiyono ◽  
Di Asih I Maruddani

Prospective TKI workers who apply for passports at the Immigration Office Class I Non TPI Pati have different destination countries and choose different PPTKIS agencies. Therefore, a grouping of the characteristics of prospective TKI workers is needed, so that it can serve as a reference for the government in its efforts to improve the protection of TKI in destination countries and to supervise the PPTKIS that manage TKI more strictly. The purpose of this research is to group prospective TKI workers by their characteristics using the optimal number of clusters. The method used is k-Modes clustering with k = 2, 3, 4, and 5; this method can cluster categorical data. The optimal number of clusters is determined using the Dunn Index. To make grouping the data easier, a Graphical User Interface (GUI) application was built with RStudio. Based on the analysis, the optimal number of clusters is two, with a Dunn Index value of 0.4. Cluster 1 consists mostly of male TKI workers (51.04%), aged ≥ 20 years (91.93%), with Malaysia as the destination country (47%), and choosing PPTKIS Surya Jaya Utama Abadi (37.51%), while cluster 2 consists mostly of male TKI workers (94.10%), aged ≥ 20 years (82.31%), with South Korea as the destination country (77.95%), and choosing PPTKIS BNP2TKI (99.78%).
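As a rough illustration of the validation step described in this abstract, the sketch below computes a Dunn index for categorical records. It assumes the simple matching (Hamming) dissimilarity that k-modes commonly uses; the abstract does not state which distance the authors used, the function names are hypothetical, and the four records are made-up toy data rather than the study's data.

```python
from itertools import combinations


def hamming(a, b):
    """Simple matching dissimilarity: number of attributes on which two records differ."""
    return sum(x != y for x, y in zip(a, b))


def dunn_index(records, labels):
    """Dunn index: smallest between-cluster distance / largest within-cluster distance."""
    min_between = float("inf")
    max_within = 0
    for (a, label_a), (b, label_b) in combinations(zip(records, labels), 2):
        d = hamming(a, b)
        if label_a == label_b:
            max_within = max(max_within, d)
        else:
            min_between = min(min_between, d)
    if max_within == 0 or min_between == float("inf"):
        raise ValueError("Dunn index is undefined for this partition")
    return min_between / max_within


# Toy records (assumed attributes: sex, age group, destination); not real data.
records = [
    ("M", "20+", "A"),
    ("F", "20+", "A"),
    ("M", "20+", "B"),
    ("M", "<20", "B"),
]

# Two hand-written candidate partitions; a higher Dunn index is better.
partition_a = [0, 0, 1, 1]
partition_b = [0, 1, 0, 1]
dunn_a = dunn_index(records, partition_a)
dunn_b = dunn_index(records, partition_b)
```

In a full workflow the candidate partitions would come from running k-modes for each k, and the k whose partition has the largest Dunn index would be kept.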


2018 ◽  
Vol 15 (2) ◽  
Author(s):  
Zdeněk Šulc ◽  
Jana Cibulková ◽  
Jiří Procházka ◽  
Hana Řezanková

The paper compares 11 internal evaluation criteria for hierarchical clustering of categorical data with respect to how well they determine the correct number of clusters. The criteria are divided into three groups according to how they treat cluster quality. The variability-based criteria use the within-cluster variability, the likelihood-based criteria maximize the likelihood function, and the distance-based criteria use distances within and between clusters. The aim is to determine which evaluation criteria perform well and under what conditions. Different analysis settings, such as the hierarchical clustering method used, and various dataset properties, such as the number of variables or the minimal between-cluster distances, are examined. The experiment is conducted on 810 generated datasets, where the evaluation criteria are assessed on their determination of the optimal number of clusters and on mean absolute errors. The results indicate that the likelihood-based BIC1 and variability-based BK criteria perform relatively well in determining the optimal number of clusters and that some criteria, usually the distance-based ones, should be avoided.


2020 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Chih-Hao Wen ◽  
Yuh-Chuan Shih

Purpose: Combining the human body variables collected by a 3D body scanner with research results from medical computed tomography (CT) imaging, this research aims to develop a military bulletproof vest that is both protective and well fitting. In particular, the protective part must be able to cover the vital human internal organs completely. The results of this research help to make military bulletproof vests of different sizes for Taiwanese male and female soldiers. At the same time, the research results can provide a reference for the industry of making special-purpose clothing.
Design/methodology/approach: 17 important human body variables of 988 participants (male: 716, 72.5%; female: 272, 27.5%) are used for the analysis. The K-means algorithm first builds clusters of different body shapes for both sexes; the silhouette coefficient helps to determine the optimal number of clusters to be six. Thus, the standard size of the bulletproof vest for soldiers is determined. The specifications of the bulletproof vest's inner core and textile vest are calculated for the users in each cluster. Our research then makes twelve prototypes of the bulletproof vest. After that, 12 subjects are invited to try on the new version (the vest designed in this study) and the old version (the vest currently used) to contrast the differences between the two.
Findings: According to the silhouette coefficient, the optimal number of clusters is determined to be six for both male and female clusters. Therefore, this study has designed six sizes of the bulletproof vest for male and female soldiers in Taiwan. After trying the new and old vests on, the subjects all indicate that the new vest fits better than the old one. In addition, the coverage of the bulletproof vest designed in this study is 94.38% for male users and 92.75% for female users.
Originality/value: The design of bulletproof vests must take note of the fit of the clothing itself and its protective function. Unlike size design for general clothing, which focuses only on the exterior body shape, the bulletproof vest also needs to pay attention to the relative positions of vital organs inside the human body. Besides, for practical applications, it is quite effective to use the silhouette coefficient to determine the results of cluster analysis. Thus, the value of this research lies in the cross-field combination, enabling the integration of body measurement, data science and clothing design. Generally, bulletproof vests of newly designed sizes can meet the requirements of Taiwan's military. The research results can be used in the development of various military clothing for Taiwanese military personnel. At the same time, the results can be provided to the clothing industry as relevant parameters for designing unique functional clothing.
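To make the selection step in this abstract concrete, here is a minimal sketch of choosing the number of clusters by mean silhouette coefficient. It assumes Euclidean distance and skips the K-means step: the candidate labelings are written by hand, the points are made-up two-dimensional toy values rather than the study's 17 body variables, and the function name is hypothetical.

```python
import math


def mean_silhouette(points, labels):
    """Average of s(i) = (b - a) / max(a, b) over all points."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")
    total = 0.0
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # by convention s(i) = 0 for a singleton cluster
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(point, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(point, q) for q in other) / len(other)
            for key, other in clusters.items()
            if key != label
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


# Toy points forming three visibly separate groups (not real measurements).
points = [(0, 0), (0, 1), (10, 0), (10, 1), (20, 0), (20, 1)]

# Hand-written labelings standing in for K-means output at k = 2 and k = 3.
candidates = {
    2: [0, 0, 0, 0, 1, 1],
    3: [0, 0, 1, 1, 2, 2],
}
scores = {k: mean_silhouette(points, labels) for k, labels in candidates.items()}
best_k = max(scores, key=scores.get)
```

The k with the highest mean silhouette is kept; in the study this procedure is reported to have selected six clusters for each sex.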


2018 ◽  
Vol 14 (1) ◽  
pp. 11-23 ◽  
Author(s):  
Lin Zhang ◽  
Yanling He ◽  
Huaizhi Wang ◽  
Hui Liu ◽  
Yufei Huang ◽  
...  

Background: RNA methylome has been discovered as an important layer of gene regulation and can be profiled directly with count-based measurements from high-throughput sequencing data. Although the detailed regulatory circuit of the epitranscriptome remains uncharted, clustering effect in methylation status among different RNA methylation sites can be identified from transcriptome-wide RNA methylation profiles and may reflect the epitranscriptomic regulation. Count-based RNA methylation sequencing data has unique features, such as low reads coverage, which calls for novel clustering approaches.
Objective: Besides the low reads coverage, it is also necessary to keep the integer property to approach clustering analysis of count-based RNA methylation sequencing data.
Method: We proposed a nonparametric generative model together with its Gibbs sampling solution for clustering analysis. The proposed approach implements a beta-binomial mixture model to capture the clustering effect in methylation level with the original count-based measurements rather than an estimated continuous methylation level. Besides, it adopts a nonparametric Dirichlet process to automatically determine an optimal number of clusters so as to avoid the common model selection problem in clustering analysis.
Results: When tested on the simulated system, the method demonstrated improved clustering performance over hierarchical clustering, K-means, MClust, NMF and EMclust. It also revealed on real dataset two novel RNA N6-methyladenosine (m6A) co-methylation patterns that may be induced directly by METTL14 and WTAP, which are two known regulatory components of the RNA m6A methyltransferase complex.
Conclusion: Our proposed DPBBM method not only properly handles the count-based measurements of RNA methylation data from sites of very low reads coverage, but also learns an optimal number of clusters adaptively from the data analyzed.
Availability: The source code and documents of DPBBM R package are freely available through the Comprehensive R Archive Network (CRAN): https://cran.r-project.org/web/packages/DPBBM/.
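For orientation, the standard beta-binomial likelihood that a mixture of this kind builds on can be written as below. The notation is assumed here and is not taken from the paper: k_i is the methylated read count and n_i the total read count at site i, and alpha_c, beta_c are the parameters of cluster c. The paper's exact parameterization may differ.

```latex
P(k_i \mid n_i, \alpha_c, \beta_c)
  = \binom{n_i}{k_i}
    \frac{B(k_i + \alpha_c,\; n_i - k_i + \beta_c)}{B(\alpha_c, \beta_c)},
\qquad
B(x, y) = \frac{\Gamma(x)\,\Gamma(y)}{\Gamma(x + y)}
```

Because the likelihood is defined directly on the integer counts, sites with very few reads contribute less sharply peaked evidence than sites with many reads, which is the property the abstract highlights.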

