Estimating the Optimal Number of Clusters in Categorical Data Clustering by Silhouette Coefficient

In this paper, a multi-objective genetic algorithm for data clustering based on the robust fuzzy least trimmed squares estimator is presented. The proposed clustering methodology addresses two critical issues in unsupervised data clustering: the ability to produce meaningful partitions in noisy data, and the requirement that the number of clusters be known a priori. The multi-objective genetic algorithm-driven clustering technique optimizes the number of clusters as well as the cluster assignments and cluster prototypes. A two-parameter, mapped, fixed-point coding scheme is used to represent the assignment of data to the true retained set and the noisy trimmed set, and the optimal number of clusters in the retained set. A three-objective criterion is used as the minimization functional for the multi-objective genetic algorithm. Results on well-known data sets from the literature suggest that the proposed methodology is superior to conventional fuzzy clustering algorithms that assume a known value for the optimal number of clusters.

Understanding and Enhancement of Internal Clustering Validation Indexes for Categorical Data

Algorithms ◽

10.3390/a11110177 ◽

2018 ◽

Vol 11 (11) ◽

pp. 177 ◽

Cited By ~ 2

Author(s):

Xuedong Gao ◽

Minghan Yang

Keyword(s):

Machine Learning ◽

Categorical Data ◽

Data Clustering ◽

Information Gain ◽

Clustering Algorithms ◽

Number Of Clusters ◽

Cluster Compactness ◽

Clustering Validation ◽

Categorical Data Clustering

Clustering is one of the main tasks of machine learning. Internal clustering validation indexes (CVIs) are used to measure the quality of several clustered partitions to determine the local optimal clustering results in an unsupervised manner, and can act as the objective function of clustering algorithms. In this paper, we first studied several well-known internal CVIs for categorical data clustering, and proved the ineffectiveness of evaluating the partitions of different numbers of clusters without any inter-cluster separation measures or assumptions; the accurateness of separation, along with its coordination with the intra-cluster compactness measures, can notably affect performance. Then, aiming to enhance the internal clustering validation measurement, we proposed a new internal CVI—clustering utility based on the averaged information gain of isolating each cluster (CUBAGE)—which measures both the compactness and the separation of the partition. The experimental results supported our findings with regard to the existing internal CVIs, and showed that the proposed CUBAGE outperforms other internal CVIs with or without a pre-known number of clusters.

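Application of k-Modes Clustering with Dunn Index Validation to Grouping the Characteristics of Prospective TKI Using R-GUI [PENERAPAN k-MODES CLUSTERING DENGAN VALIDASI DUNN INDEX PADA PENGELOMPOKAN KARAKTERISTIK CALON TKI MENGGUNAKAN R-GUI]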

Jurnal Gaussian ◽

10.14710/j.gauss.v10i3.32790 ◽

2021 ◽

Vol 10 (3) ◽

pp. 359-366

Author(s):

Hanik Malikhatin ◽

Agus Rusgiyono ◽

Di Asih I Maruddani

Keyword(s):

User Interface ◽

Graphical User Interface ◽

Categorical Data ◽

Optimal Number ◽

Class I ◽

Number Of Clusters ◽

The Government ◽

Index Value ◽

Cluster 2 ◽

Optimal Number Of Clusters

Prospective Indonesian migrant workers (TKI) who apply for passports at the Immigration Office Class I Non TPI Pati have different destination countries and choose different PPTKIS agencies. Grouping prospective TKI by their characteristics is therefore needed as a reference for the government in its efforts to improve the protection of TKI in destination countries and to supervise the PPTKIS that manage TKI more strictly. The purpose of this research is to group prospective TKI workers by their characteristics using the optimal number of clusters. The method used is k-Modes clustering with k = 2, 3, 4, and 5; this method can cluster categorical data. The optimal number of clusters is determined using the Dunn Index. To make grouping the data easier, a Graphical User Interface (GUI) application was built with RStudio. Based on the analysis, the optimal number of clusters is two, with a Dunn Index value of 0.4. Cluster 1 consists mostly of male TKI workers (51.04%), aged ≥ 20 years (91.93%), with Malaysia as the destination country (47%), choosing PPTKIS Surya Jaya Utama Abadi (37.51%), while cluster 2 consists mostly of male TKI workers (94.10%), aged ≥ 20 years (82.31%), with South Korea as the destination country (77.95%), choosing PPTKIS BNP2TKI (99.78%).
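The selection step described in this abstract, scoring candidate partitions of categorical data with the Dunn Index and keeping the best one, can be sketched as follows. This is a minimal illustration and not the authors' R implementation: it assumes the simple matching (Hamming) dissimilarity commonly paired with k-Modes, takes the partitions as given instead of running k-Modes, and uses made-up toy records and labels. The names `hamming`, `dunn_index`, `records`, and `candidate_partitions` are hypothetical.

```python
from itertools import combinations


def hamming(a, b):
    """Simple matching dissimilarity: count of attributes on which two records differ."""
    return sum(x != y for x, y in zip(a, b))


def dunn_index(records, labels):
    """Smallest between-cluster distance divided by the largest within-cluster diameter.

    Expects at least two clusters.
    """
    clusters = {}
    for rec, lab in zip(records, labels):
        clusters.setdefault(lab, []).append(rec)
    # Largest distance between two records that share a cluster.
    diameter = max(
        (hamming(a, b)
         for members in clusters.values()
         for a, b in combinations(members, 2)),
        default=0,
    )
    # Smallest distance between two records in different clusters.
    separation = min(
        hamming(a, b)
        for c1, c2 in combinations(clusters.values(), 2)
        for a in c1
        for b in c2
    )
    return separation / diameter if diameter else float("inf")


# Toy categorical records (illustrative only, not data from the paper).
records = [
    ("a", "x", "p"), ("a", "x", "q"), ("a", "y", "p"),
    ("b", "z", "r"), ("b", "z", "s"), ("c", "z", "r"),
]
# Hypothetical partitions, standing in for k-Modes output at k = 2 and k = 3.
candidate_partitions = {
    2: [0, 0, 0, 1, 1, 1],
    3: [0, 0, 1, 2, 2, 2],
}
scores = {k: dunn_index(records, labels) for k, labels in candidate_partitions.items()}
best_k = max(scores, key=scores.get)  # a higher Dunn Index means a better partition
```

A higher Dunn Index indicates compact, well-separated clusters, so the candidate k with the largest value is kept.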

Internal evaluation criteria for categorical data in hierarchical clustering

Advances in Methodology and Statistics ◽

10.51936/lxut1974 ◽

2018 ◽

Vol 15 (2) ◽

Author(s):

Zdeněk Šulc ◽

Jana Cibulková ◽

Jiří Procházka ◽

Hana Řezanková

Keyword(s):

Hierarchical Clustering ◽

Categorical Data ◽

Likelihood Function ◽

Evaluation Criteria ◽

Optimal Number ◽

Number Of Clusters ◽

Internal Evaluation ◽

Correct Number ◽

Cluster Quality ◽

Optimal Number Of Clusters

The paper compares 11 internal evaluation criteria for hierarchical clustering of categorical data with respect to how well they determine the correct number of clusters. The criteria are divided into three groups according to how they treat cluster quality. The variability-based criteria use the within-cluster variability, the likelihood-based criteria maximize the likelihood function, and the distance-based criteria use distances within and between clusters. The aim is to determine which evaluation criteria perform well and under what conditions. Different analysis settings, such as the hierarchical clustering method used, and various dataset properties, such as the number of variables or the minimal between-cluster distances, are examined. The experiment is conducted on 810 generated datasets, where the evaluation criteria are assessed on their determination of the optimal number of clusters and on mean absolute errors. The results indicate that the likelihood-based BIC1 and variability-based BK criteria perform relatively well in determining the optimal number of clusters and that some criteria, usually the distance-based ones, should be avoided.

A Dimensionality reduced Text data clustering with prediction of optimal number of clusters

International Journal of Applied Research on Information Technology and Computing ◽

10.5958/j.0975-8070.2.2.010 ◽

2011 ◽

Vol 2 (2) ◽

pp. 41 ◽

Cited By ~ 3

Author(s):

M. Ramakrishna Murty ◽

JVR Murthy ◽

Prasad Reddy ◽

Suresh Chandra Satapathy

Keyword(s):

Data Clustering ◽

Optimal Number ◽

Text Data ◽

Number Of Clusters ◽

Optimal Number Of Clusters

Designing new sizing bulletproof vests for Taiwanese soldiers

International Journal of Clothing Science and Technology ◽

10.1108/ijcst-09-2019-0150 ◽

2020 ◽

Vol ahead-of-print (ahead-of-print) ◽

Author(s):

Chih-Hao Wen ◽

Yuh-Chuan Shih

Keyword(s):

Human Body ◽

Optimal Number ◽

Number Of Clusters ◽

Male And Female ◽

Content Type ◽

Research Results ◽

Silhouette Coefficient ◽

Bulletproof Vest ◽

Female Soldiers ◽

Optimal Number Of Clusters

Purpose: Combining human body variables collected by a 3D body scanner with research results from medical computed tomography (CT) imaging, this research aims to develop a military bulletproof vest that is both protective and well fitting. In particular, the protective part must completely cover the vital internal organs. The results of this research help to make military bulletproof vests of different sizes for Taiwanese male and female soldiers. At the same time, the research results can provide a reference for the special-purpose clothing industry. Design/methodology/approach: 17 important human body variables of 988 participants (male: 716, 72.5%; female: 272, 27.5%) are used for the analysis. The K-means algorithm first builds clusters of different body shapes for both sexes; the silhouette coefficient helps to determine the optimal number of clusters to be six. Thus, the standard sizes of the bulletproof vest for soldiers are determined. The specifications of the bulletproof vest's inner core and textile vest are calculated for the users in each cluster. Twelve prototypes of the bulletproof vest are then made. After that, 12 subjects are invited to try on the new version (the vest designed in this study) and the old version (the vest currently used) to compare the two. Findings: According to the silhouette coefficient, the optimal number of clusters is six for both male and female soldiers. Therefore, this study has designed six sizes of the bulletproof vest for male and female soldiers in Taiwan. After trying on the new and old vests, the subjects all indicate that the new vest fits better than the old one. In addition, the coverage of the bulletproof vest designed in this study is 94.38% for male users and 92.75% for female users. Originality/value: The design of bulletproof vests must consider both the fit of the clothing itself and its protective function. Unlike the sizing of general clothing, which considers only the external shape of the body, the design of a bulletproof vest must also take into account the relative positions of vital organs inside the body. In addition, for practical applications, the silhouette coefficient is quite effective for evaluating the results of cluster analysis. Thus, the value of this research lies in its cross-field combination, which integrates body measurement, data science and clothing design. Generally, bulletproof vests of the newly designed sizes can meet the requirements of Taiwan's military. The research results can be used in the development of various military clothing for Taiwanese military personnel. At the same time, the results can be provided to the clothing industry as relevant parameters for designing unique functional clothing.
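To illustrate how a silhouette coefficient can rank candidate numbers of clusters, the sketch below computes the mean silhouette width for two hand-made partitions of toy 2D points and keeps the better one. It is an illustration under stated assumptions, not the study's procedure: the points and labels are invented, Euclidean distance is assumed, the partitions are given directly instead of coming from K-means, and the names `silhouette_score`, `points`, and `candidates` are hypothetical. At least two clusters are required.

```python
from math import dist


def silhouette_score(points, labels):
    """Mean silhouette width s(i) = (b - a) / max(a, b) over all points."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    widths = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            widths.append(0.0)  # common convention: a singleton cluster scores 0
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the point's zero distance to itself is included in the sum, not the count).
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(
            sum(dist(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        widths.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(widths) / len(widths)


# Toy points forming two well-separated groups (illustrative only).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
# Hypothetical partitions, standing in for K-means output at k = 2 and k = 3.
candidates = {
    2: [0, 0, 0, 1, 1, 1],
    3: [0, 0, 1, 2, 2, 2],
}
scores = {k: silhouette_score(points, labels) for k, labels in candidates.items()}
best_k = max(scores, key=scores.get)  # the largest mean silhouette wins
```

Values near 1 indicate points that sit much closer to their own cluster than to any other, so the k with the highest mean silhouette is chosen.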

Method for determining optimal number of clusters in K-means clustering algorithm

Journal of Computer Applications ◽

10.3724/sp.j.1087.2010.01995 ◽

2010 ◽

Vol 30 (8) ◽

pp. 1995-1998 ◽

Cited By ~ 18

Author(s):

Shi-bing ZHOU ◽

Zhen-yuan XU ◽

Xu-qing TANG

Keyword(s):

Clustering Algorithm ◽

Optimal Number ◽

Number Of Clusters ◽

Optimal Number Of Clusters

Clustering Count-based RNA Methylation Data Using a Nonparametric Generative Model

Current Bioinformatics ◽

10.2174/1574893613666180601080008 ◽

2018 ◽

Vol 14 (1) ◽

pp. 11-23 ◽

Cited By ~ 3

Author(s):

Lin Zhang ◽

Yanling He ◽

Huaizhi Wang ◽

Hui Liu ◽

Yufei Huang ◽

...

Keyword(s):

Clustering Analysis ◽

Methylation Level ◽

Optimal Number ◽

Generative Model ◽

Methylation Data ◽

Sequencing Data ◽

Number Of Clusters ◽

Rna Methylation ◽

Clustering Effect ◽

Optimal Number Of Clusters

Background: The RNA methylome has been discovered as an important layer of gene regulation and can be profiled directly with count-based measurements from high-throughput sequencing data. Although the detailed regulatory circuit of the epitranscriptome remains uncharted, a clustering effect in methylation status among different RNA methylation sites can be identified from transcriptome-wide RNA methylation profiles and may reflect epitranscriptomic regulation. Count-based RNA methylation sequencing data have unique features, such as low read coverage, which call for novel clustering approaches. Objective: Besides handling the low read coverage, clustering analysis of count-based RNA methylation sequencing data must also preserve the integer property of the counts. Method: We proposed a nonparametric generative model together with its Gibbs sampling solution for clustering analysis. The proposed approach implements a beta-binomial mixture model to capture the clustering effect in methylation level with the original count-based measurements rather than an estimated continuous methylation level. In addition, it adopts a nonparametric Dirichlet process to automatically determine an optimal number of clusters, avoiding the common model selection problem in clustering analysis. Results: When tested on the simulated system, the method demonstrated improved clustering performance over hierarchical clustering, K-means, MClust, NMF and EMclust. On a real dataset, it also revealed two novel RNA N6-methyladenosine (m6A) co-methylation patterns that may be induced directly by METTL14 and WTAP, two known regulatory components of the RNA m6A methyltransferase complex. Conclusion: Our proposed DPBBM method not only properly handles the count-based measurements of RNA methylation data from sites of very low read coverage, but also learns an optimal number of clusters adaptively from the data analyzed.
Availability: The source code and documents of DPBBM R package are freely available through the Comprehensive R Archive Network (CRAN): https://cran.r-project.org/web/packages/DPBBM/.
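As a rough illustration of why a beta-binomial component suits count data, the sketch below evaluates the standard beta-binomial likelihood of observing k methylated reads out of n total reads and compares two components for one low-coverage site. It covers only the per-site likelihood; the Dirichlet process prior and the Gibbs sampler of DPBBM are not shown, and this is not the DPBBM package's API. The function names, the component parameters, and the read counts are all hypothetical.

```python
from math import exp, lgamma


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def beta_binomial_logpmf(k, n, alpha, beta):
    """log P(k | n, alpha, beta) = log C(n, k) + log B(k + alpha, n - k + beta) - log B(alpha, beta)."""
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return log_choose + log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta)


# Two hypothetical mixture components: low and high methylation level.
components = {"low": (2.0, 8.0), "high": (8.0, 2.0)}

# A hypothetical low-coverage site: 1 methylated read out of 5 reads in total.
k, n = 1, 5
log_likelihoods = {
    name: beta_binomial_logpmf(k, n, a, b) for name, (a, b) in components.items()
}
most_likely = max(log_likelihoods, key=log_likelihoods.get)

# Sanity check: the probabilities over k = 0..n sum to 1 for a fixed component.
total = sum(exp(beta_binomial_logpmf(j, n, 2.0, 8.0)) for j in range(n + 1))
```

Because the likelihood is defined on the integer counts themselves, a site with only a few reads contributes its real uncertainty instead of a noisy estimated methylation ratio.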
