A new cluster validity index using maximum cluster spread based compactness measure

Purpose – The most commonly used approaches for cluster validation are based on indices, but the majority of existing cluster validity indices do not work well on data sets of different complexities. The purpose of this paper is to propose a new cluster validity index (ARSD index) that works well on all types of data sets. Design/methodology/approach – The authors introduce a new compactness measure that depicts the typical behaviour of a cluster, where more points are located around the centre and fewer points towards the outer edge of the cluster. A novel penalty function is proposed for determining the distinctness measure of clusters. A random linear search algorithm is employed to evaluate and compare the performance of five commonly known validity indices and the proposed validity index. The values of the six indices are computed for every number of clusters nc in the range [nc_min, nc_max] to obtain the optimal number of clusters present in a data set. The data sets used in the experiments include shaped, Gaussian-like and real data sets. Findings – Through an extensive experimental study, the proposed validity index is found to be more consistent and reliable in indicating the correct number of clusters than the other validity indices. This is experimentally demonstrated on 11 data sets, where the proposed index achieved better results. Originality/value – The originality of the paper lies in proposing a novel cluster validity index for determining the optimal number of clusters present in data sets of different complexities.
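The scan over nc described in this abstract can be illustrated with a minimal sketch. The ARSD index is not defined here, so the mean silhouette width is used as a stand-in validity index, and a small deterministic k-means (farthest-first initialisation) replaces the authors' search algorithm. The names `farthest_first`, `kmeans`, `silhouette` and `select_k` are hypothetical, and the sketch assumes distinct points and no empty clusters.

```python
import math


def farthest_first(points, k):
    """Deterministic initialisation: start at points[0], then repeatedly
    add the point farthest from its nearest already-chosen centre."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        )
    return centers


def kmeans(points, k, iters=50):
    """Plain Lloyd iterations; returns one cluster label per point."""
    centers = farthest_first(points, k)
    labels = None
    for _ in range(iters):
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # keep the old centre if a cluster is empty
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels


def silhouette(points, labels):
    """Mean silhouette width (stand-in index, NOT the ARSD index).
    Points in singleton clusters contribute 0."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    total = 0.0
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            continue
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for m, other in clusters.items()
            if m != l
        )
        total += (b - a) / max(a, b)
    return total / len(points)


def select_k(points, nc_min, nc_max):
    """Evaluate the index for every nc in [nc_min, nc_max]; larger is better."""
    scores = {k: silhouette(points, kmeans(points, k))
              for k in range(nc_min, nc_max + 1)}
    return max(scores, key=scores.get), scores


# Three well-separated groups of four points each (illustrative data).
square = [(0, 0), (1, 0), (0, 1), (1, 1)]
points = [(x + dx, y + dy) for dx, dy in [(0, 0), (10, 0), (0, 10)]
          for x, y in square]
best_k, scores = select_k(points, 2, 5)
print(best_k, scores)
```

Any index that is maximised (or, with `min`, minimised) at the preferred partition could be dropped in place of `silhouette`; the scan itself does not depend on the choice.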

A classification approach based on variable precision rough sets and cluster validity index function

Engineering Computations ◽

10.1108/ec-11-2012-0297 ◽

2014 ◽

Vol 31 (8) ◽

pp. 1778-1789

Author(s):

Hongkang Lin

Keyword(s):

Optimal Number ◽

Data Sets ◽

Cluster Validity ◽

Cluster Validity Index ◽

Index Method ◽

Data Set ◽

Content Type ◽

The Individual ◽

Variable Precision Rough Sets ◽

Optimal Number Of Clusters

Purpose – The clustering/classification method proposed in this study, designated as the PFV-index method, provides the means to solve the following problems for a data set characterized by imprecision and uncertainty: first, discretizing the continuous values of all the individual attributes within a data set; second, evaluating the optimality of the discretization results; third, determining the optimal number of clusters per attribute; and fourth, improving the classification accuracy (CA) of data sets characterized by uncertainty. The paper aims to discuss these issues. Design/methodology/approach – The proposed PFV-index method combines a particle swarm optimization algorithm, the fuzzy C-means method, variable precision rough sets theory, and a new cluster validity index function. Findings – The method clusters the values of the individual attributes within the data set and achieves both the optimal number of clusters and the optimal CA. Originality/value – The validity of the proposed approach is investigated by comparing the classification results obtained for UCI data sets with those obtained by the supervised BPNN and decision-tree classification methods.

Fast Search Algorithm for Determining the Optimal Number of Clusters using Cluster Validity Index

The Journal of the Korea Contents Association ◽

10.5392/jkca.2009.9.9.080 ◽

2009 ◽

Vol 9 (9) ◽

pp. 80-89 ◽

Cited By ~ 1

Author(s):

Sang-Wook Lee

Keyword(s):

Search Algorithm ◽

Optimal Number ◽

Cluster Validity ◽

Cluster Validity Index ◽

Validity Index ◽

Number Of Clusters ◽

Fast Search ◽

Fast Search Algorithm ◽

Optimal Number Of Clusters

Enhanced cluster validity index for the evaluation of optimal number of clusters for Fuzzy C-Means algorithm

2014 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE) ◽

10.1109/fuzz-ieee.2014.6891591 ◽

2014 ◽

Cited By ~ 10

Author(s):

Neha Bharill ◽

Aruna Tiwari

Keyword(s):

Optimal Number ◽

Cluster Validity ◽

Cluster Validity Index ◽

Validity Index ◽

Number Of Clusters ◽

Fuzzy C Means ◽

Fuzzy C Means Algorithm ◽

Optimal Number Of Clusters

A Validity Index for Fuzzy Clustering Based on Bipartite Modularity

Journal of Electrical and Computer Engineering ◽

10.1155/2019/2719617 ◽

2019 ◽

Vol 2019 ◽

pp. 1-9 ◽

Cited By ~ 1

Author(s):

Yongli Liu ◽

Xiaoyang Zhang ◽

Jingli Chen ◽

Hao Chao

Keyword(s):

Fuzzy Clustering ◽

Optimal Number ◽

Experimental Results ◽

Validity Index ◽

Number Of Clusters ◽

Validity Indices ◽

Noise Data ◽

Clustering Validity ◽

Optimal Number Of Clusters

Because traditional fuzzy clustering validity indices need the number of clusters to be specified and are sensitive to noisy data, we propose a validity index for fuzzy clustering, named CSBM (compactness separateness bipartite modularity), based on bipartite modularity. CSBM enhances robustness by combining intraclass compactness and interclass separateness, and can automatically determine the optimal number of clusters. To evaluate the performance of CSBM, we carried out experiments on six real datasets and compared CSBM with six other prominent indices. Experimental results show that the CSBM index performs best in terms of robustness while accurately detecting the number of clusters.

A novel fuzzy clustering approach to regionalise watersheds with an automatic determination of optimal number of clusters

Journal of Hydrology and Hydromechanics ◽

10.1515/johh-2017-0024 ◽

2017 ◽

Vol 65 (4) ◽

pp. 359-365 ◽

Cited By ~ 1

Author(s):

Javier Senent-Aparicio ◽

Jesús Soto ◽

Julio Pérez-Sánchez ◽

Jorge Garrido

Keyword(s):

Frequency Analysis ◽

Fuzzy Clustering ◽

Optimal Number ◽

Regional Frequency Analysis ◽

Cluster Validity ◽

Number Of Clusters ◽

Cluster Validity Indices ◽

Validity Indices ◽

Homogeneous Regions ◽

Optimal Number Of Clusters

One of the most important problems faced in hydrology is the estimation of flood magnitudes and frequencies in ungauged basins. Hydrological regionalisation is used to transfer information from gauged watersheds to ungauged watersheds. However, to obtain reliable results, the watersheds involved must have similar hydrological behaviour. In this study, two different clustering approaches are used and compared to identify hydrologically homogeneous regions. The Fuzzy C-Means algorithm (FCM), which is widely used for regionalisation studies, requires cluster validity indices to be calculated in order to determine the optimal number of clusters. The Fuzzy Minimals algorithm (FM) has an advantage over other fuzzy clustering algorithms: it does not need the number of clusters to be known a priori, so cluster validity indices are not used. A regional homogeneity test based on the L-moments approach is used to check the homogeneity of the regions identified by both cluster analysis approaches. The validation of the FM algorithm in deriving homogeneous regions for flood frequency analysis is illustrated through its application to data from the watersheds in Alto Genil (South Spain). According to the results, the FM algorithm is recommended for identifying hydrologically homogeneous regions for regional frequency analysis.

Cluster Validity Index to Determine the Optimal Number Clusters of Fuzzy Clustering for Classify Customer Buying Behavior

Journal of Development Research ◽

10.28926/jdr.v5i1.134 ◽

2021 ◽

Vol 5 (1) ◽

pp. 7-12

Author(s):

Salnan Ratih Asrriningtias

Keyword(s):

Fuzzy Clustering ◽

Optimal Number ◽

Buying Behavior ◽

Cluster Validity ◽

Cluster Validity Index ◽

Validity Index ◽

Number Of Clusters ◽

Best Value ◽

Fuzzy Clustering Method ◽

The Right

One strategy for competing among Batik MSMEs is to look at the characteristics of the customer. To make the characteristics of customer buying behavior easier to see, customers need to be classified by similarity of characteristics using fuzzy clustering. One of the parameters that must be set at the start of the fuzzy clustering method is the number of clusters. Increasing the number of clusters does not guarantee the best performance, but the right number of clusters greatly affects the performance of fuzzy clustering. To obtain the optimal number of clusters, the clustering result for each candidate number of clusters can be measured using a cluster validity index. Among several types of cluster validity index, NPC gives the best value. The optimal number of clusters obtained by the validity index is 2, and this number of clusters gives a classification result with a small variance value.
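As a rough illustration of measuring a fuzzy partition with a validity index, the sketch below computes Bezdek's partition coefficient and a normalised variant that rescales it to the range [0, 1]. The abstract does not spell out what NPC stands for, so treating it as a normalised partition coefficient is an assumption; the function names and the membership matrices are hypothetical.

```python
def partition_coefficient(u):
    """Bezdek's partition coefficient: mean over points of the sum of
    squared memberships. u[i][k] is the membership of point i in cluster k,
    and each row is assumed to sum to 1."""
    return sum(m * m for row in u for m in row) / len(u)


def normalised_partition_coefficient(u):
    """Rescaled coefficient (c * PC - 1) / (c - 1): 0 for a completely
    fuzzy partition, 1 for a crisp one. Assumes c >= 2 clusters."""
    c = len(u[0])
    return (c * partition_coefficient(u) - 1) / (c - 1)


# Hypothetical membership matrices for four points and two clusters.
crisp = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
fuzzy = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]

print(normalised_partition_coefficient(crisp))
print(normalised_partition_coefficient(fuzzy))
```

In a scan over candidate cluster counts, the membership matrix returned by fuzzy C-means for each count would be scored this way and the count with the largest value retained.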

Role of Cluster Validity Indices in Delineation of Precipitation Regions

Water ◽

10.3390/w12051372 ◽

2020 ◽

Vol 12 (5) ◽

pp. 1372

Author(s):

Nikhil Bhatia ◽

Jency M. Sojan ◽

Slobodon Simonovic ◽

Roshan Srivastav

Keyword(s):

Clustering Algorithm ◽

Clustering Algorithms ◽

Optimal Number ◽

Ratio Test ◽

Cluster Validity ◽

Number Of Clusters ◽

Cluster Validity Indices ◽

Validity Indices ◽

Point Data ◽

Optimal Number Of Clusters

The delineation of precipitation regions aims to identify homogeneous zones in which the characteristics of the process are statistically similar. The regionalization process has three main components: (i) delineation of regions using clustering algorithms, (ii) determination of the optimal number of regions using cluster validity indices (CVIs), and (iii) validation of regions for homogeneity using the L-moments ratio test. The identification of the optimal number of clusters significantly affects the homogeneity of the regions. The objective of this study is to investigate the performance of various CVIs in identifying the optimal number of clusters, which maximizes the homogeneity of the precipitation regions. The k-means clustering algorithm is adopted to delineate the regions using location-based attributes for two large areas of Canada, namely the Prairies and the Great Lakes-St Lawrence lowlands (GL-SL) region. The seasonal precipitation data for 55 years (1951–2005) are derived from high-resolution ANUSPLIN gridded point data for Canada. The results indicate that the optimal number of clusters and the regional homogeneity depend on the CVI adopted. Among the 42 cluster indices considered, 15 outperform the others in identifying homogeneous precipitation regions. The Dunn, Det_ratio and Trace(W⁻¹B) indices are found to be the best for all seasons in both regions.
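For reference, the three indices named at the end of this abstract are commonly defined as follows. This is a sketch of the usual textbook forms, not necessarily the exact variants used in the study; the symbols (clusters C_k with centroids c_k and sizes n_k, overall mean x̄, and a Euclidean distance d) are introduced here as assumptions.

```latex
% Within-cluster, between-cluster and total scatter matrices
W = \sum_{k=1}^{K} \sum_{x \in C_k} (x - c_k)(x - c_k)^{\top}, \qquad
B = \sum_{k=1}^{K} n_k \,(c_k - \bar{x})(c_k - \bar{x})^{\top}, \qquad
T = W + B

% Determinant ratio and trace criterion (larger values favour a partition)
\mathrm{Det\_ratio} = \frac{\det(T)}{\det(W)}, \qquad
\mathrm{Trace}(W^{-1}B) = \operatorname{tr}\!\left(W^{-1}B\right)

% Dunn index: smallest between-cluster distance over largest cluster diameter
\mathrm{Dunn} =
\frac{\min_{k \neq l} \; \min_{x \in C_k,\; y \in C_l} d(x, y)}
     {\max_{k} \; \max_{x, y \in C_k} d(x, y)}
```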

Estimating the Optimal Number of Clusters Via Internal Validity Index

Neural Processing Letters ◽

10.1007/s11063-021-10427-8 ◽

2021 ◽

Author(s):

Shibing Zhou ◽

Fei Liu ◽

Wei Song

Keyword(s):

Internal Validity ◽

Optimal Number ◽

Validity Index ◽

Number Of Clusters ◽

Optimal Number Of Clusters

Investigating cluster validation metrics for optimal number of clusters determination

Intelligent Decision Technologies ◽

10.3233/idt-210187 ◽

2021 ◽

pp. 1-16

Author(s):

Aikaterini Karanikola ◽

Charalampos M. Liapis ◽

Sotiris Kotsiantis

Keyword(s):

Real World ◽

Optimal Number ◽

Cluster Validation ◽

Clustering Methods ◽

Number Of Clusters ◽

Validity Indices ◽

Selection Of ◽

Specific Distance ◽

Optimal Number Of Clusters

In short, clustering is the process of partitioning a given set of objects into groups containing highly related instances. This relation is determined by a specific distance metric with which the intra-cluster similarity is estimated. Finding an optimal number of such partitions is usually the key step in the entire process, yet a rather difficult one. Selecting an unsuitable number of clusters might lead to incorrect conclusions and, consequently, to wrong decisions: the term “optimal” is quite ambiguous. Furthermore, various inherent characteristics of the datasets, such as clusters that overlap or clusters containing subclusters, will most often increase the difficulty of the task. Thus, the methods used to detect similarities and the parameter selection of the partition algorithm have a major impact on the quality of the groups and the identification of their optimal number. Given that each dataset constitutes a rather distinct case, validity indices are indicators introduced to address the problem of selecting such an optimal number of clusters. In this work, an extensive set of well-known validity indices, based on the approach of the so-called relative criteria, is examined comparatively. A total of 26 cluster validation measures were investigated in two distinct case studies: one on real-world data and one on artificially generated data. To ensure a certain degree of difficulty, both the real-world and the generated data were selected to exhibit variations and inhomogeneity. Each index is deployed under 9 different clustering methods, which incorporate 5 different distance metrics. All results are presented in various explanatory forms.

Models for Internal Clustering Validation Indexes Based on Hadoop-MapReduce

International Journal of Distributed Systems and Technologies ◽

10.4018/ijdst.2020070103 ◽

2020 ◽

Vol 11 (3) ◽

pp. 42-67

Author(s):

Soumeya Zerabi ◽

Souham Meshoul ◽

Samia Chikhi Boucherkha

Keyword(s):

Clustering Algorithms ◽

Large Data ◽

Optimal Number ◽

Data Sets ◽

Data Set ◽

Number Of Clusters ◽

Distributed Models ◽

Hadoop Mapreduce ◽

Distributed Solutions ◽

Clustering Validation

Cluster validation aims both to evaluate the results of clustering algorithms and to predict the number of clusters. It is usually achieved using several indexes. Traditional internal clustering validation indexes (CVIs) are mainly based on computing pairwise distances, which gives the related algorithms quadratic complexity. The existing CVIs cannot handle large data sets properly and need to be revisited to take account of ever-increasing data set volumes. The design of parallel and distributed solutions that implement these indexes is therefore required. To address this issue, the authors propose two parallel and distributed models for internal CVIs, namely the Silhouette and Dunn indexes, using the MapReduce framework under Hadoop. The proposed models, termed MR_Silhouette and MR_Dunn, have been tested on both evaluating clustering results and identifying the optimal number of clusters. The results of the experimental study are very promising and show that the proposed parallel and distributed models perform the expected tasks successfully.
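The idea of splitting a pairwise-distance index into independent partial computations can be sketched in plain Python. This is not the authors' MR_Dunn model and does not use Hadoop: it only mimics the map and reduce steps in a single process, with hypothetical function names, to show why the Dunn index decomposes into a minimum and a maximum that can be combined across chunks.

```python
import math
from functools import reduce
from itertools import combinations


def map_pairs(chunk):
    """Map step: for one chunk of labelled point pairs, emit the partial
    (smallest between-cluster distance, largest within-cluster distance)."""
    min_between, max_within = math.inf, 0.0
    for (p, label_p), (q, label_q) in chunk:
        d = math.dist(p, q)
        if label_p == label_q:
            max_within = max(max_within, d)
        else:
            min_between = min(min_between, d)
    return min_between, max_within


def reduce_partials(a, b):
    """Reduce step: min and max are associative, so partial results
    can be merged in any order."""
    return min(a[0], b[0]), max(a[1], b[1])


def dunn_index(points, labels, n_chunks=3):
    """Dunn index from chunked pairwise distances. Assumes at least two
    clusters and at least one cluster with two distinct points."""
    pairs = list(combinations(zip(points, labels), 2))
    chunks = [pairs[i::n_chunks] for i in range(n_chunks)]
    min_between, max_within = reduce(reduce_partials, map(map_pairs, chunks))
    return min_between / max_within


# Two tight clusters on a line (illustrative data).
points = [(0, 0), (1, 0), (5, 0), (6, 0)]
labels = [0, 0, 1, 1]
print(dunn_index(points, labels))  # smallest gap 4.0 over largest diameter 1.0
```

The pair list itself is still quadratic in the number of points; a distributed version would generate pairs inside the workers rather than materialising them up front as this sketch does.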
