A New Algorithm for Fuzzy Clustering Handling Incomplete Dataset

One of the most difficult problems in cluster analysis is identifying the number of groups in a dataset, especially in the presence of missing values. Traditional clustering methods assume that the true number of clusters is known; in real-world applications, however, it is generally not known a priori. Moreover, most clustering methods were developed to analyse complete datasets and so cannot be applied to many practical problems that involve incomplete data. This paper first presents a fuzzy clustering algorithm called OCS-FSOM. The proposed algorithm is based on a neural network and uses the Optimal Completion Strategy to estimate missing values in incomplete datasets. We then propose an extension of the algorithm, a multi-level OCS-FSOM method, to tackle the problem of estimating the number of clusters. The new algorithm, called Multi-OCSFSOM, finds the optimal number of clusters by using a statistical criterion that measures the quality of the obtained partitions. Experiments on real-life datasets show very encouraging results in terms of exact determination of the optimal number of clusters.
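
The Optimal Completion Strategy mentioned above can be illustrated with a minimal sketch. It assumes plain fuzzy c-means rather than the paper's neural-network-based FSOM, which the abstract does not describe in detail. Missing entries are marked with `None` and treated as extra unknowns: after every membership and centre update, each one is re-estimated as the membership-weighted mean of the corresponding centre coordinates. The function name `ocs_fcm`, the toy data, and the initial centres are hypothetical.

```python
import math

def ocs_fcm(data, centres, m=2.0, iters=50):
    """Fuzzy c-means with an OCS-style imputation step; None marks a missing value."""
    n_feat = len(data[0])
    # Start each missing entry at the mean of the observed values of its feature.
    means = []
    for j in range(n_feat):
        seen = [row[j] for row in data if row[j] is not None]
        means.append(sum(seen) / len(seen))
    x = [[means[j] if row[j] is None else row[j] for j in range(n_feat)]
         for row in data]
    centres = [list(c) for c in centres]
    u = []
    for _ in range(iters):
        # Standard FCM membership update on the completed data.
        u = []
        for p in x:
            d = [max(math.dist(p, c), 1e-12) for c in centres]
            u.append([1.0 / sum((di / dj) ** (2 / (m - 1)) for dj in d)
                      for di in d])
        # Standard FCM centre update.
        for i in range(len(centres)):
            w = [u[k][i] ** m for k in range(len(x))]
            centres[i] = [sum(wk * p[j] for wk, p in zip(w, x)) / sum(w)
                          for j in range(n_feat)]
        # OCS step: re-estimate every missing entry from the current centres.
        for k, row in enumerate(data):
            for j in range(n_feat):
                if row[j] is None:
                    w = [u[k][i] ** m for i in range(len(centres))]
                    x[k][j] = sum(wi * c[j] for wi, c in zip(w, centres)) / sum(w)
    return x, centres, u

# Toy data (hypothetical): the third point has a missing second coordinate.
data = [[1.0, 1.0], [1.2, 0.8], [0.9, None],
        [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]
completed, centres, u = ocs_fcm(data, centres=[[0.0, 0.0], [6.0, 6.0]])
print(completed[2])
```

Observed values are never modified; only the entries that were `None` change between iterations.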

Investigating cluster validation metrics for optimal number of clusters determination

Intelligent Decision Technologies ◽

10.3233/idt-210187 ◽

2021 ◽

pp. 1-16

Author(s):

Aikaterini Karanikola ◽

Charalampos M. Liapis ◽

Sotiris Kotsiantis

Keyword(s):

Real World ◽

Optimal Number ◽

Cluster Validation ◽

Clustering Methods ◽

Number Of Clusters ◽

Validity Indices ◽

Selection Of ◽

Specific Distance ◽

Optimal Number Of Clusters

In short, clustering is the process of partitioning a given set of objects into groups containing highly related instances. This relation is determined by a specific distance metric with which the intra-cluster similarity is estimated. Finding an optimal number of such partitions is usually the key step in the entire process, yet a rather difficult one. Selecting an unsuitable number of clusters might lead to incorrect conclusions and, consequently, to wrong decisions, and the term “optimal” is itself quite ambiguous. Furthermore, inherent characteristics of the datasets, such as clusters that overlap or clusters containing subclusters, will most often make the task harder. Thus, the methods used to detect similarities and the parameter selection of the partition algorithm have a major impact on the quality of the groups and the identification of their optimal number. Given that each dataset constitutes a rather distinct case, validity indices are indicators introduced to address the problem of selecting such an optimal number of clusters. In this work, an extensive set of well-known validity indices, based on the approach of the so-called relative criteria, is examined comparatively. A total of 26 cluster validation measures were investigated in two distinct case studies: one on real-world data and one on artificially generated data. To ensure a certain degree of difficulty, both real-world and generated data were selected to exhibit variations and inhomogeneity. Each index is deployed under 9 different clustering methods, which incorporate 5 different distance metrics. All results are presented in various explanatory forms.
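
As a rough illustration of the relative-criteria idea, the sketch below scores two candidate hard partitions of the same points with the mean silhouette width and keeps the one with the higher score. The silhouette is used here only as a familiar example of a validity index; the abstract does not list which 26 measures were studied. The toy points and candidate labelings are hypothetical.

```python
import math

def silhouette(points, labels):
    """Mean silhouette width of a hard partition with at least two clusters."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # the silhouette of a singleton is defined as 0
        # a: mean distance to the other members of the same cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for key, other in clusters.items() if key != lab)
        total += (b - a) / max(a, b)
    return total / len(points)

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
# Candidate partitions for k = 2 and k = 3 (hypothetical labelings).
candidates = {2: [0, 0, 0, 1, 1, 1], 3: [0, 0, 2, 1, 1, 1]}
scores = {k: silhouette(points, labels) for k, labels in candidates.items()}
best_k = max(scores, key=scores.get)
print(best_k)
```

In practice the candidate labelings would come from running a clustering method once per value of k, and a different index may rank the same candidates differently.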

A Validity Index for Fuzzy Clustering Based on Bipartite Modularity

Journal of Electrical and Computer Engineering ◽

10.1155/2019/2719617 ◽

2019 ◽

Vol 2019 ◽

pp. 1-9 ◽

Cited By ~ 1

Author(s):

Yongli Liu ◽

Xiaoyang Zhang ◽

Jingli Chen ◽

Hao Chao

Keyword(s):

Fuzzy Clustering ◽

Optimal Number ◽

Experimental Results ◽

Validity Index ◽

Number Of Clusters ◽

Validity Indices ◽

Noise Data ◽

Clustering Validity ◽

Optimal Number Of Clusters

Because traditional fuzzy clustering validity indices need the number of clusters to be specified and are sensitive to noisy data, we propose a validity index for fuzzy clustering, named CSBM (compactness separateness bipartite modularity), based on bipartite modularity. CSBM enhances robustness by combining intraclass compactness and interclass separateness and can automatically determine the optimal number of clusters. To evaluate the performance of CSBM, we carried out experiments on six real datasets and compared CSBM with six other prominent indices. Experimental results show that the CSBM index performs best in terms of robustness while accurately detecting the number of clusters.

A novel fuzzy clustering approach to regionalise watersheds with an automatic determination of optimal number of clusters

Journal of Hydrology and Hydromechanics ◽

10.1515/johh-2017-0024 ◽

2017 ◽

Vol 65 (4) ◽

pp. 359-365 ◽

Cited By ~ 1

Author(s):

Javier Senent-Aparicio ◽

Jesús Soto ◽

Julio Pérez-Sánchez ◽

Jorge Garrido

Keyword(s):

Frequency Analysis ◽

Fuzzy Clustering ◽

Optimal Number ◽

Regional Frequency Analysis ◽

Cluster Validity ◽

Number Of Clusters ◽

Cluster Validity Indices ◽

Validity Indices ◽

Homogeneous Regions ◽

Optimal Number Of Clusters

One of the most important problems in hydrology is the estimation of flood magnitudes and frequencies in ungauged basins. Hydrological regionalisation is used to transfer information from gauged watersheds to ungauged watersheds. However, to obtain reliable results, the watersheds involved must have similar hydrological behaviour. In this study, two different clustering approaches are used and compared to identify hydrologically homogeneous regions. The Fuzzy C-Means algorithm (FCM), which is widely used in regionalisation studies, requires the calculation of cluster validity indices to determine the optimal number of clusters. The Fuzzy Minimals algorithm (FM) has an advantage over other fuzzy clustering algorithms: it does not need the number of clusters to be known a priori, so cluster validity indices are not used. A regional homogeneity test based on the L-moments approach is used to check the homogeneity of the regions identified by both cluster analysis approaches. The validation of the FM algorithm in deriving homogeneous regions for flood frequency analysis is illustrated through its application to data from the watersheds in Alto Genil (southern Spain). According to the results, the FM algorithm is recommended for identifying hydrologically homogeneous regions for regional frequency analysis.
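
To make concrete what computing a cluster validity index for FCM involves, here is a minimal sketch of two widely used fuzzy indices: Bezdek's partition coefficient and the Xie-Beni index. The abstract does not say which indices the study used, so these are illustrative choices. The membership matrix `u` is assumed to hold one row per data point and one column per cluster.

```python
import math

def partition_coefficient(u):
    """Mean of squared memberships: 1 for a crisp partition, 1/c when fully fuzzy."""
    return sum(v * v for row in u for v in row) / len(u)

def xie_beni(points, centres, u, m=2.0):
    """Compactness divided by separation (lower is better); needs >= 2 centres."""
    compact = sum(u[k][i] ** m * math.dist(p, c) ** 2
                  for k, p in enumerate(points)
                  for i, c in enumerate(centres))
    separation = min(math.dist(a, b) ** 2
                     for i, a in enumerate(centres)
                     for b in centres[i + 1:])
    return compact / (len(points) * separation)

# Toy example (hypothetical): two points, two centres, fairly crisp memberships.
points = [(0.0, 0.0), (4.0, 0.0)]
centres = [(0.5, 0.0), (3.5, 0.0)]
u = [[0.9, 0.1], [0.1, 0.9]]
print(partition_coefficient(u), xie_beni(points, centres, u))
```

With FCM, such an index is typically evaluated once per candidate number of clusters, and the value of c with the best score is retained.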

GIVING FUZZINESS TO SPATIAL CLUSTERS: A NEW INDEX FOR CHOOSING THE OPTIMAL NUMBER OF CLUSTERS

International Journal of Artificial Intelligence Tools ◽

10.1142/s0218213013500097 ◽

2013 ◽

Vol 22 (03) ◽

pp. 1350009 ◽

Cited By ~ 2

Author(s):

GEORGE GREKOUSIS

Keyword(s):

Fuzzy Clustering ◽

Spatial Clustering ◽

Clustering Algorithms ◽

Optimal Number ◽

Fuzzy Cluster ◽

Cluster Validation ◽

Number Of Clusters ◽

A Value ◽

Membership Value ◽

Optimal Number Of Clusters

Choosing the optimal number of clusters is a key issue in cluster analysis, and things tend to be more complicated in spatial clustering. Cluster validation helps to determine the appropriate number of clusters present in a dataset. Furthermore, cluster validation evaluates and assesses the results of clustering algorithms. There are numerous methods and techniques for choosing the optimal number of clusters via crisp and fuzzy clustering. In this paper, we introduce a new index for fuzzy clustering to determine the optimal number of clusters. This index is not another metric for calculating compactness or separation among partitions. Instead, the index uses several existing indices to give a degree, or fuzziness, to the optimal number of clusters. In this way, not only do the objects in a fuzzy cluster get a membership value, but the number of clusters to be partitioned is given a value as well. The new index is used in the fuzzy c-means algorithm for the geodemographic segmentation of 285 postal codes.
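
The abstract does not give the formula of the new index, so the following is only a hypothetical sketch of the general idea of turning several existing indices into a degree for each candidate number of clusters. Each index's scores are min-max normalised over the candidates, flipped when lower is better, and averaged. The function name, the averaging rule, and the numbers are made up for illustration and are not the paper's method or results.

```python
def degree_per_k(index_scores, higher_is_better):
    """index_scores: {index_name: {k: score}}; returns {k: degree in [0, 1]}."""
    ks = sorted(next(iter(index_scores.values())))
    degree = {k: 0.0 for k in ks}
    for name, scores in index_scores.items():
        lo, hi = min(scores.values()), max(scores.values())
        for k in ks:
            # Min-max normalise; a flat index contributes a neutral 0.5.
            norm = 0.5 if hi == lo else (scores[k] - lo) / (hi - lo)
            if not higher_is_better[name]:
                norm = 1.0 - norm
            degree[k] += norm / len(index_scores)
    return degree

# Made-up index values for k = 2, 3, 4 (illustration only).
index_scores = {
    "PC": {2: 0.80, 3: 0.90, 4: 0.70},   # partition coefficient, higher is better
    "XB": {2: 0.30, 3: 0.10, 4: 0.50},   # Xie-Beni, lower is better
}
higher_is_better = {"PC": True, "XB": False}
degree = degree_per_k(index_scores, higher_is_better)
print(degree)
```

Every candidate k ends up with a value between 0 and 1 rather than a single winner being declared, which mirrors the abstract's description of giving the number of clusters a membership-like value.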

A New Algorithm for Fuzzy Clustering Able to Find the Optimal Number of Clusters

2012 IEEE 24th International Conference on Tools with Artificial Intelligence ◽

10.1109/ictai.2012.174 ◽

2012 ◽

Author(s):

A. Balkis ◽

S. B. Yahia ◽

A. Bouzeghoub

Keyword(s):

Fuzzy Clustering ◽

Optimal Number ◽

Number Of Clusters ◽

Optimal Number Of Clusters

A validity measure for fuzzy clustering and its use in selecting optimal number of clusters

Proceedings of IEEE 5th International Fuzzy Systems ◽

10.1109/fuzzy.1996.552318 ◽

2002 ◽

Cited By ~ 14

Author(s):

Hyun-Sook Rhee ◽

Kyung-Whan Oh

Keyword(s):

Fuzzy Clustering ◽

Optimal Number ◽

Number Of Clusters ◽

Validity Measure ◽

Optimal Number Of Clusters

Research on Fuzzy Clustering Validity

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.40-41.174 ◽

2010 ◽

Vol 40-41 ◽

pp. 174-182

Author(s):

Wei Jin Chen ◽

Huai Lin Dong ◽

Qing Feng Wu ◽

Ling Lin

Keyword(s):

Cluster Analysis ◽

Fuzzy Clustering ◽

Clustering Analysis ◽

Optimal Number ◽

Number Of Clusters ◽

Fuzzy Partition ◽

Geometry Structure ◽

Clustering Validity ◽

Optimal Number Of Clusters

The evaluation of clustering validity is important for clustering analysis and is one of its most active research topics. The quality of a clustering evaluation lies in whether the optimal number of clusters it identifies is reasonable. For fuzzy clustering, the paper surveys the widely known validity evaluation methods based on fuzzy partition, geometric structure and statistics.

Particle Swarm Optimization Based Fuzzy Clustering Approach to Identify Optimal Number of Clusters

Journal of Artificial Intelligence and Soft Computing Research ◽

10.2478/jaiscr-2014-0024 ◽

2014 ◽

Vol 4 (1) ◽

pp. 43-56 ◽

Cited By ~ 17

Author(s):

Min Chen ◽

Simone A. Ludwig

Keyword(s):

Cluster Analysis ◽

Particle Swarm Optimization ◽

Fuzzy Clustering ◽

Particle Swarm ◽

Optimal Number ◽

Swarm Optimization ◽

Number Of Clusters ◽

Sammon Mapping ◽

Clustering Approach ◽

Optimal Number Of Clusters

Fuzzy clustering is a popular unsupervised learning method that is used in cluster analysis. Fuzzy clustering allows a data point to belong to two or more clusters. Fuzzy c-means is the best-known method applied to cluster analysis; its shortcoming, however, is that the number of clusters needs to be predefined. This paper proposes a clustering approach based on Particle Swarm Optimization (PSO). This PSO approach determines the optimal number of clusters automatically with the help of a threshold vector. The algorithm first randomly partitions the data set within a preset number of clusters, and then uses a reconstruction criterion to evaluate the performance of the clustering results. The experiments conducted demonstrate that the proposed algorithm automatically finds the optimal number of clusters. Furthermore, to visualize the results, principal component analysis projection, conventional Sammon mapping, and fuzzy Sammon mapping were used.
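
The abstract does not spell out the reconstruction criterion, so the sketch below assumes one common form used with fuzzy partitions: each point is rebuilt as the membership-weighted mean of the cluster centres, and the criterion is the total squared distance between the points and their reconstructions. The PSO search and the threshold vector are not shown, and the function name is hypothetical.

```python
def reconstruction_error(points, centres, u, m=2.0):
    """Sum of squared distances between points and their fuzzy reconstructions."""
    err = 0.0
    for k, p in enumerate(points):
        w = [u[k][i] ** m for i in range(len(centres))]
        # Reconstruct the point from the centres, weighted by u_ik ** m.
        recon = [sum(wi * c[j] for wi, c in zip(w, centres)) / sum(w)
                 for j in range(len(p))]
        err += sum((a - b) ** 2 for a, b in zip(p, recon))
    return err

centres = [(0.0, 0.0), (2.0, 0.0)]
# A point sitting on a centre with crisp membership is rebuilt exactly.
print(reconstruction_error([(0.0, 0.0)], centres, [[1.0, 0.0]]))
# With equal memberships the reconstruction is the midpoint (1, 0).
print(reconstruction_error([(0.0, 0.0)], centres, [[0.5, 0.5]]))
```

A lower value means the centres and memberships together describe the data more faithfully, which is what allows such a criterion to rank candidate partitions.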

Method for determining optimal number of clusters in K-means clustering algorithm

Journal of Computer Applications ◽

10.3724/sp.j.1087.2010.01995 ◽

2010 ◽

Vol 30 (8) ◽

pp. 1995-1998 ◽

Cited By ~ 18

Author(s):

Shi-bing ZHOU ◽

Zhen-yuan XU ◽

Xu-qing TANG

Keyword(s):

Clustering Algorithm ◽

Optimal Number ◽

Number Of Clusters ◽

Optimal Number Of Clusters

Clustering Count-based RNA Methylation Data Using a Nonparametric Generative Model

Current Bioinformatics ◽

10.2174/1574893613666180601080008 ◽

2018 ◽

Vol 14 (1) ◽

pp. 11-23 ◽

Cited By ~ 3

Author(s):

Lin Zhang ◽

Yanling He ◽

Huaizhi Wang ◽

Hui Liu ◽

Yufei Huang ◽

...

Keyword(s):

Clustering Analysis ◽

Methylation Level ◽

Optimal Number ◽

Generative Model ◽

Methylation Data ◽

Sequencing Data ◽

Number Of Clusters ◽

Rna Methylation ◽

Clustering Effect ◽

Optimal Number Of Clusters

Background: The RNA methylome has been discovered as an important layer of gene regulation and can be profiled directly with count-based measurements from high-throughput sequencing data. Although the detailed regulatory circuit of the epitranscriptome remains uncharted, a clustering effect in methylation status among different RNA methylation sites can be identified from transcriptome-wide RNA methylation profiles and may reflect epitranscriptomic regulation. Count-based RNA methylation sequencing data has unique features, such as low read coverage, which call for novel clustering approaches. Objective: Besides handling the low read coverage, clustering analysis of count-based RNA methylation sequencing data also needs to preserve the integer property of the measurements. Method: We proposed a nonparametric generative model together with its Gibbs sampling solution for clustering analysis. The proposed approach implements a beta-binomial mixture model to capture the clustering effect in methylation level with the original count-based measurements rather than an estimated continuous methylation level. In addition, it adopts a nonparametric Dirichlet process to automatically determine an optimal number of clusters so as to avoid the common model selection problem in clustering analysis. Results: When tested on the simulated system, the method demonstrated improved clustering performance over hierarchical clustering, K-means, MClust, NMF and EMclust. On a real dataset, it also revealed two novel RNA N6-methyladenosine (m6A) co-methylation patterns that may be induced directly by METTL14 and WTAP, which are two known regulatory components of the RNA m6A methyltransferase complex. Conclusion: Our proposed DPBBM method not only properly handles the count-based measurements of RNA methylation data from sites of very low read coverage, but also learns an optimal number of clusters adaptively from the data analyzed.
Availability: The source code and documents of DPBBM R package are freely available through the Comprehensive R Archive Network (CRAN): https://cran.r-project.org/web/packages/DPBBM/.
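
For readers unfamiliar with the beta-binomial component of such a mixture, the sketch below evaluates its log-probability for `k` methylated reads out of `n` total reads at one site, using only the standard library. It is not taken from the DPBBM package: the function names and the parameter names `alpha` and `beta` are assumptions, and neither the Dirichlet process prior nor the Gibbs sampler is shown.

```python
import math

def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def beta_binomial_logpmf(k, n, alpha, beta):
    """log P(k successes in n trials) under a beta-binomial(alpha, beta) model."""
    log_choose = (math.lgamma(n + 1) - math.lgamma(k + 1)
                  - math.lgamma(n - k + 1))
    return log_choose + log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta)

# Example: 3 methylated reads out of 10, with a cluster favouring low methylation.
print(math.exp(beta_binomial_logpmf(3, 10, alpha=2.0, beta=8.0)))
```

Because the likelihood works directly on integer counts, a site with few reads naturally contributes a flatter, less informative likelihood than a deeply covered one, which is the property the abstract highlights for low-coverage sites.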
