ALTERNATIVE TERMINATION CRITERION FOR K-SPECIFIED CRISP DATA CLUSTERING ALGORITHMS

Author(s): Volodymyr Mosorov, Taras Panskyi, Sebastian Biedron

In this paper, the performance of a termination criterion for k-specified (namely k-means) crisp data partitioning pre-clustering algorithms is analyzed. The results have been evaluated using clustering validity indices. The termination criterion allows data with any number of clusters to be analyzed; moreover, in contrast to the known validity indices, the introduced criterion also makes it possible to analyze data that form a single cluster.
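The abstract does not spell out the criterion itself. As a rough, hedged illustration of the general idea of coupling k-means with validity-based stopping, the sketch below scans candidate values of k and terminates once a standard validity index (the silhouette, used here only as a stand-in) stops improving; standard indices such as the silhouette are undefined for a single cluster, which is exactly the case the proposed criterion is said to handle.

```python
# A hedged sketch, not the authors' criterion: scan k with k-means and stop
# once the silhouette index (a stand-in validity index) stops improving.
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def scan_k_with_termination(X, k_max=10, tol=1e-3, random_state=0):
    best_k, best_score = 2, -1.0
    for k in range(2, k_max + 1):
        labels = KMeans(n_clusters=k, n_init=10,
                        random_state=random_state).fit_predict(X)
        score = silhouette_score(X, labels)
        if score > best_score + tol:
            best_k, best_score = k, score
        else:
            break  # validity index no longer improves: terminate the scan
    return best_k, best_score
```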

Algorithms, 2018, Vol 11 (11), pp. 177
Author(s): Xuedong Gao, Minghan Yang

Clustering is one of the main tasks of machine learning. Internal clustering validation indexes (CVIs) are used to measure the quality of several clustered partitions to determine the locally optimal clustering result in an unsupervised manner, and can act as the objective function of clustering algorithms. In this paper, we first studied several well-known internal CVIs for categorical data clustering and showed that they cannot effectively evaluate partitions with different numbers of clusters without any inter-cluster separation measure or assumption; the accuracy of the separation measure, along with its coordination with the intra-cluster compactness measure, notably affects performance. Then, aiming to enhance internal clustering validation measurement, we proposed a new internal CVI, the clustering utility based on the averaged information gain of isolating each cluster (CUBAGE), which measures both the compactness and the separation of the partition. The experimental results supported our findings with regard to the existing internal CVIs, and showed that the proposed CUBAGE outperforms other internal CVIs with or without a pre-known number of clusters.
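The CUBAGE formula itself is not reproduced in the abstract. As a hedged stand-in, the sketch below computes the classic category-utility score for a categorical partition, which captures a similar intuition (how much isolating each cluster sharpens the attribute distributions); it is not the published CUBAGE measure.

```python
# Category utility for a categorical partition: a stand-in sketch related in
# spirit to CUBAGE, but NOT the published CUBAGE formula.
import numpy as np

def category_utility(X, labels):
    """X: 2-D array of categorical codes, labels: cluster assignment per row."""
    n, d = X.shape
    clusters = np.unique(labels)
    # marginal sum over attributes of sum_v P(A_j = v)^2
    base = sum(
        np.sum((np.unique(X[:, j], return_counts=True)[1] / n) ** 2)
        for j in range(d)
    )
    cu = 0.0
    for c in clusters:
        Xc = X[labels == c]
        p_c = len(Xc) / n
        # within-cluster sum over attributes of sum_v P(A_j = v | C)^2
        within = sum(
            np.sum((np.unique(Xc[:, j], return_counts=True)[1] / len(Xc)) ** 2)
            for j in range(d)
        )
        cu += p_c * (within - base)
    return cu / len(clusters)
```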


2019, Vol 2019, pp. 1-9
Author(s): Yongli Liu, Xiaoyang Zhang, Jingli Chen, Hao Chao

Because traditional fuzzy clustering validity indices need to specify the number of clusters and are sensitive to noisy data, we propose a validity index for fuzzy clustering, named CSBM (compactness separateness bipartite modularity), based on bipartite modularity. CSBM enhances robustness by combining intraclass compactness and interclass separateness and can automatically determine the optimal number of clusters. To estimate the performance of CSBM, we carried out experiments on six real datasets and compared CSBM with six other prominent indices. Experimental results show that the CSBM index performs best in terms of robustness while accurately detecting the number of clusters.
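The CSBM formula is not given in the abstract. As a hedged stand-in illustrating how a fuzzy validity index trades off intraclass compactness against interclass separateness, the sketch below computes the classic Xie-Beni index from a fuzzy membership matrix and cluster centers (lower values indicate better partitions).

```python
# Xie-Beni index: a stand-in fuzzy validity index, not the CSBM index itself.
import numpy as np

def xie_beni(X, centers, U, m=2.0):
    """X: (n, d) data, centers: (c, d) cluster centers, U: (c, n) memberships."""
    n = X.shape[0]
    dist2 = np.linalg.norm(X[None, :, :] - centers[:, None, :], axis=2) ** 2  # (c, n)
    compactness = np.sum((U ** m) * dist2)
    # smallest squared distance between any two cluster centers
    separation = min(
        np.linalg.norm(ci - cj) ** 2
        for i, ci in enumerate(centers)
        for j, cj in enumerate(centers) if i < j
    )
    return compactness / (n * separation)
```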


Ultrasound imaging is a technique used to examine the inside of the human body with images generated using high-frequency sound waves. Applications of ultrasound imaging include the examination of body parts such as the kidney, liver, heart, and ovaries. This paper concentrates on ultrasound images of ovaries. Monitoring of follicles is important in human reproduction. This paper presents a method for follicle detection in ultrasound images of ovaries using an adaptive data clustering algorithm. A main requirement of many clustering algorithms is initializing the value of K, the number of clusters, and estimating this value for a given dataset is a difficult task. The adaptive data clustering algorithm presented here generates accurate segmentation results with simple operations and avoids the interactive input of the K value for segmentation. The results show that adaptive data clustering algorithms outperform standard clustering algorithms for ultrasound image segmentation. After segmentation, the follicles in the ovary image are identified using the region properties of the image. The proposed algorithm is tested on sample ultrasound images of ovaries for the identification of follicles, and, based on their geometric region properties, the ovaries are classified into normal, cystic, and polycystic categories.
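A hedged sketch of the post-segmentation step described above: intensity-based clustering of a grayscale ultrasound image followed by region-property filtering to flag follicle-like regions. The number of clusters, the darkest-cluster assumption, and the area and eccentricity thresholds are illustrative choices, not values from the paper.

```python
# Illustrative pipeline: k-means on pixel intensities, then region properties.
import numpy as np
from sklearn.cluster import KMeans
from skimage.measure import label, regionprops

def detect_follicles(image, k=3, min_area=50, max_eccentricity=0.85):
    """image: 2-D grayscale ultrasound array; returns candidate follicle regions."""
    pixels = image.reshape(-1, 1).astype(float)
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(pixels)
    # assumption: follicles are fluid-filled and appear dark (hypoechoic),
    # so keep the cluster with the lowest intensity center
    darkest = int(np.argmin(km.cluster_centers_.ravel()))
    mask = (km.labels_ == darkest).reshape(image.shape)
    regions = regionprops(label(mask))
    return [r for r in regions
            if r.area >= min_area and r.eccentricity <= max_eccentricity]
```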


2021
Author(s): Amir Mosavi, Majid

Identifying the number of oil families in petroleum basins provides practical and valuable information in petroleum geochemistry studies, from exploration to development. Oil family grouping helps track migration pathways, identify the number of active source rocks, and examine reservoir continuity. To date, almost all oil family typing studies have used common statistical methods such as principal component analysis (PCA) and hierarchical clustering analysis (HCA). However, there is no publication on using artificial neural networks (ANNs) to examine oil families in petroleum basins, so oil family typing calls for novel techniques beyond these common and overused methods. This paper is the first report of oil family typing using ANNs as robust computational methods. To this end, a self-organizing map (SOM) neural network combined with three clustering validity indices was applied to oil samples from oilfields in the Iranian sector of the Persian Gulf. For the SOM network, ten default clusters were selected at first. Afterwards, three effective clustering validity coefficients, namely the Calinski-Harabasz (CH), Silhouette (SI), and Davies-Bouldin (DB) indices, were applied to find the optimum number of clusters. Among the ten default clusters, the maximum CH (62) and SI (0.58) were obtained for four clusters; likewise, the lowest DB (0.8) was obtained for four clusters. Thus, all three validation coefficients identified four clusters as the optimum number of clusters, or oil families. The number of oil families identified in the present report is consistent with those previously reported by other researchers in the same study area. However, the techniques used here, which have not been applied for this purpose before, can be introduced as a more straightforward approach to clustering for oil family typing than the common and overused PCA and HCA methods.
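A minimal sketch of the validation step described above: scoring candidate partitions with the Calinski-Harabasz, Silhouette, and Davies-Bouldin indices from scikit-learn. K-means stands in here for the SOM-derived partitions, since the SOM configuration is not detailed in the abstract.

```python
# Score candidate cluster counts with three standard validity indices.
from sklearn.cluster import KMeans
from sklearn.metrics import (calinski_harabasz_score, silhouette_score,
                             davies_bouldin_score)

def score_partitions(X, k_values=range(2, 11)):
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        scores[k] = {
            "CH": calinski_harabasz_score(X, labels),  # higher is better
            "SI": silhouette_score(X, labels),         # higher is better
            "DB": davies_bouldin_score(X, labels),     # lower is better
        }
    return scores
```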


2020, Vol 5 (1)
Author(s): Zijing Liu, Mauricio Barahona

Abstract: We present a graph-theoretical approach to data clustering, which combines the creation of a graph from the data with Markov Stability, a multiscale community detection framework. We show how the multiscale capabilities of the method allow the estimation of the number of clusters, as well as alleviating the sensitivity to the parameters of graph construction. We use both synthetic and benchmark real datasets to compare and evaluate several graph construction methods and clustering algorithms, and show that multiscale graph-based clustering achieves improved performance compared to popular clustering methods without the need to set the number of clusters externally.
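A hedged sketch of the first stage of such a pipeline, under simplifying assumptions: build a k-nearest-neighbour graph from the data and run a community detection algorithm on it. Greedy modularity optimisation is used purely as a stand-in; the paper's Markov Stability framework instead sweeps a Markov time to obtain partitions at multiple scales.

```python
# kNN graph construction plus community detection (stand-in for Markov Stability).
import numpy as np
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities
from sklearn.neighbors import kneighbors_graph

def graph_based_clusters(X, n_neighbors=10):
    # sparse connectivity matrix of the k-nearest-neighbour graph
    A = kneighbors_graph(X, n_neighbors, mode="connectivity", include_self=False)
    G = nx.from_scipy_sparse_array(A)
    communities = greedy_modularity_communities(G)
    labels = np.empty(len(X), dtype=int)
    for c, nodes in enumerate(communities):
        labels[list(nodes)] = c
    return labels
```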


2021
Author(s): Yashuang Mu, Wei Wei, Hongyue Guo, Lijun Sun

Abstract: In this study, a layered parallel algorithm based on the fuzzy c-means (FCM) technique, called LP-FCM, is proposed in the framework of Map-Reduce for data clustering problems. The LP-FCM contains three layers. The first layer follows a parallel data partitioning method that randomly divides the original dataset into several subdatasets. The second layer uses a parallel cluster-center searching method based on Map-Reduce, where the classic FCM algorithm searches for the cluster centers of each subdataset in the Map phase, and all centers are gathered in the Reduce phase, where FCM is applied again to confirm the final cluster centers. The third layer implements a parallel data clustering method based on the final cluster centers. The feasibility of LP-FCM in terms of clustering accuracy is evaluated by comparing it with classic randomly initialized sequential clustering algorithms, including K-means, K-medoids, FCM, and MinMax K-means, on 20 benchmark datasets. Furthermore, clustering time and parallel performance are tested on generated large-scale datasets.
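A simplified, single-machine sketch of the three-layer idea, with a Python process pool standing in for Map-Reduce: split the data, run FCM on each chunk to obtain local centers, run FCM again on the pooled centers, and assign every point to its nearest final center. The fcm() routine below is a plain textbook implementation, not the authors' code.

```python
# Layered FCM sketch: partition -> map (local FCM) -> reduce (FCM on centers)
# -> assign. A process pool stands in for a Map-Reduce cluster.
import numpy as np
from multiprocessing import Pool

def fcm(X, c, m=2.0, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    U = rng.random((c, len(X)))
    U /= U.sum(axis=0)                          # normalize memberships per point
    for _ in range(n_iter):
        Um = U ** m
        centers = (Um @ X) / Um.sum(axis=1, keepdims=True)
        d = np.linalg.norm(X[None, :, :] - centers[:, None, :], axis=2) + 1e-12
        U = 1.0 / np.sum((d[:, None, :] / d[None, :, :]) ** (2 / (m - 1)), axis=1)
    return centers

def lp_fcm(X, c, n_parts=4):
    parts = np.array_split(X, n_parts)                      # layer 1: partition
    with Pool(n_parts) as pool:                             # layer 2: map
        local = pool.starmap(fcm, [(p, c) for p in parts])
    final_centers = fcm(np.vstack(local), c)                # layer 2: reduce
    d = np.linalg.norm(X[:, None, :] - final_centers[None, :, :], axis=2)
    return d.argmin(axis=1), final_centers                  # layer 3: assign
```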


Water, 2020, Vol 12 (5), pp. 1372
Author(s): Nikhil Bhatia, Jency M. Sojan, Slobodan Simonovic, Roshan Srivastav

The delineation of precipitation regions aims to identify homogeneous zones in which the characteristics of the precipitation process are statistically similar. The regionalization process has three main components: (i) delineation of regions using clustering algorithms, (ii) determination of the optimal number of regions using cluster validity indices (CVIs), and (iii) validation of regions for homogeneity using the L-moments ratio test. The identification of the optimal number of clusters significantly affects the homogeneity of the regions. The objective of this study is to investigate the performance of various CVIs in identifying the optimal number of clusters that maximizes the homogeneity of the precipitation regions. The k-means clustering algorithm is adopted to delineate the regions using location-based attributes for two large areas in Canada, namely the Prairies and the Great Lakes-St Lawrence lowlands (GL-SL) region. Seasonal precipitation data for 55 years (1951–2005) are derived using the high-resolution ANUSPLIN gridded point data for Canada. The results indicate that the optimal number of clusters and the regional homogeneity depend on the CVI adopted. Among the 42 cluster indices considered, 15 outperform the others in identifying homogeneous precipitation regions. The Dunn, Det_ratio, and Trace(W⁻¹B) indices were found to be the best for all seasons in both regions.
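As a concrete illustration of one of the reported indices, the sketch below computes the Dunn index: the ratio of the smallest inter-cluster distance to the largest cluster diameter (higher is better). The pairwise-distance implementation is illustrative; the cluster labels would come from the k-means regionalization step.

```python
# Dunn index: min inter-cluster distance / max intra-cluster diameter.
import numpy as np
from scipy.spatial.distance import cdist

def dunn_index(X, labels):
    clusters = [X[labels == c] for c in np.unique(labels)]
    # largest intra-cluster diameter
    max_diam = max(cdist(c, c).max() for c in clusters)
    # smallest distance between points belonging to different clusters
    min_sep = min(cdist(a, b).min()
                  for i, a in enumerate(clusters)
                  for j, b in enumerate(clusters) if i < j)
    return min_sep / max_diam
```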


2017, Vol 26 (3), pp. 483-503
Author(s): Vijay Kumar, Jitender Kumar Chhabra, Dinesh Kumar

Abstract: Finding the optimal number of clusters and the appropriate partitioning of a given dataset are two major challenges in clustering, and cluster validity indices are used for both. In this paper, seven widely used cluster validity indices, namely the DB, PS, I, XB, FS, K, and SV indices, have been developed based on line-symmetry distance measures. These indices measure the line symmetry present in a partitioning of the dataset and are able to detect clusters of any shape or size, as long as the clusters possess the property of line symmetry. The performance of these indices is evaluated on three clustering algorithms: K-means, fuzzy C-means, and modified harmony search-based clustering (MHSC). The efficacy of the symmetry-based validity indices is demonstrated on six artificial and six real-life datasets, with the number of clusters varying from 2 to √n, where n is the total number of data points in the dataset. The experimental results reveal that incorporating the line-symmetry-based distance improves the ability of these existing validity indices to find the appropriate number of clusters. These indices are compared with the point-symmetry-based and original versions of the seven validity indices. The results also demonstrate that the MHSC technique performs better than other well-known clustering techniques. For the real-life datasets, an analysis of variance is also performed.
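As a hedged illustration of the symmetry idea behind these indices, the sketch below computes a simple point-symmetry distance: a point is considered symmetric with respect to a cluster if its reflection about the cluster center lies close to existing data points. The line-symmetry variant used in the paper reflects about a cluster's principal axis instead, and the neighbour count k here is an illustrative choice.

```python
# Point-symmetry distance sketch (simpler sibling of the line-symmetry distance).
import numpy as np
from sklearn.neighbors import NearestNeighbors

def point_symmetry_distance(x, center, X, k=2):
    """Average distance from the reflection of x about center to its k nearest
    neighbours in X (smaller means x is more symmetric w.r.t. the cluster)."""
    reflected = 2.0 * center - x
    nn = NearestNeighbors(n_neighbors=k).fit(X)
    dists, _ = nn.kneighbors(reflected.reshape(1, -1))
    return float(dists.mean())
```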

