Method for determining optimal number of clusters in K-means clustering algorithm

2010 ◽  
Vol 30 (8) ◽  
pp. 1995-1998 ◽  
Author(s):  
Shi-bing ZHOU ◽  
Zhen-yuan XU ◽  
Xu-qing TANG


2021 ◽  
Author(s):  
Congming Shi ◽  
Bingtao Wei ◽  
Shoulin Wei ◽  
Wen Wang ◽  
Hai Liu ◽  
...  

Abstract Clustering, a traditional machine learning method, plays a significant role in data analysis. Most clustering algorithms depend on a predetermined exact number of clusters, whereas, in practice, clusters are usually unpredictable. Although the Elbow method is one of the most commonly used methods to discriminate the optimal cluster number, the discriminant of the number of clusters depends on the manual identification of the elbow points on the visualization curve. Thus, experienced analysts cannot clearly identify the elbow point from the plotted curve when the plotted curve is fairly smooth. To solve this problem, a new elbow point discriminant method is proposed to yield a statistical metric that estimates an optimal cluster number when clustering on a dataset. First, the average degree of distortion obtained by the Elbow method is normalized to the range of 0 to 10. Second, the normalized results are used to calculate the cosine of intersection angles between elbow points. Third, this calculated cosine of intersection angles and the arccosine theorem are used to compute the intersection angles between elbow points. Finally, the index of the above computed minimal intersection angles between elbow points is used as the estimated potential optimal cluster number. The experimental results based on simulated datasets and a well-known public dataset (Iris Dataset) demonstrated that the estimated optimal cluster number obtained by our newly proposed method is better than the widely used Silhouette method.
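The four steps in the abstract above can be sketched roughly as follows. This is one plausible reading of the description, not the authors' implementation: the function names `normalize` and `elbow_by_angle` are hypothetical, the distortion values are assumed to be computed beforehand (for example, the mean within-cluster distortion for each candidate k), and the horizontal axis is assumed to be k itself with unit spacing.

```python
import math

def normalize(values, lo=0.0, hi=10.0):
    """Min-max scale the distortions into [lo, hi] (step 1 of the abstract)."""
    vmin, vmax = min(values), max(values)
    if vmax == vmin:
        raise ValueError("distortions are constant; there is no elbow to find")
    return [lo + (v - vmin) * (hi - lo) / (vmax - vmin) for v in values]

def elbow_by_angle(ks, distortions):
    """Return (k, angle) for the interior point with the smallest angle.

    The angle at point i is measured between the segments joining it to
    its left and right neighbours on the normalized curve (steps 2 to 4).
    """
    if len(ks) != len(distortions) or len(ks) < 3:
        raise ValueError("need at least three (k, distortion) pairs")
    d = normalize(distortions)
    best_k, best_angle = None, math.inf
    for i in range(1, len(ks) - 1):
        # Vectors from the candidate point to its two neighbours.
        ax, ay = ks[i - 1] - ks[i], d[i - 1] - d[i]
        bx, by = ks[i + 1] - ks[i], d[i + 1] - d[i]
        cos_t = (ax * bx + ay * by) / (math.hypot(ax, ay) * math.hypot(bx, by))
        # Clamp against rounding error before taking the arccosine.
        angle = math.acos(max(-1.0, min(1.0, cos_t)))
        if angle < best_angle:
            best_k, best_angle = ks[i], angle
    return best_k, best_angle

# Illustrative (made-up) distortions with a visible bend at k = 3.
ks = [1, 2, 3, 4, 5, 6]
distortions = [100.0, 60.0, 20.0, 15.0, 12.0, 10.0]
k_hat, angle = elbow_by_angle(ks, distortions)
```

Because the angle depends on the relative scale of the two axes, the normalization range matters; the 0 to 10 range here simply follows the abstract.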


2020 ◽  
Vol 9 (1) ◽  
pp. 1
Author(s):  
L. W. Rizkallah ◽  
M. F. Ahmed ◽  
N. M. Darwish

The Vehicle Routing Problem (VRP) consists of a group of customers that need to be served. Each customer has a certain demand for goods. A central depot with a fleet of vehicles is responsible for supplying the customers' demands. The problem is composed of two sub-problems. The first is an assignment problem, in which both the vehicles to be used and the customers assigned to each vehicle are determined. The second is the routing problem, in which, for each vehicle with a number of customers assigned to it, the order of customer visits is determined. Both the optimal number of vehicles and the optimal total distance should be achieved. In this paper, an approach for solving the first sub-problem, the assignment problem, is presented. In the approach, a clustering algorithm is proposed for finding the optimal number of vehicles by grouping the customers into clusters, where each cluster is visited by one vehicle. This work presents a polynomial-time clustering algorithm for finding the optimal number of clusters. Also, a solution to the assignment problem is provided. The proposed approach was evaluated using Solomon’s C1 benchmarks, where it reached the optimal number of clusters for all the benchmarks in this category. The proposed approach succeeds in solving the assignment problem in VRP, achieving a solving time that surpasses the state-of-the-art approaches in the literature. It also provides a means of working with a varying number of customers without a major increase in solving time.


Author(s):  
Kristína Lehocká ◽  
Barbora Olšanská ◽  
Radovan Kasarda ◽  
Ondrej Kadlečík ◽  
Anna Trakovická ◽  
...  

The objective of the study was to determine the membership probability and level of admixture among Slovak Spotted cattle and historically related breeds (Ayrshire, Holstein, Swiss Simmental and Slovak Pinzgau). The analysis was based on a panel of 35 934 SNPs that were used for genotyping 423 individuals. The optimal number of clusters was estimated in two ways: by analysis of the Bayesian information criterion and by a Bayesian clustering algorithm. The optimal number of clusters ranged from 3 to 5, depending on the applied approach. Subsequently, the population structure was tested by discriminant analysis of principal components (DAPC) and by unsupervised Bayesian analysis based on the correlated allele frequencies model. The first discriminant function revealed three genetic clusters in the population, resulting from the production type and origin of the analysed breeds. The unsupervised Bayesian analysis showed similar results, with the highest level of admixture found between Slovak Pinzgau and Slovak Spotted cattle (0.6%). Despite that, the results of this study clearly showed that Slovak Spotted cattle are genetically separated from the other breeds that were involved in their grading-up process.


2020 ◽  
Author(s):  
Congming Shi ◽  
Bingtao Wei ◽  
Shoulin Wei ◽  
Wen Wang ◽  
Hai Liu ◽  
...  

Abstract Clustering, as a traditional machine learning method, still plays a significant role in data analysis. Most clustering algorithms depend on a predetermined exact number of clusters, whereas, in practice, clusters are usually unpredictable. Although the Elbow method is one of the most commonly used methods to discriminate the optimal cluster number, the discriminant of the number of clusters depends on manual identification of the elbow points on the visualization curve, so that even experienced analysts cannot clearly identify the elbow point when the plotted curve is fairly smooth. To solve this problem, a new elbow point discriminant method is proposed to work out a statistical metric estimating an optimal cluster number when clustering on a dataset. First, the average degree of distortion obtained by the Elbow method is normalized to the range of 0 to 10. Second, the normalized results are used to calculate the cosine of intersection angles between elbow points. Third, the calculated cosine of intersection angles and the arccosine theorem are used to compute the intersection angles between elbow points. Finally, the index of the minimal computed intersection angle between elbow points is used as the estimated potential optimal cluster number. The experimental results based on simulated datasets and a well-known public dataset demonstrated that the estimated optimal cluster number output by the newly proposed method is better than that of the widely used Silhouette method.


2015 ◽  
Vol 24 (2) ◽  
pp. 215-222 ◽  
Author(s):  
Wei Jia Lu ◽  
Zhuang Zhi Yan

Abstract The fuzzy clustering algorithm has been widely used in research, production, and daily life. However, conventional fuzzy algorithms have the disadvantage of high computational complexity. This article proposes an improved fuzzy C-means (FCM) algorithm based on K-means and the principle of granularity. The algorithm aims to solve two problems of conventional FCM methods: determining the optimal number of clusters and sensitivity to data initialization. The initialization stage of the K-medoid cluster, which is different from others, has a strong representation and is capable of detecting data with different sizes. Meanwhile, by combining granular computing with FCM, the optimal number of clusters is obtained by choosing accurate validity functions. Finally, the detailed clustering process of the proposed algorithm is presented, and its performance is validated by simulation tests. The test results show that the proposed improved FCM algorithm outperforms existing FCM algorithms in computational complexity, running time, and cluster effectiveness.
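For orientation, the conventional FCM iteration that such improvements start from can be sketched as below. This shows only the standard membership and center updates, not the granularity-based method of the article; the helper names `fcm_memberships` and `fcm_centers` and the fuzzifier default `m = 2` are assumptions made for illustration.

```python
import math

def fcm_memberships(points, centers, m=2.0):
    """Standard FCM membership update: u[i][j] is the degree to which
    point i belongs to cluster j, for a fuzzifier m > 1."""
    power = 2.0 / (m - 1.0)
    u = []
    for p in points:
        dists = [math.dist(p, c) for c in centers]
        if any(d == 0.0 for d in dists):
            # The point sits exactly on one or more centers:
            # share full membership among those centers only.
            hits = [1.0 if d == 0.0 else 0.0 for d in dists]
            total = sum(hits)
            u.append([h / total for h in hits])
            continue
        u.append([1.0 / sum((dj / dk) ** power for dk in dists) for dj in dists])
    return u

def fcm_centers(points, u, m=2.0):
    """Standard FCM center update: mean of the points weighted by u ** m."""
    dim = len(points[0])
    centers = []
    for j in range(len(u[0])):
        weights = [row[j] ** m for row in u]
        total = sum(weights)
        centers.append(tuple(
            sum(w * p[d] for w, p in zip(weights, points)) / total
            for d in range(dim)
        ))
    return centers

# One iteration on a small made-up one-dimensional data set.
points = [(0.0,), (1.0,), (9.0,), (10.0,)]
centers = [(0.5,), (9.5,)]
u = fcm_memberships(points, centers)
new_centers = fcm_centers(points, u)
```

Each row of `u` sums to 1, and repeating the two updates until the memberships stop changing gives the usual FCM loop. The sensitivity to the starting `centers` is the initialization problem the abstract refers to.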


2013 ◽  
Vol 392 ◽  
pp. 803-807 ◽  
Author(s):  
Xue Bo Feng ◽  
Fang Yao ◽  
Zhi Gang Li ◽  
Xiao Jing Yang

The Fuzzy C-means clustering algorithm (FCM) clusters a data set according to the number of cluster centers, the initial cluster centers, the fuzzy factor, the number of iterations, and the threshold. FCM therefore encounters the problem of initializing the clustering prototypes. Firstly, the article combines the maximum and minimum distance algorithm with the K-means algorithm to determine the number of clusters and the initial cluster centers. Secondly, the article determines the optimal number of clusters with Silhouette indicators. Finally, the article improves the convergence rate of FCM by revising the memberships constantly. The improved FCM has a good clustering effect, enhances the optimization capability, and improves the efficiency and effectiveness of the clustering. Compared with the traditional FCM clustering method, it has better within-class tightness, between-class scatter, and cluster stability, as well as a faster convergence rate.
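Selecting the number of clusters with a Silhouette indicator can be sketched as follows. This is a minimal illustration of the standard silhouette coefficient, not the article's procedure: `silhouette_score` and `best_k` are hypothetical helper names, Euclidean distance is assumed, and the candidate labelings are assumed to come from some clustering run for each k.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points.

    For point i, a is the mean distance to the other members of its own
    cluster and b is the smallest mean distance to any other cluster;
    s(i) = (b - a) / max(a, b). Points in singleton clusters score 0.
    """
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # s(i) = 0 by convention
        a = sum(math.dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items() if other != lab
        )
        if max(a, b) > 0.0:
            total += (b - a) / max(a, b)
    return total / len(points)

def best_k(points, labelings):
    """Pick the k whose labeling has the highest mean silhouette."""
    return max(labelings, key=lambda k: silhouette_score(points, labelings[k]))

# Two well-separated pairs of points; k = 2 should score highest.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
labelings = {2: [0, 0, 1, 1], 3: [0, 0, 1, 2]}
chosen = best_k(points, labelings)
```

Scores close to 1 indicate compact, well-separated clusters, while negative scores suggest points assigned to the wrong cluster.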


Author(s):  
Aashish Kumar et al.

Software-Defined Networking is one of the most revolutionary and prominent technologies in the field of networking. It solves problems that traditional networks face. Still, it can suffer from bottlenecks and become overloaded. To overcome this issue, various researchers have proposed approaches, but these rely on only two or three parameters to perform load balancing, and they are either static or dynamic. We propose an intelligent technique that forwards packets, i.e. TCP/UDP packet traffic, based on several parameters (12 parameters, discussed in the latter part of this section). Based on these parameters, we trained models using the KMeans [1] and DBSCAN [2] clustering algorithms and also determined the optimal number of clusters. We tested the technique on large numbers of packets: 5000, 10000, 20000, 50000, 100000, and 10000000. We also compared the results of the KMeans and DBSCAN algorithms and discussed other researchers' views.


Author(s):  
Afdelia Novianti ◽  
Irsyifa Mayzela Afnan ◽  
Rafi Ilmi Badri Utama ◽  
Edy Widodo

Poverty is an essential issue for every country, including Indonesia. Poverty can be caused by the scarcity of basic necessities or the difficulty of accessing education and employment. In 2019, Papua Province was the province with the highest poverty percentage, at 27.53%. In light of this, districts in Papua Province are grouped by similar characteristics to describe poverty conditions, using the K-medoids clustering algorithm on the variables Percentage of Poor Population, Gross Regional Domestic Product, Open Unemployment Rate, Life Expectancy, Literacy Rate, and Population Working in the Agricultural Sector. The results of this study indicate that the optimal number of clusters to describe poverty conditions in Papua Province is 4, with a variance of 0.012, where the first cluster consists of 10 districts, the second of 5 districts, the third of 12 districts, and the fourth of 2 districts.



