An examination of procedures for determining the number of clusters in a data set

Author(s):  
André Hardy
Keyword(s):  
Data Set ◽  
2013 ◽  
Vol 321-324 ◽  
pp. 1947-1950
Author(s):  
Lei Gu ◽  
Xian Ling Lu

In the initialization of traditional k-harmonic means clustering, the initial centers are generated randomly and their number equals the number of clusters. Although k-harmonic means clustering is insensitive to the initial centers, this initialization method cannot improve clustering performance. In this paper, a novel k-harmonic means clustering based on multiple initial centers is proposed, in which the number of initial centers exceeds the number of clusters. The new method divides the whole data set into multiple groups and combines these groups into the final solution. Experiments show that the presented algorithm achieves better clustering accuracy than the traditional k-means and k-harmonic means methods.
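The abstract does not spell out the grouping and merging rules, so the sketch below only illustrates the over-provisioned-centers idea under stated assumptions: scikit-learn is available, plain k-means stands in for the k-harmonic means update, and agglomerative merging of the centroids stands in for the paper's combination step.

```python
# A minimal sketch of the multiple-initial-centers idea, assuming
# scikit-learn; plain k-means substitutes for the k-harmonic means update.
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

def cluster_with_extra_centers(X, n_clusters, n_initial_centers):
    # Partition X into more groups than the target number of clusters.
    km = KMeans(n_clusters=n_initial_centers, n_init=10).fit(X)
    # Merge the fine-grained centroids down to n_clusters.
    merge = AgglomerativeClustering(n_clusters=n_clusters).fit(km.cluster_centers_)
    # Relabel every point through its group's merged label.
    return merge.labels_[km.labels_]

X = np.random.rand(300, 2)                      # toy data, for illustration
labels = cluster_with_extra_centers(X, n_clusters=3, n_initial_centers=12)
```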


2021 ◽  
Vol 6 (2) ◽  
pp. 48
Author(s):  
Solmin Paembonan ◽  
Hisma Abduh

This study uses the k-means method to group similar drugs into particular clusters. One way to determine the degree of similarity between data points is to compute the distance between them: the smaller the distance, the higher the similarity, and conversely, the greater the distance, the lower the similarity. The ultimate goal of clustering is to determine the groups in a set of unlabeled data; because clustering is an unsupervised method and there is no prior indication of how many clusters may form in a data set, an evaluation of the clustering results is needed. The evaluation carried out on the clustering results yielded a silhouette coefficient of 0.4854.
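As a rough illustration of the evaluation described above, the following sketch assumes scikit-learn and uses random features as a stand-in for the drug attribute data, which is not available here.

```python
# A short sketch of k-means followed by silhouette evaluation, assuming
# scikit-learn; random features stand in for the drug data set.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.rand(200, 4)              # placeholder for drug attributes
labels = KMeans(n_clusters=3, n_init=10).fit_predict(X)
print(silhouette_score(X, labels))      # the study reports 0.4854 on its data
```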


2020 ◽  
Vol 11 (3) ◽  
pp. 42-67
Author(s):  
Soumeya Zerabi ◽  
Souham Meshoul ◽  
Samia Chikhi Boucherkha

Cluster validation aims both to evaluate the results of clustering algorithms and to predict the number of clusters. It is usually achieved using several indexes. Traditional internal clustering validation indexes (CVIs) are mainly based on computing pairwise distances, which results in a quadratic complexity of the related algorithms. The existing CVIs cannot handle large data sets properly and need to be revisited to take into account the ever-increasing volume of data sets. Therefore, parallel and distributed solutions to implement these indexes are required. To cope with this issue, the authors propose two parallel and distributed models for internal CVIs, namely the Silhouette and Dunn indexes, using the MapReduce framework under Hadoop. The proposed models, termed MR_Silhouette and MR_Dunn, have been tested both on evaluating clustering results and on identifying the optimal number of clusters. The results of the experimental study are very promising and show that the proposed parallel and distributed models achieve the expected tasks successfully.
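The Hadoop job layout of MR_Silhouette is not reproduced here; as a hedged single-machine analogue, the sketch below decomposes the silhouette computation the way a map/reduce job would, with per-point silhouette widths computed in parallel worker processes ("map") and averaged at the end ("reduce"). It assumes every cluster has at least two members.

```python
# A single-machine analogue of a MapReduce silhouette computation:
# the "map" stage scores each point in parallel, the "reduce" stage averages.
import numpy as np
from multiprocessing import Pool

def point_silhouette(args):
    i, X, labels = args
    d = np.linalg.norm(X - X[i], axis=1)
    own = labels == labels[i]
    a = d[own & (np.arange(len(X)) != i)].mean()          # within-cluster mean
    b = min(d[labels == c].mean() for c in set(labels) if c != labels[i])
    return (b - a) / max(a, b)

def mr_silhouette(X, labels, workers=4):
    with Pool(workers) as pool:                            # "map" stage
        s = pool.map(point_silhouette, [(i, X, labels) for i in range(len(X))])
    return float(np.mean(s))                               # "reduce" stage

if __name__ == "__main__":
    X = np.random.rand(200, 2)
    labels = np.random.randint(0, 3, size=200)
    print(mr_silhouette(X, labels))
```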


2013 ◽  
Vol 462-463 ◽  
pp. 438-442
Author(s):  
Ming Gu

A neural network with quadratic junctions is described, and the structure, properties, and unsupervised learning rules of the network are discussed. An ART-based hierarchical clustering algorithm using this kind of neural network is proposed. The algorithm can determine the number of clusters while clustering the data. A 2-D artificial data set is used to illustrate and compare the effectiveness of the proposed algorithm and the K-means algorithm.
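The abstract does not give the network's learning rules, so the following is only a loose sketch of an ART-style procedure in which each node's activation is a quadratic junction (a squared distance to a stored center) and an assumed vigilance radius decides whether a new node, and hence a new cluster, is created.

```python
# A loose ART-style sketch with quadratic-junction nodes; the vigilance
# radius and learning rate are assumptions, not the paper's parameters.
import numpy as np

def art_quadratic_cluster(X, vigilance=0.3, lr=0.5):
    centers = [X[0].copy()]
    labels = []
    for x in X:
        # Quadratic junction: activation decays with squared distance.
        act = [-np.sum((x - c) ** 2) for c in centers]
        j = int(np.argmax(act))
        if -act[j] <= vigilance ** 2:     # resonance: adapt the winning node
            centers[j] += lr * (x - centers[j])
        else:                             # no resonance: recruit a new node
            centers.append(x.copy())
            j = len(centers) - 1
        labels.append(j)
    return np.array(labels), np.array(centers)

labels, centers = art_quadratic_cluster(np.random.rand(200, 2))
print(len(centers))                       # number of clusters found
```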


2020 ◽  
Vol 10 (6) ◽  
pp. 1401-1407
Author(s):  
Hyungtai Kim ◽  
Minhee Lee ◽  
Min Kyun Sohn ◽  
Jongmin Lee ◽  
Deog Yung Kim ◽  
...  

This paper presents simultaneous clustering and classification, performed to discover internal groupings in an unlabeled data set while simultaneously classifying the data using the discovered clusters as class labels. During the simultaneous clustering and classification, silhouette and F1 scores were calculated for clustering and classification, respectively, as a function of the number of clusters, in order to find an optimal number of clusters that guarantees the desired level of classification performance. In this study, we applied this approach to a data set of ischemic stroke patients in order to discover function-recovery patterns where clear diagnoses do not exist. In addition, we developed a classifier that predicts the type of function recovery for new patients from early clinical test scores with clinically meaningful accuracy. This classifier can be a helpful tool for clinicians in the rehabilitation field.
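A compact sketch of the k sweep described above, assuming scikit-learn and a random forest as a stand-in for the paper's (unnamed) classifier; random features stand in for the clinical test scores.

```python
# For each candidate k: cluster, then treat the cluster labels as classes
# and score a classifier on them, tracking silhouette and F1 together.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import silhouette_score
from sklearn.model_selection import cross_val_score

X = np.random.rand(300, 5)                 # placeholder for clinical scores
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(X)
    sil = silhouette_score(X, labels)
    f1 = cross_val_score(RandomForestClassifier(), X, labels,
                         scoring="f1_macro", cv=5).mean()
    print(k, round(sil, 3), round(f1, 3))  # pick k balancing both scores
```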


2012 ◽  
Vol 532-533 ◽  
pp. 1373-1377 ◽  
Author(s):  
Ai Ping Deng ◽  
Ben Xiao ◽  
Hui Yong Yuan

To address the K-means algorithm's disadvantages of having to specify the number of clusters in advance and its sensitivity to the selection of initial cluster centers, an improved K-means algorithm is proposed in which the cluster centers and the number of clusters change dynamically. The new algorithm determines the cluster centers by calculating the density of data points and the shared-nearest-neighbor similarity, and it controls the clustering categories by using the average shared-nearest-neighbor self-similarity. Experimental results on the IRIS test data set show that the algorithm can select the cluster centers and distinguish between different types of clusters efficiently.
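The abstract summarizes rather than specifies the center-selection and category-control rules, so the sketch below only implements the two building blocks it names, shared-nearest-neighbor similarity and an SNN-based density, assuming k-nearest-neighbor lists from scikit-learn.

```python
# Building blocks of the improved K-means: SNN similarity and SNN density.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def snn_similarity(X, k=10):
    """SNN similarity = size of the overlap of two points' k-NN lists."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    idx = nn.kneighbors(X, return_distance=False)[:, 1:]   # drop self
    sets = [set(row) for row in idx]
    n = len(X)
    S = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            S[i, j] = S[j, i] = len(sets[i] & sets[j])
    return S

def snn_density(S, eps=3):
    """Density of a point = how many points share at least eps neighbors."""
    return (S >= eps).sum(axis=1)
```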


2021 ◽  
Vol 6 (1) ◽  
pp. 41
Author(s):  
I Kadek Dwi Gandika Supartha ◽  
Adi Panca Saputra Iskandar

In this study, data on STMIK STIKOM Indonesia alumni are clustered using the Fuzzy C-Means and Fuzzy Subtractive methods, with cluster validity tested by the Modified Partition Coefficient (MPC) and Classification Entropy (CE) indexes. Clustering is carried out with the aim of finding hidden patterns or information in a fairly large data set, considering that the alumni data at STMIK STIKOM Indonesia have so far not undergone any data mining process. By the MPC and CE validity indexes, the Fuzzy C-Means clustering algorithm attains higher validity than the Fuzzy Subtractive clustering algorithm, so Fuzzy C-Means can be said to cluster the alumni data better than the Fuzzy Subtractive method. The most optimal number of clusters based on the CE and MPC validity indexes is 5. The cluster with the best characteristics is cluster 1, with 514 members (36.82% of all alumni), an average GPA of 3.3617, an average study period of 7.8102 semesters, and an average final project (TA) completion time of 4.9596 months.
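The two validity indexes used above have standard definitions, sketched below for a fuzzy membership matrix U of shape (n_samples, n_clusters); higher MPC and lower CE indicate a better partition.

```python
# Standard definitions of the MPC and CE cluster validity indexes.
import numpy as np

def partition_coefficient(U):
    # PC = mean of squared memberships; ranges from 1/c (fuzziest) to 1 (crisp).
    return np.mean(np.sum(U ** 2, axis=1))

def modified_partition_coefficient(U):
    # MPC rescales PC to 0..1 so it is comparable across cluster counts.
    c = U.shape[1]
    return 1.0 - (c / (c - 1)) * (1.0 - partition_coefficient(U))

def classification_entropy(U, eps=1e-12):
    # CE = mean entropy of the membership rows; lower is better.
    return -np.mean(np.sum(U * np.log(U + eps), axis=1))
```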


Author(s):  
Md. Zakir Hossain ◽  
Md.Nasim Akhtar ◽  
R.B. Ahmad ◽  
Mostafijur Rahman

Data mining is the process of finding structure in large data sets. With this process, decision makers can make particular decisions for the further development of real-world problems. Several data clustering techniques are used in data mining to find specific patterns in data. The K-means method is one of the most familiar clustering techniques for large data sets. The K-means clustering method partitions the data set based on the assumption that the number of clusters is fixed. The main problem with this method is that if the number of clusters is chosen to be small, there is a higher probability of adding dissimilar items to the same group; on the other hand, if the number of clusters is chosen to be high, there is a higher chance of adding similar items to different groups. In this paper, we address this issue by proposing a new K-means clustering algorithm that performs data clustering dynamically. The proposed method initially calculates a threshold value as a centroid of K-means, and based on this value the clusters are formed. At each iteration of K-means, if the Euclidean distance between two points is less than or equal to the threshold value, the two data points are placed in the same group; otherwise, the proposed method creates a new cluster with the dissimilar data point. The results show that the proposed method outperforms the original K-means method.
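A hedged reading of the dynamic scheme described above: the new-cluster rule follows the text, while the concrete threshold choice (the mean distance to the initial centroid) is an assumption, since the abstract does not define it.

```python
# Threshold-driven K-means: points beyond the threshold open new clusters.
import numpy as np

def dynamic_kmeans(X, n_iter=10):
    centroids = [X.mean(axis=0)]                  # initial centroid
    threshold = np.linalg.norm(X - centroids[0], axis=1).mean()  # assumed rule
    for _ in range(n_iter):
        labels = []
        for x in X:
            d = [np.linalg.norm(x - c) for c in centroids]
            j = int(np.argmin(d))
            if d[j] <= threshold:                 # similar enough: join
                labels.append(j)
            else:                                 # too far: open a new cluster
                centroids.append(x.copy())
                labels.append(len(centroids) - 1)
        labels = np.array(labels)
        centroids = [X[labels == j].mean(axis=0) if np.any(labels == j) else c
                     for j, c in enumerate(centroids)]
    return labels, np.array(centroids)

labels, centroids = dynamic_kmeans(np.random.rand(300, 2))
```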


2019 ◽  
Author(s):  
Attila Lengyel ◽  
David W. Roberts ◽  
Zoltán Botta-Dukát

Aims: To introduce REMOS, a new iterative reallocation method (with two variants) for vegetation classification, and to compare its performance with OPTSIL. We test (1) how effectively REMOS and OPTSIL maximize mean silhouette width and minimize the number of negative silhouette widths when run on classifications with different structure; (2) how these three methods differ in runtime with different sample sizes; and (3) whether classifications by the three reallocation methods differ in the number of diagnostic species, a surrogate for interpretability.

Study area: Simulation; example data sets from grasslands in Hungary and forests in Wyoming and Utah, USA.

Methods: We classified random subsets of simulated data with the flexible-beta algorithm for different values of beta. These classifications were subsequently optimized by REMOS and OPTSIL and compared for mean silhouette widths and proportion of negative silhouette widths. Then, we classified three vegetation data sets of different sizes into two to ten clusters, optimized them with the reallocation methods, and compared their runtimes, mean silhouette widths, numbers of negative silhouette widths, and numbers of diagnostic species.

Results: In terms of mean silhouette width, OPTSIL performed best when the initial classifications already had high mean silhouette width. REMOS algorithms had slightly lower mean silhouette width than what was maximally achievable with OPTSIL, but their efficiency was consistent across different initial classifications; thus REMOS was significantly superior to OPTSIL when the initial classification had low mean silhouette width. REMOS resulted in zero or a negligible number of negative silhouette widths across all classifications. OPTSIL performed similarly when the initial classification was effective but could not reach as low a proportion of misclassified objects when the initial classification was inefficient. REMOS algorithms were typically more than an order of magnitude faster to calculate than OPTSIL. There was no clear difference between REMOS and OPTSIL in the number of diagnostic species.

Conclusions: REMOS algorithms may be preferable to OPTSIL when (1) the primary objective is to reduce or eliminate negative silhouette widths in a classification, (2) the initial classification has low mean silhouette width, or (3) the time efficiency of the algorithm is important because of the size of the data set or the high number of clusters.
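The REMOS variants differ in which objects they move per step; the rough sketch below, assuming scikit-learn's silhouette_samples, captures only the shared idea of iteratively reallocating objects with negative silhouette widths to their best-fitting cluster, not either published variant exactly.

```python
# Iterative reallocation of negative-silhouette objects, REMOS-style.
import numpy as np
from sklearn.metrics import silhouette_samples, pairwise_distances

def reallocate_negatives(X, labels, max_iter=20):
    labels = labels.copy()
    D = pairwise_distances(X)
    for _ in range(max_iter):
        s = silhouette_samples(X, labels)
        neg = np.where(s < 0)[0]
        if len(neg) == 0:                     # no misfits left: done
            break
        for i in neg:
            # Move object i to the cluster with the smallest mean distance.
            means = {}
            for c in np.unique(labels):
                mask = labels == c
                mask[i] = False               # exclude the object itself
                if mask.any():
                    means[c] = D[i, mask].mean()
            labels[i] = min(means, key=means.get)
    return labels
```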

