Estimating the number of clusters in a numerical data set via quantization error modeling

2015 ◽  
Vol 48 (3) ◽  
pp. 941-952 ◽  
Author(s):  
Alexander Kolesnikov ◽  
Elena Trichina ◽  
Tuomo Kauranne


2013 ◽  
Vol 321-324 ◽  
pp. 1947-1950
Author(s):  
Lei Gu ◽  
Xian Ling Lu

In the initialization of traditional k-harmonic means clustering, the initial centers are generated randomly and their number equals the number of clusters. Although k-harmonic means clustering is insensitive to the initial centers, this initialization method cannot improve clustering performance. In this paper, a novel k-harmonic means clustering based on multiple initial centers is proposed, in which the number of initial centers exceeds the number of clusters. With multiple initial centers, the new method divides the whole data set into multiple groups and then combines these groups into the final solution. Experiments show that the presented algorithm achieves better clustering accuracy than the traditional k-means and k-harmonic means methods.
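The abstract leaves the grouping-and-combination step unspecified, so the following is only a rough Python sketch of the general idea: oversample the initial centers, partition the data into that many groups (here with ordinary scikit-learn k-means rather than k-harmonic means), and greedily merge the closest groups until the target number of clusters remains. The function name and the oversampling factor are illustrative, not the paper's.

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist

def oversampled_clustering(X, n_clusters, oversample=3, random_state=0):
    """Cluster X into more groups than needed, then merge groups down to
    n_clusters by repeatedly joining the two groups with the closest centers.
    Illustrative sketch only; plain k-means is used instead of k-harmonic means."""
    m = n_clusters * oversample                       # more initial centers than clusters
    km = KMeans(n_clusters=m, n_init=10, random_state=random_state).fit(X)
    centers = km.cluster_centers_
    groups = [np.where(km.labels_ == i)[0] for i in range(m)]

    # Greedily merge the two groups whose centers are closest.
    while len(groups) > n_clusters:
        d = cdist(centers, centers)
        np.fill_diagonal(d, np.inf)
        i, j = np.unravel_index(np.argmin(d), d.shape)
        merged = np.concatenate([groups[i], groups[j]])
        centers[i] = X[merged].mean(axis=0)           # recompute the merged center
        groups[i] = merged
        centers = np.delete(centers, j, axis=0)
        groups.pop(j)

    final_labels = np.empty(len(X), dtype=int)
    for k, idx in enumerate(groups):
        final_labels[idx] = k
    return final_labels, centers
```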


2018 ◽  
Vol 11 (2) ◽  
pp. 53-67
Author(s):  
Ajay Kumar ◽  
Shishir Kumar

Several initial center selection algorithms have been proposed in the literature for numerical data, but because the values of categorical data are unordered, these methods are not applicable to categorical data sets. This article investigates the initial center selection process for categorical data and then presents a new support-based initial center selection algorithm. The proposed algorithm measures the weight of the unique values of each attribute using their support and then sums these weights along each row to obtain the support of every row. The data object with the largest support is chosen as the initial center, and further centers are selected as those at the greatest distance from the initially selected center. The quality of the proposed algorithm is compared with the random initial center selection method, Cao's method, Wu's method, and the method introduced by Khan and Ahmad. Experimental analysis on real data sets shows the effectiveness of the proposed algorithm.
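The abstract describes the support computation but not every detail (distance measure, tie breaking, whether later centers are measured against the first center only or all chosen centers), so the sketch below fills those gaps with simple choices: Hamming-style mismatch counting as the categorical distance and a farthest-first rule for the remaining centers. All names are illustrative.

```python
import numpy as np
from collections import Counter

def support_based_centers(data, n_centers):
    """Pick initial centers for categorical data from attribute-value support.
    Sketch of the idea described above, with invented tie-breaking details."""
    data = np.asarray(data, dtype=object)
    n_rows, n_cols = data.shape

    # Support (relative frequency) of every unique value in every attribute.
    col_support = []
    for j in range(n_cols):
        counts = Counter(data[:, j])
        col_support.append({v: c / n_rows for v, c in counts.items()})

    # Row support = sum of the supports of its attribute values.
    row_support = np.array([
        sum(col_support[j][data[i, j]] for j in range(n_cols))
        for i in range(n_rows)
    ])

    def mismatch(a, b):
        return sum(x != y for x, y in zip(a, b))   # simple categorical distance

    centers = [int(np.argmax(row_support))]        # most supported row first
    while len(centers) < n_centers:
        # Next center: the row farthest (by minimum distance) from the chosen centers.
        dists = [min(mismatch(data[i], data[c]) for c in centers) for i in range(n_rows)]
        centers.append(int(np.argmax(dists)))
    return data[centers]
```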


2021 ◽  
Vol 6 (2) ◽  
pp. 48
Author(s):  
Solmin Paembonan ◽  
Hisma Abduh

This study uses the k-means method, which can group similar drugs into particular clusters. One way to measure the similarity between data points is to compute the distance between them: the smaller the distance, the higher the similarity, and conversely, the larger the distance, the lower the similarity. The final goal of clustering is to identify groups in a set of unlabeled data; because clustering is an unsupervised method and there is no prior assumption about the number of clusters that may form in a data set, an evaluation of the clustering results is required. Based on the evaluation carried out on the clustering results, the silhouette coefficient obtained was 0.4854.
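As a concrete illustration of the workflow described above (k-means clustering followed by silhouette evaluation), here is a minimal scikit-learn sketch on synthetic data; the drug attributes, the choice of k, and the resulting score are placeholders, not the study's values.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical drug feature matrix: each row is a drug, each column a numeric attribute.
rng = np.random.default_rng(42)
X = rng.normal(size=(150, 4))

# Cluster into k groups and evaluate the partition with the silhouette coefficient,
# the same index used in the study (its reported value of 0.4854 came from the drug data).
k = 3
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
print("silhouette coefficient:", round(silhouette_score(X, labels), 4))
```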


2020 ◽  
Vol 11 (3) ◽  
pp. 42-67
Author(s):  
Soumeya Zerabi ◽  
Souham Meshoul ◽  
Samia Chikhi Boucherkha

Cluster validation aims both to evaluate the results of clustering algorithms and to predict the number of clusters. It is usually achieved using several indexes. Traditional internal cluster validation indexes (CVIs) are mainly based on computing pairwise distances, which gives the related algorithms quadratic complexity. Existing CVIs therefore cannot handle large data sets properly and need to be revisited to take the ever-increasing data set volume into account, which calls for parallel and distributed implementations of these indexes. To cope with this issue, the authors propose two parallel and distributed models of internal CVIs, namely the Silhouette and Dunn indexes, using the MapReduce framework under Hadoop. The proposed models, termed MR_Silhouette and MR_Dunn, have been tested both on evaluating clustering results and on identifying the optimal number of clusters. The results of the experimental study are very promising and show that the proposed parallel and distributed models accomplish the expected tasks successfully.
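The abstract does not disclose the key/value design of MR_Silhouette, so the following is only a plain-Python imitation of the map/reduce decomposition idea (no Hadoop): the map phase emits partial distance sums per (point, cluster) key for one chunk of the data, and the reduce phase aggregates them before the silhouette is assembled. The chunking scheme and all names are invented for illustration.

```python
import numpy as np
from collections import defaultdict

def map_phase(chunk, X, labels):
    """Map: for each point i, emit partial distance sums to every cluster,
    computed against one chunk of the data (key = (i, cluster))."""
    out = []
    for i in range(len(X)):
        d = np.linalg.norm(X[chunk] - X[i], axis=1)
        for c in np.unique(labels[chunk]):
            mask = labels[chunk] == c
            out.append(((i, int(c)), (d[mask].sum(), int(mask.sum()))))
    return out

def reduce_phase(pairs):
    """Reduce: aggregate the partial (sum, count) values per (point, cluster) key."""
    acc = defaultdict(lambda: [0.0, 0])
    for key, (s, n) in pairs:
        acc[key][0] += s
        acc[key][1] += n
    return acc

def mr_silhouette(X, labels, n_chunks=4):
    X, labels = np.asarray(X, float), np.asarray(labels)
    chunks = np.array_split(np.arange(len(X)), n_chunks)
    acc = reduce_phase(p for ch in chunks for p in map_phase(ch, X, labels))

    scores = []
    for i in range(len(X)):
        own = int(labels[i])
        s_own, n_own = acc[(i, own)]
        if n_own <= 1:                       # singleton cluster: silhouette 0 by convention
            scores.append(0.0)
            continue
        a = s_own / (n_own - 1)              # mean intra-cluster distance (excluding i itself)
        b = min(s / n for (j, c), (s, n) in acc.items() if j == i and c != own)
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))
```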


Author(s):  
A. Andreini ◽  
A. Bonini ◽  
G. Caciolli ◽  
B. Facchini ◽  
S. Taddei

Due to the stringent cooling requirements of novel aero-engine combustor liners, a comprehensive understanding of the interaction of hot gases with typical coolant jets plays a major role in the design of efficient cooling systems. In this work, an aerodynamic analysis of the effusion cooling system of an aero-engine combustor liner was performed, with the aim of defining a correlation for the discharge coefficient (CD) of a single effusion hole. The data were taken from a set of CFD RANS (Reynolds-averaged Navier-Stokes) simulations in which the behavior of the effusion cooling system was investigated over a wide range of thermo/fluid-dynamic conditions. In some of these tests, the influence of an additional air bleeding port on the effusion flow was taken into account, making it possible to analyze its effect on the CD of the effusion holes. An in-depth analysis of the numerical data set showed that the data can be efficiently reduced through the ratio of the annulus and hole Reynolds numbers: the dependence of the discharge coefficients on this parameter is roughly linear. The correlation was included in an in-house one-dimensional thermo/fluid network solver, and its results were compared with the CFD data. Overall good agreement of pressure and mass flow rate distributions was observed. The main source of inaccuracy was found for relevant air bleed mass flow rates, due to the inherently three-dimensional behavior of the flow close to the bleed opening. An additional comparison with experimental data was performed in order to increase confidence in the accuracy of the correlation: within the range of pressure ratios for which the correlation is defined (>1.02), this comparison showed good reliability in the prediction of discharge coefficients. An approach to modeling air bleeding was then proposed, and its impact on liner wall temperature prediction was assessed.
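The paper's correlation coefficients are not given in the abstract, so the sketch below only illustrates the stated idea: fit a linear law CD = a + b·(Re_annulus/Re_hole) to CFD-style samples and restrict its use to the quoted validity range of pressure ratios above 1.02. All numerical values are made up for the example.

```python
import numpy as np

# Hypothetical CFD samples: ratio of annulus-to-hole Reynolds numbers vs. computed CD.
# (Values are invented; the paper derives its correlation from RANS simulations.)
re_ratio = np.array([0.05, 0.10, 0.20, 0.35, 0.50])
cd_cfd   = np.array([0.78, 0.76, 0.72, 0.66, 0.60])

# The abstract reports a roughly linear dependence, so fit CD = a + b * (Re_ann / Re_hole).
b, a = np.polyfit(re_ratio, cd_cfd, 1)

def discharge_coefficient(re_annulus, re_hole, pressure_ratio):
    """Evaluate the fitted linear correlation; the paper states its correlation
    is defined only for pressure ratios above 1.02, so enforce that range here."""
    if pressure_ratio <= 1.02:
        raise ValueError("correlation defined only for pressure ratios > 1.02")
    return a + b * (re_annulus / re_hole)

print(discharge_coefficient(2.0e5, 1.0e6, 1.05))
```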


Author(s):  
Antonia J. Jones ◽  
Dafydd Evans ◽  
Steve Margetts ◽  
Peter J. Durrant

The Gamma Test is a non-linear modelling analysis tool that allows us to quantify the extent to which a numerical input/output data set can be expressed as a smooth relationship. In essence, it allows us to efficiently calculate that part of the variance of the output that cannot be accounted for by the existence of any smooth model based on the inputs, even though this model is unknown. A key aspect of this tool is its speed: the Gamma Test has time complexity O(M log M), where M is the number of data points. For data sets consisting of a few thousand points and a reasonable number of attributes, a single run of the Gamma Test typically takes a few seconds. In this chapter we show how the Gamma Test can be used in the construction of predictive models and classifiers for numerical data. In doing so, we demonstrate the use of this technique for feature selection and for the selection of the embedding dimension when dealing with a time series.
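A minimal sketch of the standard Gamma Test computation (not the authors' own code), assuming scikit-learn's nearest-neighbour search: for each k-th neighbour we average the squared input distances (delta) and half the squared output differences (gamma), and the intercept of the gamma-versus-delta regression line estimates the noise variance.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def gamma_test(X, y, p=10):
    """Estimate the noise variance of y = f(X) + noise via the Gamma Test:
    regress gamma(k) (half the mean squared output difference to the k-th
    neighbour) on delta(k) (mean squared input distance to the k-th neighbour);
    the intercept of the fitted line is the Gamma statistic."""
    X, y = np.asarray(X, float), np.asarray(y, float).ravel()
    nn = NearestNeighbors(n_neighbors=p + 1).fit(X)   # +1: the nearest point is the point itself
    dist, idx = nn.kneighbors(X)

    delta = (dist[:, 1:] ** 2).mean(axis=0)                          # delta(k), k = 1..p
    gamma = 0.5 * ((y[idx[:, 1:]] - y[:, None]) ** 2).mean(axis=0)   # gamma(k)

    slope, intercept = np.polyfit(delta, gamma, 1)
    return intercept, slope        # intercept ~ noise variance, slope ~ model complexity

# Toy usage: a smooth function plus Gaussian noise of variance 0.04.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(2000, 2))
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + rng.normal(0, 0.2, size=2000)
print(gamma_test(X, y))            # the intercept should come out near 0.04
```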


Author(s):  
Jörg-Peter Schräpler

Summary: This paper focuses on fraud detection in surveys, using Socio-Economic Panel (SOEP) data as an example for testing the new methods proposed here. A statistical theorem referred to as Benford's Law states that in many sets of numerical data the significant digits are not uniformly distributed, as one might expect, but adhere to a certain logarithmic probability function. In order to detect fraud, we derive several requirements that, according to this law, should be fulfilled in the case of survey data. We show that in several SOEP subsamples, Benford's Law holds for the available continuous data. For this analysis, we developed a measure that reflects the plausibility of the digit distribution in interviewer clusters. We are thus able to demonstrate that several interviews that were known to have been fabricated, and therefore deleted from the original user data set, can now be detected with this method. Furthermore, in one subsample, we use this method to identify a case of an interviewer falsifying ten interviews that had not previously been detected by the fieldwork organization.
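As a small illustration of the kind of digit check described above (not the paper's own plausibility measure), the sketch below compares the first-digit distribution of a batch of values against Benford's prediction P(d) = log10(1 + 1/d) with a chi-square test; the data and usage are illustrative.

```python
import numpy as np
from scipy.stats import chisquare

def benford_first_digit_test(values):
    """Compare the observed first-digit distribution of a batch of numbers
    (e.g. the continuous answers in one interviewer's cluster) with the
    distribution predicted by Benford's Law, P(d) = log10(1 + 1/d)."""
    values = np.abs(np.asarray(values, float))
    values = values[values > 0]
    first_digits = np.array([int(f"{v:.10e}"[0]) for v in values])

    observed = np.bincount(first_digits, minlength=10)[1:10]
    expected = np.log10(1 + 1 / np.arange(1, 10)) * len(first_digits)
    stat, p_value = chisquare(observed, expected)
    return stat, p_value   # a small p-value indicates deviation from Benford's Law

# Toy usage: heavy-tailed data tend to follow Benford's Law; uniform data usually do not.
rng = np.random.default_rng(1)
print(benford_first_digit_test(rng.lognormal(8, 2, 1000)))
print(benford_first_digit_test(rng.uniform(100, 999, 1000)))
```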


2013 ◽  
Vol 462-463 ◽  
pp. 438-442
Author(s):  
Ming Gu

A neural network with quadratic junctions is described, and its structure, properties, and unsupervised learning rules are discussed. An ART-based hierarchical clustering algorithm using this kind of neural network is proposed. The algorithm can determine the number of clusters as well as cluster the data. A 2-D artificial data set is used to illustrate and compare the effectiveness of the proposed algorithm and the k-means algorithm.
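The abstract gives no details of the quadratic-junction units or the hierarchy, so the following is only a generic ART-style sketch of the one property emphasized above: the number of clusters is determined by a vigilance threshold rather than fixed in advance. Parameter values and names are arbitrary.

```python
import numpy as np

def art_like_clustering(X, vigilance=1.0, lr=0.5):
    """Vigilance-threshold (ART-style) clustering: a sample joins the nearest
    prototype only if it lies within the vigilance radius, otherwise it founds
    a new cluster, so the number of clusters emerges from the data."""
    prototypes, labels = [], []
    for x in X:
        if prototypes:
            d = np.linalg.norm(np.array(prototypes) - x, axis=1)
            j = int(np.argmin(d))
            if d[j] <= vigilance:                              # resonance: accept and adapt
                prototypes[j] = (1 - lr) * prototypes[j] + lr * x
                labels.append(j)
                continue
        prototypes.append(x.astype(float))                     # mismatch: create a new category
        labels.append(len(prototypes) - 1)
    return np.array(labels), np.array(prototypes)

# 2-D artificial data with three well-separated blobs, similar in spirit to the test above.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(c, 0.3, size=(50, 2)) for c in ((0, 0), (3, 0), (0, 3))])
labels, protos = art_like_clustering(X, vigilance=1.0)
print("clusters found:", len(protos))
```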

