SELECTION OF THE NUMBER OF CLUSTERS IN K-MEAN ALGORITHM USING CLUSTER SOLUTION ENTROPY

Author(s):  
V. I. Oreshkov ◽  

The article discusses the problem of choosing the number of clusters in the popular k-means clustering algorithm. It is noted that a poor choice of this hyperparameter can produce a cluster structure whose interpretation during data mining leads to false conclusions and, in turn, to incorrect management decisions. The aim of the work is to develop a method for automatic selection of the number of clusters for the k-means algorithm. The article provides an analytical review of the known methods for determining the number of clusters, noting their advantages and disadvantages. The proposed approach is based on the elbow method, but uses the entropy of cluster solutions instead of the mean squared clustering error. A practical example shows that the use of cluster solution entropy makes it possible to choose the number of clusters even when the approach based on clustering error fails.
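As a rough illustration of the idea in this abstract, the sketch below computes an entropy value for several cluster solutions. The abstract does not define "cluster solution entropy", so the Shannon entropy of the cluster-size proportions is used here purely as an assumption; the function name `cluster_entropy` and the label vectors are hypothetical.

```python
import math
from collections import Counter


def cluster_entropy(labels):
    """Shannon entropy (natural log) of the cluster-size proportions.

    Assumption: this stands in for the paper's "cluster solution entropy",
    whose exact definition is not given in the abstract.
    """
    n = len(labels)
    counts = Counter(labels)
    return sum(-(c / n) * math.log(c / n) for c in counts.values())


# Hypothetical label vectors, as k-means might return them for k = 1, 2, 3.
solutions = {
    1: [0, 0, 0, 0, 0, 0],
    2: [0, 0, 0, 1, 1, 1],
    3: [0, 0, 1, 1, 2, 2],
}
entropy_by_k = {k: cluster_entropy(v) for k, v in solutions.items()}
```

In an elbow-style procedure, values like these would be plotted against k and the bend in the curve inspected, in place of the squared clustering error.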

2018 ◽  
Vol 3 (1) ◽  
pp. 001
Author(s):  
Zulhendra Zulhendra ◽  
Gunadi Widi Nurcahyo ◽  
Julius Santony

This study applies data mining, specifically K-Means clustering. Data mining can be used to analyze fairly large data sets; here the aim is to enable Indocomputer to understand and classify service data based on customer complaints, using the Weka software. The K-Means clustering algorithm is used to predict or classify complaints about hardware damage at Indocomputer Payakumbuh. The results also identify the laptop brands most often brought in for service at Indocomputer Payakumbuh, which can serve as a recommendation to consumers choosing a laptop.


Complexity ◽  
2018 ◽  
Vol 2018 ◽  
pp. 1-16 ◽  
Author(s):  
Yiwen Zhang ◽  
Yuanyuan Zhou ◽  
Xing Guo ◽  
Jintao Wu ◽  
Qiang He ◽  
...  

The K-means algorithm is one of the ten classic algorithms in the area of data mining and has been studied by researchers in numerous fields for a long time. However, the value of the clustering number k in the K-means algorithm is not always easy to determine, and the selection of the initial centers is vulnerable to outliers. This paper proposes an improved K-means clustering algorithm called the covering K-means algorithm (C-K-means). The C-K-means algorithm can not only acquire efficient and accurate clustering results but also self-adaptively provide a reasonable number of clusters based on the data features. It includes two phases: the initialization of the covering algorithm (CA) and the Lloyd iteration of the K-means. The first phase executes the CA. CA self-organizes and recognizes the number of clusters k based on the similarities in the data, and it requires neither the number of clusters to be prespecified nor the initial centers to be manually selected. Therefore, it has a “blind” feature, that is, k is not preselected. The second phase performs the Lloyd iteration based on the results of the first phase. The C-K-means algorithm combines the advantages of CA and K-means. Experiments are carried out on the Spark platform, and the results verify the good scalability of the C-K-means algorithm. This algorithm can effectively solve the problem of large-scale data clustering. Extensive experiments on real data sets show that the accuracy and efficiency of the C-K-means algorithm outperform those of the existing algorithms under both sequential and parallel conditions.


2020 ◽  
Vol 4 (157) ◽  
pp. 7-11
Author(s):  
V. Zhvan ◽  
V. Donenko ◽  
S. Kulish ◽  
A. Taran

The article is devoted to the effective analysis of trench and trenchless pipeline laying technologies. In the course of the work, an analytical review of pipeline assembly was performed, and the main technological parameters, the scope of each method, and their advantages and disadvantages were determined. The article analyzes the classical trench method and the most widely used trenchless ones: horizontal directional drilling, mechanical puncture, hydraulic puncture, microtunneling and punching. The choice of the optimal method of laying the pipeline depends on many factors: the physical and mechanical properties of soils and hydrogeological conditions, the length and diameter of the pipeline, the presence of other communications, buildings and structures, as well as the customer's budget. Work time is the last deciding factor. Based on the analysis of pipeline laying technologies and a survey of construction industry experts, a cost table was compiled for each method, outlining the main characteristics of the technology: length of pipeline, speed of work, scope, cost, and the advantages and disadvantages of each method. Conclusions were drawn about the use of each pipeline laying method. Each method has its advantages and disadvantages, so choosing a method requires a comprehensive assessment of technological parameters, cost, scope and timing of work.
The cost of laying the pipeline consists of the following factors: conducting research; selection of diameter and determination of pipeline length; choice of laying method and equipment necessary for the works; selection of equipment, shut-off and control equipment and other materials arranged on the pipeline; terms of performance of works. Taking these factors into account, an estimate is made, which determines the cost of installing a particular pipeline. After the analysis, we can conclude that, among the trenchless pipeline laying methods, horizontal directional drilling stands out; this method will be appropriate to use for our region. The drilling technique makes it possible to lay pipelines under obstacles, to pull long segments of networks, and to repair site damage. This method is universal and can be used in almost any environment. Keywords: trenches, horizontal directional drilling, mechanical puncture, hydraulic piercing, microtunnelling, punching, pipeline.


2014 ◽  
Vol 926-930 ◽  
pp. 3608-3611 ◽  
Author(s):  
Yi Fan Zhang ◽  
Yong Tao Qian ◽  
Tai Yu Liu ◽  
Shu Yan Wu

This paper first introduces data mining and then focuses on cluster analysis algorithms, including the classification of clustering algorithms and the typical algorithms in each class, with a formal description of each algorithm and a more detailed account of its advantages and disadvantages. It then introduces data mining algorithms based on cluster analysis. A cohesion-based clustering algorithm and the DBSCAN algorithm are applied to consumer spending data in two-dimensional space, with 2,000 data points for each area. They yield reasonable clustering results, and the hierarchical clustering results provide valuable information, so that the practical application of the algorithms is combined with cluster analysis theory.


2019 ◽  
Vol 8 (4) ◽  
pp. 6036-6040

Data mining is one of the most vital areas of research and is used in practice across many different domains. It has become a highly demanding field because huge amounts of data have been collected in various applications. A database can be clustered in many ways depending on the clustering algorithm used, parameter settings and other factors. Multiple clustering algorithms can be combined to get the final partitioning of data, which provides better clustering results. In this paper, an ensemble hybrid KMeans and DBSCAN (HDKA) algorithm is proposed to overcome the drawbacks of the DBSCAN and KMeans clustering algorithms. The proposed algorithm improves the selection of centroid points through its centroid selection strategy. For the experimental results we used two data sets, Colon and Leukemia, from the UCI machine learning repository.


Author(s):  
Md. Zakir Hossain ◽  
Md.Nasim Akhtar ◽  
R.B. Ahmad ◽  
Mostafijur Rahman

Data mining is the process of finding the structure of data in large data sets. With this process, decision makers can make particular decisions for the further development of real-world problems. Several data clustering techniques are used in data mining for finding a specific pattern in data. The K-means method is one of the familiar clustering techniques for clustering large data sets. The K-means clustering method partitions the data set based on the assumption that the number of clusters is fixed. The main problem of this method is that if the number of clusters is chosen to be small, there is a higher probability of adding dissimilar items to the same group. On the other hand, if the number of clusters is chosen to be high, there is a higher chance of putting similar items in different groups. In this paper, we address this issue by proposing a new K-Means clustering algorithm. The proposed method performs data clustering dynamically. It initially calculates a threshold value as a centroid of K-Means, and the number of clusters is formed based on this value. At each iteration of K-Means, if the Euclidean distance between two points is less than or equal to the threshold value, then these two data points will be in the same group. Otherwise, the proposed method will create a new cluster with the dissimilar data point. The results show that the proposed method outperforms the original K-Means method.
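The threshold rule described in this abstract can be sketched roughly as follows. This is a single-pass simplification under stated assumptions, not the authors' algorithm: the abstract does not say how the threshold is calculated, so it is passed in as a parameter, and each point is compared with the running centroid of each existing cluster. The name `threshold_clustering` is hypothetical.

```python
import math


def threshold_clustering(points, threshold):
    """Join the nearest existing cluster if its centroid lies within
    `threshold`; otherwise start a new cluster with the point.

    Assumption: a single-pass simplification for illustration only.
    """
    clusters = []   # list of lists of points
    centroids = []  # running mean of each cluster
    for p in points:
        best, best_dist = None, None
        for i, c in enumerate(centroids):
            d = math.dist(p, c)  # Euclidean distance
            if best_dist is None or d < best_dist:
                best, best_dist = i, d
        if best is not None and best_dist <= threshold:
            clusters[best].append(p)
            members = clusters[best]
            # Recompute the centroid as the coordinate-wise mean.
            centroids[best] = tuple(sum(x) / len(members) for x in zip(*members))
        else:
            clusters.append([p])
            centroids.append(tuple(p))
    return clusters


groups = threshold_clustering(
    [(0.0, 0.0), (0.5, 0.0), (10.0, 10.0), (10.2, 9.9)], threshold=1.0
)
```

With this kind of rule the number of clusters follows from the data and the threshold instead of being fixed in advance, at the cost of making the result depend on the order of the points.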


2017 ◽  
Vol 13 (2) ◽  
pp. 1-12 ◽  
Author(s):  
Jungmok Ma

One of the major obstacles in the application of the k-means clustering algorithm is the selection of the number of clusters k. The multi-attribute utility theory (MAUT)-based k-means clustering algorithm is proposed to tackle the problem by incorporating user preferences. Using MAUT, the decision maker's value structure for the number of clusters and other attributes can be quantitatively modeled, and it can be used as an objective function of the k-means. A target clustering problem for a military targeting process is used to demonstrate the MAUT-based k-means and provide a comparative study. The result shows that the existing clustering algorithms do not necessarily reflect user preferences, while the MAUT-based k-means provides a systematic framework for preference modeling in cluster analysis.


Author(s):  
Bambang Riyanto

The health office is in charge of instructing and registering diarrhea sufferers in each region; the regions are then evaluated to determine which are most affected by diarrhea. Direct field checks revealed that the most basic cause was an unclean environment, such as trenches clogged with garbage that cause floods during the rainy season. The health office also encourages the community to always maintain environmental cleanliness and to make a habit of washing hands with soap before eating and after cleaning; simple measures like these are expected to help reduce the number of diarrhea sufferers in the city of Medan. K-Medoids Clustering is a clustering algorithm similar to K-Means. The difference between the two is that the K-Medoids or PAM algorithm uses an object (medoid) as the representative center of each cluster, while K-Means uses the mean as the center of the cluster. Keywords: Diarrhea, Service office, Data mining, K-Medoids Algorithm
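The difference between the two kinds of cluster center can be shown with a small sketch. The helper names `mean_center` and `medoid` and the sample points are hypothetical, and Euclidean distance is assumed for the medoid; the abstract does not state which dissimilarity the study uses.

```python
import math


def mean_center(points):
    """Coordinate-wise mean, the cluster center used by K-Means."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))


def medoid(points):
    """Actual data point with the smallest total distance to the others,
    the cluster center used by K-Medoids (PAM)."""
    return min(points, key=lambda p: sum(math.dist(p, q) for q in points))


# One cluster with an outlying point at x = 10.
cluster = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)]
center_mean = mean_center(cluster)
center_medoid = medoid(cluster)
```

The mean is pulled toward the outlying point and is not itself a member of the data, whereas the medoid is always one of the observed objects.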


2017 ◽  
Vol 13 (8) ◽  
pp. 155014771772862 ◽  
Author(s):  
Jianpeng Qi ◽  
Yanwei Yu ◽  
Lihong Wang ◽  
Jinglei Liu ◽  
Yingjie Wang

K-means plays an important role in different fields of data mining. However, k-means is often sensitive to its random seed selection. Motivated by this, this article proposes an optimized k-means clustering method, named k*-means, along with three optimization principles. First, we propose a hierarchical optimization principle initialized by k* seeds ([Formula: see text]) to reduce the risk of random seed selection, and then use the proposed “top-n nearest clusters merging” to merge the nearest clusters in each round until the number of clusters reaches [Formula: see text]. Second, we propose an “optimized update principle” that updates incrementally from the moved points instead of recalculating the mean and [Formula: see text] of each cluster in every k-means iteration, to minimize computation cost. Third, we propose a “cluster pruning strategy” to improve the efficiency of k-means. This strategy omits the farther clusters to shrink the adjustable space in each iteration. Experiments performed on real UCI and synthetic datasets verify the efficiency and effectiveness of our proposed algorithm.
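The idea of updating a cluster mean from moved points, instead of recalculating it from all members, can be illustrated with the standard running-mean identity. The abstract does not give the authors' update rule, so the sketch below is an assumption and covers only the mean; the function names `add_point` and `remove_point` are hypothetical.

```python
def add_point(mean, count, x):
    """Return (mean, count) of a cluster after point x joins it."""
    new_count = count + 1
    new_mean = tuple(m + (xi - m) / new_count for m, xi in zip(mean, x))
    return new_mean, new_count


def remove_point(mean, count, x):
    """Return (mean, count) of a cluster after point x leaves it.

    The cluster must keep at least one point (count > 1).
    """
    new_count = count - 1
    new_mean = tuple(m - (xi - m) / new_count for m, xi in zip(mean, x))
    return new_mean, new_count


# Cluster A holds (0,0), (2,0), (4,0); cluster B holds (10,0).
mean_a, count_a = (2.0, 0.0), 3
mean_b, count_b = (10.0, 0.0), 1

# Move the point (4,0) from A to B without touching the other members.
moved = (4.0, 0.0)
mean_a, count_a = remove_point(mean_a, count_a, moved)
mean_b, count_b = add_point(mean_b, count_b, moved)
```

Each move costs time proportional to the number of dimensions, independent of cluster size, which is where the saving over a full recomputation would come from.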


2015 ◽  
Vol 4 (2) ◽  
pp. 231 ◽  
Author(s):  
Omar Kettani ◽  
Faical Ramdani ◽  
Benaissa Tadili

In data mining, K-means is a simple and fast algorithm for solving clustering problems, but it requires the user to provide in advance the exact number of clusters (k), which is often not obvious. This paper therefore aims to overcome this problem by proposing a parameter-free algorithm for automatic clustering. It is based on successive, appropriate restarts of the K-means algorithm. Experiments conducted on several standard data sets demonstrate that the proposed approach is effective and outperforms the related well-known algorithm G-means in terms of clustering accuracy and estimation of the correct number of clusters.

