Efficient Data Clustering using Fast Choice for Number of Clusters

Author(s):  
Ahmed Fahim ◽  

The k-means algorithm is the most well-known algorithm for data clustering in data mining. Its most important advantages are its simplicity and its fast convergence to a local minimum, in addition to its linear time complexity. Its most important open problems are the selection of the initial centers and the need to determine the exact number of clusters in advance. This paper proposes a solution to both problems together by adding a preprocessing step that estimates the expected number of clusters in the data and yields better initial centers. There is much research that addresses each of these problems separately, but none that addresses both together. The preprocessing step requires O(n log n) time, where n is the size of the dataset. It obtains an initial partitioning of the data without the number of clusters being determined in advance and then computes the means of the initial clusters. K-means is then applied to the original data, using the information produced by the preprocessing step, to obtain the final clusters. The proposed method is tested on many benchmark datasets, and the experimental results show its efficiency.
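The abstract does not spell out the preprocessing rule, so the sketch below only illustrates the overall pipeline it describes: derive candidate clusters and their means in a cheap preprocessing pass, then seed standard k-means with the number of clusters and the centers found there. The sorted-projection split used here for the preprocessing pass is a placeholder assumption, not the authors' method.

    import numpy as np
    from sklearn.cluster import KMeans

    def preprocess_estimate(X, gap_factor=2.0):
        """Placeholder preprocessing pass (an assumption, not the paper's rule):
        project the data onto its direction of largest variance, sort the
        projections (O(n log n)), and split wherever a gap is much larger than
        the median gap. Each piece becomes an initial cluster whose mean is kept."""
        Xc = X - X.mean(axis=0)
        _, _, vt = np.linalg.svd(Xc, full_matrices=False)
        proj = Xc @ vt[0]                                # first principal direction
        order = np.argsort(proj)
        gaps = np.diff(proj[order])
        cut = gaps > gap_factor * np.median(gaps)        # unusually large gaps
        labels = np.zeros(len(X), dtype=int)
        labels[order[1:]] = np.cumsum(cut)               # piece index along the sorted axis
        return np.array([X[labels == c].mean(axis=0) for c in np.unique(labels)])

    def cluster(X):
        centers = preprocess_estimate(X)                 # expected k and initial centers
        km = KMeans(n_clusters=len(centers), init=centers, n_init=1).fit(X)
        return km.labels_, km.cluster_centers_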


2012 ◽  
Vol 3 (1) ◽  
pp. 1-20
Author(s):  
Amit Banerjee

In this paper, a multi-objective genetic algorithm for data clustering based on the robust fuzzy least trimmed squares estimator is presented. The proposed clustering methodology addresses two critical issues in unsupervised data clustering: the ability to produce meaningful partitions in noisy data, and the requirement that the number of clusters be known a priori. The multi-objective genetic algorithm-driven clustering technique optimizes the number of clusters as well as the cluster assignments and cluster prototypes. A two-parameter, mapped, fixed-point coding scheme is used to represent the assignment of data into the true retained set and the noisy trimmed set, as well as the optimal number of clusters in the retained set. A three-objective criterion is used as the minimization functional for the multi-objective genetic algorithm. Results on well-known datasets from the literature suggest that the proposed methodology is superior to conventional fuzzy clustering algorithms that assume a known value for the optimal number of clusters.
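The exact three-objective functional and coding scheme are not given in the abstract. The fragment below only sketches the general idea of a fuzzy least trimmed squares style objective, as an illustrative assumption: fuzzy memberships are computed as in fuzzy c-means, each point is charged its membership-weighted squared error, and the fraction of points with the largest cost is trimmed so that noisy points do not influence the evaluation.

    import numpy as np

    def trimmed_fuzzy_objective(X, centers, m=2.0, trim_frac=0.1):
        """Illustrative fuzzy least-trimmed-squares style objective (an assumption,
        not the paper's exact functional)."""
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)   # (n, k) squared distances
        d2 = np.maximum(d2, 1e-12)
        u = 1.0 / (d2 ** (1.0 / (m - 1.0)))
        u /= u.sum(axis=1, keepdims=True)                               # fuzzy c-means memberships
        point_cost = ((u ** m) * d2).sum(axis=1)                        # per-point fuzzy error
        h = int(np.ceil((1.0 - trim_frac) * len(X)))                    # size of the retained set
        retained = np.argsort(point_cost)[:h]                           # drop the worst points
        return point_cost[retained].sum(), retained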


Algorithms ◽  
2018 ◽  
Vol 11 (11) ◽  
pp. 177 ◽  
Author(s):  
Xuedong Gao ◽  
Minghan Yang

Clustering is one of the main tasks of machine learning. Internal clustering validation indexes (CVIs) measure the quality of several clustered partitions in order to determine the locally optimal clustering result in an unsupervised manner, and they can act as the objective function of clustering algorithms. In this paper, we first studied several well-known internal CVIs for categorical data clustering and showed that they cannot effectively evaluate partitions with different numbers of clusters unless some inter-cluster separation measure or assumption is included; the accuracy of the separation measure, along with its coordination with the intra-cluster compactness measure, notably affects performance. Then, aiming to enhance internal clustering validation, we proposed a new internal CVI, the clustering utility based on the averaged information gain of isolating each cluster (CUBAGE), which measures both the compactness and the separation of a partition. The experimental results supported our findings regarding the existing internal CVIs and showed that CUBAGE outperforms other internal CVIs whether or not the number of clusters is known in advance.
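The abstract names the new index but does not give its formula, so the sketch below is only a rough illustration of the underlying idea of scoring a categorical partition by the information gained when each cluster is isolated from the rest; the actual CUBAGE definition will differ. Both helper names here are hypothetical.

    import numpy as np

    def attribute_entropy(X):
        """Mean Shannon entropy over the categorical attributes (columns) of X."""
        ents = []
        for j in range(X.shape[1]):
            _, counts = np.unique(X[:, j], return_counts=True)
            p = counts / counts.sum()
            ents.append(-(p * np.log2(p)).sum())
        return float(np.mean(ents))

    def isolation_gain_index(X, labels):
        """Illustrative validity score for categorical data (an assumption, not the
        CUBAGE formula): average, over clusters, of the entropy reduction obtained
        when that cluster is isolated from the rest of the data."""
        base = attribute_entropy(X)
        gains = []
        for c in np.unique(labels):
            inside, outside = X[labels == c], X[labels != c]
            cond = (len(inside) / len(X)) * attribute_entropy(inside)
            if len(outside) > 0:
                cond += (len(outside) / len(X)) * attribute_entropy(outside)
            gains.append(base - cond)                 # information gained by the split
        return float(np.mean(gains))                  # higher = more compact and separated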


2016 ◽  
Vol 25 (4) ◽  
pp. 595-610 ◽  
Author(s):  
Vijay Kumar ◽  
Jitender Kumar Chhabra ◽  
Dinesh Kumar

In this paper, the problem of automatic data clustering is treated as a search for the optimal number of clusters such that the obtained partitions are optimized. The automatic data clustering technique uses a recently developed parameter adaptive harmony search (PAHS) as its underlying optimization strategy. It uses a real-coded, variable-length harmony vector, which is able to detect the number of clusters automatically. Newly developed concepts of "threshold setting" and "cutoff" are used to refine the optimization strategy. Data points are assigned to cluster centers using a newly developed weighted Euclidean distance instead of the ordinary Euclidean distance. The developed approach is able to detect clusters of any geometric shape. It is compared with four well-established clustering techniques and is further applied to the automatic segmentation of grayscale and color images, where its performance is compared with other existing techniques. Statistical analysis is performed for the real-life datasets. The results demonstrate the effectiveness and usefulness of the technique.
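The abstract states that assignment uses a weighted Euclidean distance but does not define the weights. The fragment below only illustrates assignment under a generic per-feature weighted Euclidean distance, with the weight vector left as an input; how the paper derives its weights is not specified here.

    import numpy as np

    def weighted_euclidean_assign(X, centers, weights):
        """Assign each point to the nearest center under the per-feature weighted
        Euclidean distance d(x, c) = sqrt(sum_j w_j * (x_j - c_j)**2).
        The weight vector is a free input in this sketch."""
        diff = X[:, None, :] - centers[None, :, :]         # (n, k, d) differences
        d = np.sqrt((weights[None, None, :] * diff ** 2).sum(axis=2))
        return d.argmin(axis=1)                            # index of the closest center

    # With weights = np.ones(X.shape[1]) this reduces to ordinary Euclidean assignment.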


2011 ◽  
Vol 14 (3) ◽  
Author(s):  
Thelma Elita Colanzi ◽  
Wesley Klewerton Guez Assunção ◽  
Aurora Trinidad Ramirez Pozo ◽  
Ana Cristina B. Kochem Vendramin ◽  
Diogo Augusto Barros Pereira ◽  
...  

Clustering analysis comprises a number of different algorithms and methods for grouping objects into categories according to their similar characteristics. In recent years, considerable effort has been made to improve the performance of such algorithms. In this sense, this paper explores three different bio-inspired metaheuristics for the clustering problem: Genetic Algorithms (GAs), Ant Colony Optimization (ACO), and Artificial Immune Systems (AIS). The paper proposes several refinements to these metaheuristics in order to improve their performance on the data clustering problem. The performance of the proposed algorithms is compared on five different numeric UCI databases. The results show that the GA-, ACO-, and AIS-based algorithms are able to efficiently and automatically form natural groups given a pre-defined number of clusters.
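The paper's specific refinements to the GA, ACO, and AIS variants are not described in the abstract. The sketch below only illustrates the common baseline such metaheuristics share for this problem: a candidate solution encoding k cluster centroids, scored by the within-cluster sum of squared errors, and improved over generations. The encoding, fitness, and update rule are generic assumptions for illustration, not the paper's algorithms.

    import numpy as np

    def random_candidate(X, k, rng):
        """A candidate solution: k centroids sampled from the data (generic encoding)."""
        return X[rng.choice(len(X), size=k, replace=False)].copy()

    def fitness(X, centroids):
        """Within-cluster sum of squared errors (lower is better)."""
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        return d2.min(axis=1).sum()

    def evolve(X, k, pop_size=30, generations=100, seed=0):
        """Minimal evolutionary loop standing in for any of the metaheuristics:
        keep the best half, regenerate the rest by Gaussian perturbation."""
        rng = np.random.default_rng(seed)
        pop = [random_candidate(X, k, rng) for _ in range(pop_size)]
        sigma = 0.1 * X.std(axis=0)
        for _ in range(generations):
            pop.sort(key=lambda c: fitness(X, c))
            survivors = pop[: pop_size // 2]
            offspring = [s + rng.normal(0.0, sigma, s.shape) for s in survivors]
            pop = survivors + offspring
        return min(pop, key=lambda c: fitness(X, c))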


IEEE Access ◽  
2019 ◽  
Vol 7 ◽  
pp. 50347-50361 ◽  
Author(s):  
Neha Bharill ◽  
Om Prakash Patel ◽  
Aruna Tiwari ◽  
Lifeng Mu ◽  
Dong-Lin Li ◽  
...  
