Efficient Data Clustering using Fast Choice for Number of Clusters

The k-means is the most well-known algorithm for data clustering in data mining. Its simplicity and speed of convergence to local minima are the most important advantages of it, in addition to its linear time complexity. The most important open problems in this algorithm are the selection of initial centers and the determination of the exact number of clusters in advance. This paper proposes a solution for these two problems together; by adding a preprocess step to get the expected number of clusters in data and better initial centers. There are many researches to solve each of these problems separately, but there is no research to solve both problems together. The preprocess step requires o(n log n); where n is size of the dataset. This preprocess step aims to get initial portioning of data without determining the number of clusters in advance, then computes the means of initial clusters. After that we apply k-means on original data using the resulting information from the preprocess step to get the final clusters. We use many benchmark datasets to test the proposed method. The experimental results show the efficiency of the proposed method.

Download Full-text

Multi-Objective Genetic Algorithm for Robust Clustering with Unknown Number of Clusters

International Journal of Applied Evolutionary Computation ◽

10.4018/jaec.2012010101 ◽

2012 ◽

Vol 3 (1) ◽

pp. 1-20

Author(s):

Amit Banerjee

Keyword(s):

Genetic Algorithm ◽

Data Clustering ◽

Optimal Number ◽

Least Trimmed Squares ◽

Cluster Assignment ◽

Objective Criterion ◽

Number Of Clusters ◽

Multi Objective ◽

Multi Objective Genetic Algorithm ◽

Optimal Number Of Clusters

In this paper, a multi-objective genetic algorithm for data clustering based on the robust fuzzy least trimmed squares estimator is presented. The proposed clustering methodology addresses two critical issues in unsupervised data clustering – the ability to produce meaningful partition in noisy data, and the requirement that the number of clusters be known a priori. The multi-objective genetic algorithm-driven clustering technique optimizes the number of clusters as well as cluster assignment, and cluster prototypes. A two-parameter, mapped, fixed point coding scheme is used to represent assignment of data into the true retained set and the noisy trimmed set, and the optimal number of clusters in the retained set. A three-objective criterion is also used as the minimization functional for the multi-objective genetic algorithm. Results on well-known data sets from literature suggest that the proposed methodology is superior to conventional fuzzy clustering algorithms that assume a known value for optimal number of clusters.

Download Full-text

Understanding and Enhancement of Internal Clustering Validation Indexes for Categorical Data

Algorithms ◽

10.3390/a11110177 ◽

2018 ◽

Vol 11 (11) ◽

pp. 177 ◽

Cited By ~ 2

Author(s):

Xuedong Gao ◽

Minghan Yang

Keyword(s):

Machine Learning ◽

Categorical Data ◽

Data Clustering ◽

Information Gain ◽

Clustering Algorithms ◽

Number Of Clusters ◽

Cluster Compactness ◽

Clustering Validation ◽

Categorical Data Clustering

Clustering is one of the main tasks of machine learning. Internal clustering validation indexes (CVIs) are used to measure the quality of several clustered partitions to determine the local optimal clustering results in an unsupervised manner, and can act as the objective function of clustering algorithms. In this paper, we first studied several well-known internal CVIs for categorical data clustering, and proved the ineffectiveness of evaluating the partitions of different numbers of clusters without any inter-cluster separation measures or assumptions; the accurateness of separation, along with its coordination with the intra-cluster compactness measures, can notably affect performance. Then, aiming to enhance the internal clustering validation measurement, we proposed a new internal CVI—clustering utility based on the averaged information gain of isolating each cluster (CUBAGE)—which measures both the compactness and the separation of the partition. The experimental results supported our findings with regard to the existing internal CVIs, and showed that the proposed CUBAGE outperforms other internal CVIs with or without a pre-known number of clusters.

Download Full-text

A novel hybrid knowledge of firefly and pso swarm intelligence algorithms for efficient data clustering

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-17170 ◽

2017 ◽

Vol 33 (6) ◽

pp. 3529-3538 ◽

Cited By ~ 4

Author(s):

Malihe Danesh ◽

Hossein Shirgahi

Keyword(s):

Swarm Intelligence ◽

Data Clustering ◽

Hybrid Knowledge ◽

Efficient Data

Download Full-text

Automatic Data Clustering Using Parameter Adaptive Harmony Search Algorithm and Its Application to Image Segmentation

Journal of Intelligent Systems ◽

10.1515/jisys-2015-0004 ◽

2016 ◽

Vol 25 (4) ◽

pp. 595-610 ◽

Cited By ~ 8

Author(s):

Vijay Kumar ◽

Jitender Kumar Chhabra ◽

Dinesh Kumar

Keyword(s):

Data Clustering ◽

Euclidean Distance ◽

Search Algorithm ◽

Harmony Search ◽

Real Life ◽

Optimal Number ◽

Optimization Strategy ◽

Number Of Clusters ◽

Automatic Data ◽

Parameter Adaptive

AbstractIn this paper, the problem of automatic data clustering is treated as the searching of optimal number of clusters so that the obtained partitions should be optimized. The automatic data clustering technique utilizes a recently developed parameter adaptive harmony search (PAHS) as an underlying optimization strategy. It uses real-coded variable length harmony vector, which is able to detect the number of clusters automatically. The newly developed concepts regarding “threshold setting” and “cutoff” are used to refine the optimization strategy. The assignment of data points to different cluster centers is done based on the newly developed weighted Euclidean distance instead of Euclidean distance. The developed approach is able to detect any type of cluster irrespective of their geometric shape. It is compared with four well-established clustering techniques. It is further applied for automatic segmentation of grayscale and color images, and its performance is compared with other existing techniques. For real-life datasets, statistical analysis is done. The technique shows its effectiveness and the usefulness.

Download Full-text

Application of Bio-inspired Metaheuristics in the Data Clustering Problem

CLEI electronic journal ◽

10.19153/cleiej.14.3.5 ◽

2011 ◽

Vol 14 (3) ◽

Cited By ~ 2

Author(s):

Thelma Elita Colanzi ◽

Wesley Klewerton Guez Assunção ◽

Aurora Trinidad Ramirez Pozo ◽

Ana Cristina B. Kochem Vendramin ◽

Diogo Augusto Barros Pereira ◽

...

Keyword(s):

Genetic Algorithms ◽

Clustering Analysis ◽

Data Clustering ◽

Artificial Immune Systems ◽

Artificial Immune ◽

Considerable Effort ◽

Number Of Clusters ◽

Immune Systems ◽

Clustering Problem ◽

Natural Groups

Clustering analysis includes a number of different algorithms and methods for grouping objects by their similar characteristics into categories. In recent years, considerable effort has been made to improve such algorithms performance. In this sense, this paper explores three different bio-inspired metaheuristics in the clustering problem: Genetic Algorithms (GAs), Ant Colony Optimization (ACO), and Artificial Immune Systems (AIS). This paper proposes some refinements to be applied to these metaheuristics in order to improve their performance in the data clustering problem. The performance of the proposed algorithms is compared on five different numeric UCI databases. The results show that GA, ACO and AIS based algorithms are able to efficiently and automatically forming natural groups from a pre-defined number of clusters.

Download Full-text