CLUSTERING USING SIMULATED ANNEALING WITH PROBABILISTIC REDISTRIBUTION

Author(s):  
SANGHAMITRA BANDYOPADHYAY ◽  
UJJWAL MAULIK ◽  
MALAY KUMAR PAKHIRA

An efficient partitional clustering technique, called SAKM-clustering, that integrates the power of simulated annealing for obtaining minimum energy configuration, and the searching capability of K-means algorithm is proposed in this article. The clustering methodology is used to search for appropriate clusters in multidimensional feature space such that a similarity metric of the resulting clusters is optimized. Data points are redistributed among the clusters probabilistically, so that points that are farther away from the cluster center have higher probabilities of migrating to other clusters than those which are closer to it. The superiority of the SAKM-clustering algorithm over the widely used K-means algorithm is extensively demonstrated for artificial and real life data sets.

Author(s):  
M. EMRE CELEBI ◽  
HASSAN A. KINGRAVI

K-means is undoubtedly the most widely used partitional clustering algorithm. Unfortunately, due to its gradient descent nature, this algorithm is highly sensitive to the initial placement of the cluster centers. Numerous initialization methods have been proposed to address this problem. Many of these methods, however, have superlinear complexity in the number of data points, making them impractical for large data sets. On the other hand, linear methods are often random and/or order-sensitive, which renders their results unrepeatable. Recently, Su and Dy proposed two highly successful hierarchical initialization methods named Var-Part and PCA-Part that are not only linear, but also deterministic (nonrandom) and order-invariant. In this paper, we propose a discriminant analysis based approach that addresses a common deficiency of these two methods. Experiments on a large and diverse collection of data sets from the UCI machine learning repository demonstrate that Var-Part and PCA-Part are highly competitive with one of the best random initialization methods to date, i.e. k-means++, and that the proposed approach significantly improves the performance of both hierarchical methods.


2013 ◽  
Vol 3 (4) ◽  
pp. 1-14 ◽  
Author(s):  
S. Sampath ◽  
B. Ramya

Cluster analysis is a branch of data mining, which plays a vital role in bringing out hidden information in databases. Clustering algorithms help medical researchers in identifying the presence of natural subgroups in a data set. Different types of clustering algorithms are available in the literature. The most popular among them is k-means clustering. Even though k-means clustering is a popular clustering method widely used, its application requires the knowledge of the number of clusters present in the given data set. Several solutions are available in literature to overcome this limitation. The k-means clustering method creates a disjoint and exhaustive partition of the data set. However, in some situations one can come across objects that belong to more than one cluster. In this paper, a clustering algorithm capable of producing rough clusters automatically without requiring the user to give as input the number of clusters to be produced. The efficiency of the algorithm in detecting the number of clusters present in the data set has been studied with the help of some real life data sets. Further, a nonparametric statistical analysis on the results of the experimental study has been carried out in order to analyze the efficiency of the proposed algorithm in automatic detection of the number of clusters in the data set with the help of rough version of Davies-Bouldin index.


2017 ◽  
Vol 26 (1) ◽  
pp. 153-168 ◽  
Author(s):  
Vijay Kumar ◽  
Jitender Kumar Chhabra ◽  
Dinesh Kumar

AbstractThe main problem of classical clustering technique is that it is easily trapped in the local optima. An attempt has been made to solve this problem by proposing the grey wolf algorithm (GWA)-based clustering technique, called GWA clustering (GWAC), through this paper. The search capability of GWA is used to search the optimal cluster centers in the given feature space. The agent representation is used to encode the centers of clusters. The proposed GWAC technique is tested on both artificial and real-life data sets and compared to six well-known metaheuristic-based clustering techniques. The computational results are encouraging and demonstrate that GWAC provides better values in terms of precision, recall, G-measure, and intracluster distances. GWAC is further applied for gene expression data set and its performance is compared to other techniques. Experimental results reveal the efficiency of the GWAC over other techniques.


Author(s):  
Eric U.O. ◽  
Michael O.O. ◽  
Oberhiri-Orumah G. ◽  
Chike H. N.

Cluster analysis is an unsupervised learning method that classifies data points, usually multidimensional into groups (called clusters) such that members of one cluster are more similar (in some sense) to each other than those in other clusters. In this paper, we propose a new k-means clustering method that uses Minkowski’s distance as its metric in a normed vector space which is the generalization of both the Euclidean distance and the Manhattan distance. The k-means clustering methods discussed in this paper are Forgy’s method, Lloyd’s method, MacQueen’s method, Hartigan and Wong’s method, Likas’ method and Faber’s method which uses the usual Euclidean distance. It was observed that the new k-means clustering method performed favourably in comparison with the existing methods in terms of minimization of the total intra-cluster variance using simulated data and real-life data sets.


2015 ◽  
Vol 2015 ◽  
pp. 1-8 ◽  
Author(s):  
Hong Peng ◽  
Xiaohui Luo ◽  
Zhisheng Gao ◽  
Jun Wang ◽  
Zheng Pei

P systems are a class of distributed parallel computing models; this paper presents a novel clustering algorithm, which is inspired from mechanism of a tissue-like P system with a loop structure of cells, called membrane clustering algorithm. The objects of the cells express the candidate centers of clusters and are evolved by the evolution rules. Based on the loop membrane structure, the communication rules realize a local neighborhood topology, which helps the coevolution of the objects and improves the diversity of objects in the system. The tissue-like P system can effectively search for the optimal partitioning with the help of its parallel computing advantage. The proposed clustering algorithm is evaluated on four artificial data sets and six real-life data sets. Experimental results show that the proposed clustering algorithm is superior or competitive tok-means algorithm and several evolutionary clustering algorithms recently reported in the literature.


2015 ◽  
Vol 2015 ◽  
pp. 1-12 ◽  
Author(s):  
Mingzhi Ma ◽  
Qifang Luo ◽  
Yongquan Zhou ◽  
Xin Chen ◽  
Liangliang Li

Animal migration optimization (AMO) is one of the most recently introduced algorithms based on the behavior of animal swarm migration. This paper presents an improved AMO algorithm (IAMO), which significantly improves the original AMO in solving complex optimization problems. Clustering is a popular data analysis and data mining technique and it is used in many fields. The well-known method in solving clustering problems isk-means clustering algorithm; however, it highly depends on the initial solution and is easy to fall into local optimum. To improve the defects of thek-means method, this paper used IAMO for the clustering problem and experiment on synthetic and real life data sets. The simulation results show that the algorithm has a better performance than that of thek-means, PSO, CPSO, ABC, CABC, and AMO algorithm for solving the clustering problem.


2016 ◽  
Vol 16 (6) ◽  
pp. 27-42 ◽  
Author(s):  
Minghan Yang ◽  
Xuedong Gao ◽  
Ling Li

Abstract Although Clustering Algorithm Based on Sparse Feature Vector (CABOSFV) and its related algorithms are efficient for high dimensional sparse data clustering, there exist several imperfections. Such imperfections as subjective parameter designation and order sensibility of clustering process would eventually aggravate the time complexity and quality of the algorithm. This paper proposes a parameter adjustment method of Bidirectional CABOSFV for optimization purpose. By optimizing Parameter Vector (PV) and Parameter Selection Vector (PSV) with the objective function of clustering validity, an improved Bidirectional CABOSFV algorithm using simulated annealing is proposed, which circumvents the requirement of initial parameter determination. The experiments on UCI data sets show that the proposed algorithm, which can perform multi-adjustment clustering, has a higher accurateness than single adjustment clustering, along with a decreased time complexity through iterations.


2011 ◽  
Vol 268-270 ◽  
pp. 811-816
Author(s):  
Yong Zhou ◽  
Yan Xing

Affinity Propagation(AP)is a new clustering algorithm, which is based on the similarity matrix between pairs of data points and messages are exchanged between data points until clustering result emerges. It is efficient and fast , and it can solve the clustering on large data sets. But the traditional Affinity Propagation has many limitations, this paper introduces the Affinity Propagation, and analyzes in depth the advantages and limitations of it, focuses on the improvements of the algorithm — improve the similarity matrix, adjust the preference and the damping-factor, combine with other algorithms. Finally, discusses the development of Affinity Propagation.


2008 ◽  
pp. 1231-1249
Author(s):  
Jaehoon Kim ◽  
Seong Park

Much of the research regarding streaming data has focused only on real time querying and analysis of recent data stream allowable in memory. However, as data stream mining, or tracking of past data streams, is often required, it becomes necessary to store large volumes of streaming data in stable storage. Moreover, as stable storage has restricted capacity, past data stream must be summarized. The summarization must be performed periodically because streaming data flows continuously, quickly, and endlessly. Therefore, in this paper, we propose an efficient periodic summarization method with a flexible storage allocation. It improves the overall estimation error by flexibly adjusting the size of the summarized data of each local time section. Additionally, as the processing overhead of compression and the disk I/O cost of decompression can be an important factor for quick summarization, we also consider setting the proper size of data stream to be summarized at a time. Some experimental results with artificial data sets as well as real life data show that our flexible approach is more efficient than the existing fixed approach.


Author(s):  
Ting Xie ◽  
Taiping Zhang

As a powerful unsupervised learning technique, clustering is the fundamental task of big data analysis. However, many traditional clustering algorithms for big data that is a collection of high dimension, sparse and noise data do not perform well both in terms of computational efficiency and clustering accuracy. To alleviate these problems, this paper presents Feature K-means clustering model on the feature space of big data and introduces its fast algorithm based on Alternating Direction Multiplier Method (ADMM). We show the equivalence of the Feature K-means model in the original space and the feature space and prove the convergence of its iterative algorithm. Computationally, we compare the Feature K-means with Spherical K-means and Kernel K-means on several benchmark data sets, including artificial data and four face databases. Experiments show that the proposed approach is comparable to the state-of-the-art algorithm in big data clustering.


Sign in / Sign up

Export Citation Format

Share Document