An Empirical Comparison of Latest Data Clustering Algorithms with State-of-the-Art

Author(s):  
Xianjin Shi ◽  
Wanwan Wang ◽  
Chongsheng Zhang

Over the past few decades, a great many data clustering algorithms have been developed, including K-Means, DBSCAN, Bi-Clustering and Spectral clustering. In recent years, two new data clustering algorithms have been proposed: affinity propagation (AP, 2007) and density peak based clustering (DP, 2014). In this work, we empirically compare the performance of these two latest data clustering algorithms with state-of-the-art algorithms, using 6 external and 2 internal clustering validation metrics. Our experimental results on 16 public datasets show that the two latest clustering algorithms, AP and DP, do not always outperform DBSCAN. Therefore, to find the best clustering algorithm for a specific dataset, AP, DP and DBSCAN should all be considered. Moreover, we find that the comparison of different clustering algorithms depends closely on the clustering evaluation metrics adopted. For instance, when using the Silhouette clustering validation metric, the overall performance of K-Means is as good as that of AP and DP. This work provides a useful reference for researchers and engineers who need to select appropriate clustering algorithms for their specific applications.
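As a rough illustration of the internal validation metric named above, the Silhouette coefficient can be computed from pairwise distances alone. The sketch below is a minimal pure-Python version that assumes Euclidean distance; the function names and the toy data are illustrative and are not taken from the paper.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def silhouette_score(points, labels):
    """Mean silhouette value over all points.

    For point i, a is the mean distance to the other members of its own
    cluster and b is the smallest mean distance to any other cluster;
    s(i) = (b - a) / max(a, b). Singleton clusters contribute 0.
    """
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # s(i) = 0 by convention for singleton clusters
        a = sum(euclidean(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        if denom > 0:  # guard against coincident points
            total += (b - a) / denom
    return total / len(points)


# Toy data (assumed for illustration): two well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_score(points, [0, 0, 1, 1])  # matches the visible grouping
bad = silhouette_score(points, [0, 1, 0, 1])   # splits each pair apart
```

On this toy data the partition that matches the visible grouping scores close to 1, while the partition that splits each pair scores below 0. A higher score here only means the partition fits this metric's notion of compactness and separation.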

2020 ◽  
Vol 34 (04) ◽  
pp. 5867-5874
Author(s):  
Gan Sun ◽  
Yang Cong ◽  
Qianqian Wang ◽  
Jun Li ◽  
Yun Fu

In the past decades, spectral clustering (SC) has become one of the most effective clustering algorithms. However, most previous studies focus on spectral clustering tasks with a fixed task set, and cannot incorporate a new spectral clustering task without access to previously learned tasks. In this paper, we explore the problem of spectral clustering in a lifelong machine learning framework, i.e., Lifelong Spectral Clustering (L2SC). Its goal is to efficiently learn a model for a new spectral clustering task by selectively transferring previously accumulated experience from a knowledge library. Specifically, the knowledge library of L2SC contains two components: 1) an orthogonal basis library, capturing latent cluster centers among the clusters in each pair of tasks; 2) a feature embedding library, embedding the feature manifold information shared among multiple related tasks. As a new spectral clustering task arrives, L2SC first transfers knowledge from both the basis library and the feature library to obtain an encoding matrix, and then refines the library base over time to maximize performance across all the clustering tasks. Meanwhile, a general online update formulation is derived to alternately update the basis library and the feature library. Finally, empirical experiments on several real-world benchmark datasets demonstrate that our L2SC model can effectively improve clustering performance compared with other state-of-the-art spectral clustering algorithms.


Symmetry ◽  
2021 ◽  
Vol 13 (4) ◽  
pp. 596
Author(s):  
Krishna Kumar Sharma ◽  
Ayan Seal ◽  
Enrique Herrera-Viedma ◽  
Ondrej Krejcar

Calculating and monitoring customer churn metrics is important for companies to retain customers and earn more profit. In this study, a churn prediction framework is developed using a modified spectral clustering (SC) algorithm. The similarity measure plays a crucial role in clustering for predicting churn accurately from industrial data. The linear Euclidean distance in traditional SC is replaced by the non-linear S-distance (Sd), which is derived from the concept of S-divergence (SD). Several characteristics of Sd are discussed in this work. Experiments are conducted to validate the proposed clustering algorithm on four synthetic, eight UCI and two industrial databases, and one telecommunications database related to customer churn. Three existing clustering algorithms (k-means, density-based spatial clustering of applications with noise, and conventional SC) are also run on the above-mentioned 15 databases. The empirical outcomes show that the proposed clustering algorithm beats the three existing clustering algorithms in terms of Jaccard index, f-score, recall, precision and accuracy. Finally, we also test the significance of the clustering results using Wilcoxon's signed-rank test, Wilcoxon's rank-sum test, and sign tests. The comparative study shows that the outcomes of the proposed algorithm are interesting, especially in the case of clusters of arbitrary shape.


2014 ◽  
Vol 687-691 ◽  
pp. 1350-1353
Author(s):  
Li Li Fu ◽  
Yong Li Liu ◽  
Li Jing Hao

Spectral clustering is a class of clustering algorithms based on spectral graph theory. Because spectral clustering has a deep theoretical foundation and handles non-convex distributions well, it has received much attention in machine learning and data mining. The algorithm is easy to implement and often outperforms traditional clustering algorithms such as K-means. This paper aims to give some intuition about spectral clustering. We describe different graph partition criteria, the definition of spectral clustering, and the clustering steps. Finally, some improvements that address the disadvantages of spectral clustering are introduced briefly.
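For reference, two graph partition criteria that are standard in the spectral clustering literature are RatioCut and the normalized cut (NCut). The abstract does not say which criteria the paper covers, so the definitions below are given on the assumption that these are among them. Here $w_{ij}$ is the edge weight between nodes $i$ and $j$, $\bar{A}$ is the complement of the node set $A$, $|A|$ is the number of nodes in $A$, and $d_i = \sum_j w_{ij}$ is the degree of node $i$.

```latex
W(A, \bar{A}) = \sum_{i \in A,\; j \in \bar{A}} w_{ij},
\qquad
\operatorname{vol}(A) = \sum_{i \in A} d_i

\operatorname{RatioCut}(A_1, \dots, A_k)
  = \frac{1}{2} \sum_{m=1}^{k} \frac{W(A_m, \bar{A}_m)}{|A_m|}

\operatorname{NCut}(A_1, \dots, A_k)
  = \frac{1}{2} \sum_{m=1}^{k} \frac{W(A_m, \bar{A}_m)}{\operatorname{vol}(A_m)}
```

Both criteria penalize the weight of the edges that are cut while discouraging very small clusters. RatioCut balances clusters by node count and NCut balances them by total degree.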


Author(s):  
Hind Bangui ◽  
Mouzhi Ge ◽  
Barbora Buhnova

Due to the massive increase of data in different Internet of Things (IoT) domains such as healthcare IoT and Smart City IoT, Big Data technologies have emerged as critical tools for analyzing IoT data. Among the Big Data technologies, data clustering is one of the essential approaches to processing IoT data. However, how to select a suitable clustering algorithm for IoT data is still unclear. Furthermore, since Big Data technologies are still in their initial stage in different IoT domains, it is valuable to propose and structure the research challenges between Big Data and IoT. Therefore, this article starts by reviewing and comparing the data clustering algorithms that can be applied to IoT datasets, and then extends the discussion to a broader IoT context such as IoT dynamics and IoT mobile networks. Finally, this article identifies a set of research challenges that form a research roadmap for Big Data research in IoT domains. The proposed research roadmap aims at bridging the research gaps between Big Data and various IoT contexts.


Algorithms ◽  
2018 ◽  
Vol 11 (11) ◽  
pp. 177
Author(s):  
Xuedong Gao ◽  
Minghan Yang

Clustering is one of the main tasks of machine learning. Internal clustering validation indexes (CVIs) are used to measure the quality of several clustered partitions to determine the locally optimal clustering results in an unsupervised manner, and can act as the objective function of clustering algorithms. In this paper, we first studied several well-known internal CVIs for categorical data clustering, and proved the ineffectiveness of evaluating partitions with different numbers of clusters without any inter-cluster separation measures or assumptions; the accuracy of the separation measure, along with its coordination with the intra-cluster compactness measures, can notably affect performance. Then, aiming to enhance internal clustering validation, we proposed a new internal CVI, clustering utility based on the averaged information gain of isolating each cluster (CUBAGE), which measures both the compactness and the separation of the partition. The experimental results supported our findings with regard to the existing internal CVIs, and showed that the proposed CUBAGE outperforms other internal CVIs with or without a pre-known number of clusters.


2020 ◽  
Vol 2020 ◽  
pp. 1-6
Author(s):  
Shuxia Ren ◽  
Shubo Zhang ◽  
Tao Wu

The similarity graphs of most spectral clustering algorithms carry a lot of incorrect community information. In this paper, we propose a probability matrix and a novel improved spectral clustering algorithm based on the probability matrix for community detection. First, a Markov chain is used to calculate the transition probability between nodes, and the probability matrix is constructed from the transition probabilities. Then, the similarity graph is constructed with the mean probability matrix. Finally, community detection is achieved by optimizing the NCut objective function. The proposed algorithm is compared with SC, WT, FG, FluidC, and SCRW on artificial networks and real networks. Experimental results show that the proposed algorithm can detect communities more accurately and has better clustering performance.
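The first step described above, deriving Markov-chain transition probabilities between nodes, is commonly done by row-normalising the adjacency matrix so that each row sums to one. The sketch below shows only that one-step transition matrix. How the paper builds its mean probability matrix is not stated in the abstract, so that step is left out, and the function name and example graph are assumptions for illustration.

```python
def transition_matrix(adjacency):
    """Row-normalise an adjacency matrix: P[i][j] = A[i][j] / degree(i).

    P[i][j] is the probability that a random walker at node i moves to
    node j in one step. Isolated nodes (degree 0) get an all-zero row.
    """
    matrix = []
    for row in adjacency:
        degree = sum(row)
        if degree == 0:
            matrix.append([0.0] * len(row))
        else:
            matrix.append([weight / degree for weight in row])
    return matrix


# Hypothetical undirected graph: a triangle (nodes 0, 1, 2) with a
# pendant node 3 attached to node 2.
adjacency = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
P = transition_matrix(adjacency)
```

In this example the pendant node 3 can only move to node 2, while node 2 moves to each of its three neighbours with equal probability.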


2011 ◽  
Vol 301-303 ◽  
pp. 1133-1138
Author(s):  
Yan Xiang Fu ◽  
Wei Zhong Zhao ◽  
Hui Fang Ma

Data clustering has received considerable attention in many applications, such as data mining, document retrieval, image segmentation and pattern classification. The growing volume of information produced by technological progress makes clustering very large-scale data a challenging task. To deal with this problem, more researchers are trying to design efficient parallel clustering algorithms. In this paper, we propose a parallel DBSCAN clustering algorithm based on Hadoop, which is a simple yet powerful parallel programming platform. The experimental results demonstrate that the proposed algorithm scales well and efficiently processes large datasets on commodity hardware.


Author(s):  
Hui Du ◽  
Yuping Wang ◽  
Xiaopan Dong

Clustering is a popular and effective method for image segmentation. However, existing clustering methods often suffer from the following problems: (1) they need a huge amount of space and computation when the input data are large; (2) they need some parameters (e.g. the number of clusters) to be assigned in advance, which greatly affects the clustering results. To save space and computation, reduce the sensitivity to the parameters, and improve the effectiveness and efficiency of clustering, we construct a new clustering algorithm for image segmentation. The new algorithm consists of two phases: coarsening clustering and exact clustering. First, we use the Affinity Propagation (AP) algorithm for coarsening. Specifically, in order to save space and computational cost, we only compute the similarity between each point and its t nearest neighbors, and obtain a condensed similarity matrix (with only t columns, where t << N and N is the number of data points). Second, to further improve the efficiency and effectiveness of the proposed algorithm, Self-tuning Spectral Clustering (SSC) is applied to the resulting points (the representative points obtained in the first phase) to perform the exact clustering. As a result, the proposed algorithm can quickly and precisely perform clustering for texture image segmentation. The experimental results show that the proposed algorithm is more efficient than the compared algorithms FCM, K-means and SOM.
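The condensed similarity matrix described above can be pictured as two N x t arrays: one holding the indices of each point's t nearest neighbors and one holding the matching similarities. The sketch below illustrates only that storage layout. It assumes the negative squared Euclidean distance as the similarity (a common choice for AP, not confirmed by the abstract) and uses a brute-force neighbor search, so it still evaluates all pairwise distances; a real implementation would need a spatial index to save computation as well. All names are hypothetical.

```python
def condensed_similarity(points, t):
    """Keep, for each point, only its t most similar other points.

    Returns (neighbors, sims), two lists of N rows with t entries each:
    neighbors[i][k] is the index of the k-th nearest neighbor of point i
    and sims[i][k] is the similarity between them.
    """
    n = len(points)
    if not 0 < t < n:
        raise ValueError("t must satisfy 0 < t < number of points")

    neighbors, sims = [], []
    for i in range(n):
        scored = [
            (-sum((a - b) ** 2 for a, b in zip(points[i], points[j])), j)
            for j in range(n)
            if j != i
        ]
        scored.sort(reverse=True)  # largest similarity (closest point) first
        top = scored[:t]
        neighbors.append([j for _, j in top])
        sims.append([s for s, _ in top])
    return neighbors, sims


# Toy one-dimensional data, assumed for illustration.
points = [(0.0,), (1.0,), (3.0,), (10.0,)]
neighbors, sims = condensed_similarity(points, t=2)
```

With N points this layout stores 2 * N * t values instead of the N * N entries of a full similarity matrix, which is where the space saving comes from when t << N.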


2018 ◽  
Vol 7 (1) ◽  
pp. 55-62
Author(s):  
Mohammad Alaqtash ◽  
Moayad A.Fadhil ◽  
Ali F. Al-Azzawi

Clustering is an important approach that enables the grouping of unlabeled data by partitioning the data into clusters with similar patterns. Over the past decades, many clustering algorithms have been developed for various clustering problems. An overlapping partitioning clustering (OPC) algorithm can only handle numerical data. Hence, novel clustering algorithms have been studied extensively to overcome this issue. By increasing the number of objects belonging to one cluster and the distance between cluster centers, the study aimed to cluster textual data without losing the main functions. The study used the twenty newsgroups dataset, which consists of approximately 20,000 textual documents. By introducing some modifications to the traditional algorithm, clusters with an acceptable level of homogeneity and completeness were generated. Modifications were made to the pre-processing phase and the data representation, along with the number methods which influence the primary function of the algorithm. Subsequently, the results were evaluated and compared with those of the k-means algorithm on the training and test datasets. The results indicated that the modified algorithm could successfully handle the categorical data and produce satisfactory clusters.


2021 ◽  
Author(s):  
Christian Nordahl ◽  
Veselka Boeva ◽  
Håkan Grahn ◽  
Marie Persson Netz

Data has become an integral part of our society in the past years, arriving faster and in larger quantities than before. Traditional clustering algorithms rely on the availability of entire datasets to model them correctly and efficiently. Such requirements are not possible in the data stream clustering scenario, where data arrives and needs to be analyzed continuously. This paper proposes a novel evolutionary clustering algorithm, entitled EvolveCluster, capable of modeling evolving data streams. We compare EvolveCluster against two other evolutionary clustering algorithms, PivotBiCluster and Split-Merge Evolutionary Clustering, by conducting experiments on three different datasets. Furthermore, we perform additional experiments on EvolveCluster to further evaluate its capabilities on clustering evolving data streams. Our results show that EvolveCluster manages to capture evolving data stream behaviors and adapts accordingly.

