An empirical comparison between stochastic and deterministic centroid initialisation for K-means variations

Machine Learning ◽

10.1007/s10994-021-06021-7 ◽

2021 ◽

Author(s):

Avgoustinos Vouros ◽

Stephen Langdell ◽

Mike Croucher ◽

Eleni Vasilaki

Keyword(s):

Data Clustering ◽

Stochastic Methods ◽

Data Sets ◽

Local Minima ◽

Clustering Methods ◽

Empirical Comparison ◽

Real World Data ◽

Trade Off ◽

Deterministic Methods ◽

Better Than

Abstract: K-Means is one of the most used algorithms for data clustering and the usual clustering method for benchmarking. Despite its wide application, it is well known to suffer from a series of disadvantages: it is only able to find local minima, and the positions of the initial clustering centres (centroids) can greatly affect the clustering solution. Over the years many K-Means variations and initialisation techniques have been proposed with different degrees of complexity. In this study we focus on common K-Means variations along with a range of deterministic and stochastic initialisation techniques. We show that, on average, more sophisticated initialisation techniques alleviate the need for complex clustering methods. Furthermore, deterministic methods perform better than stochastic methods. However, there is a trade-off: less sophisticated stochastic methods, executed multiple times, can result in better clustering. Factoring in execution time, deterministic methods can be competitive and result in a good clustering solution. These conclusions are obtained through extensive benchmarking using a range of synthetic model generators and real-world data sets.
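The trade-off described in this abstract can be illustrated with a minimal sketch. It is not the authors' benchmark and does not use the initialisation methods they evaluate: `random_init` stands in for a simple stochastic method, `farthest_first_init` is one illustrative deterministic rule chosen here as an assumption, and all function names are hypothetical.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def lloyd(points, centroids, max_iter=100):
    """Plain Lloyd iterations from the given start; returns (centroids, SSE)."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        updated = [
            tuple(sum(xs) / len(members) for xs in zip(*members)) if members else centroids[j]
            for j, members in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    sse = sum(min(sq_dist(p, c) for c in centroids) for p in points)
    return centroids, sse


def random_init(points, k, rng):
    """Stochastic initialisation: k distinct data points chosen at random."""
    return rng.sample(points, k)


def farthest_first_init(points, k):
    """Deterministic initialisation (an illustrative assumption): start at the
    point closest to the data mean, then repeatedly add the point farthest
    from all centres chosen so far."""
    mean = tuple(sum(xs) / len(points) for xs in zip(*points))
    chosen = [min(points, key=lambda p: sq_dist(p, mean))]
    while len(chosen) < k:
        chosen.append(max(points, key=lambda p: min(sq_dist(p, c) for c in chosen)))
    return chosen


def best_of_restarts(points, k, restarts, seed=0):
    """Lowest SSE over several randomly initialised runs."""
    rng = random.Random(seed)
    return min(lloyd(points, random_init(points, k, rng))[1] for _ in range(restarts))
```

Comparing `lloyd(points, farthest_first_init(points, k))[1]` with `best_of_restarts(points, k, restarts)` for increasing `restarts` shows the pattern in miniature: the deterministic start costs one run, while the stochastic start can only match or improve its best SSE as more runs are added.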

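KOMPARASI METODE CLUSTERING K-MEANS DAN K-MEDOIDS DENGAN MODEL FUZZY RFM UNTUK PENGELOMPOKAN PELANGGAN [Comparison of the K-Means and K-Medoids Clustering Methods with the Fuzzy RFM Model for Customer Grouping]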

Evolusi : Jurnal Sains dan Manajemen ◽

10.31294/evolusi.v6i2.4600 ◽

2018 ◽

Vol 6 (2) ◽

Author(s):

Elly Muningsih - AMIK BSI Yogyakarta

Keyword(s):

Data Clustering ◽

Small Data ◽

Clustering Methods ◽

Monetary Model ◽

Clustering Method ◽

Online Sales ◽

Rfm Model ◽

Potential Customers ◽

Cluster 2 ◽

Better Than

Abstract ~ The K-Means method is one of the most widely used clustering methods in data clustering research, while the K-Medoids method is efficient for processing small data sets. This study compares the two clustering methods by grouping customers into 3 clusters according to their characteristics: very potential (loyal) customers, potential customers and non-potential customers. The data used are online sales transactions. The clustering methods are tested using a Fuzzy RFM (Recency, Frequency and Monetary) model, where the average (mean) of the three values is taken. The tests show that the K-Means method is better than the K-Medoids method, with an accuracy value of 90.47%. The data processing shows that cluster 1 has 16 members (customers), cluster 2 has 11 members and cluster 3 has 15 members. Keywords: clustering, K-Means method, K-Medoids method, customer, Fuzzy RFM model.

Possibilistic Clustering Methods for Interval-Valued Data

International Journal of Uncertainty Fuzziness and Knowledge-Based Systems ◽

10.1142/s0218488514500135 ◽

2014 ◽

Vol 22 (02) ◽

pp. 263-291 ◽

Cited By ~ 2

Author(s):

Bruno Almeida Pimentel ◽

Renata M. C. R. De Souza

Keyword(s):

Credit Card ◽

Data Sets ◽

Clustering Methods ◽

Cluster Algorithms ◽

Research Areas ◽

Possibilistic Clustering ◽

Membership Value ◽

Type Data ◽

Interval Type ◽

Better Than

Outliers may have many anomalous causes, for example, credit card fraud, cyberintrusion or breakdown of a system. Several research areas and application domains have investigated this problem. The popular fuzzy c-means algorithm is sensitive to noise and outlying data. In contrast, the possibilistic partitioning methods are used to solve these problems and other ones. The goal of this paper is to introduce cluster algorithms for partitioning a set of symbolic interval-type data using the possibilistic approach. In addition, a new way of measuring the membership value, according to each feature, is proposed. Experiments with artificial and real symbolic interval-type data sets are used to evaluate the methods. The results of the proposed methods are better than the traditional soft clustering ones.
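For orientation, the contrast between fuzzy and possibilistic memberships can be written out for ordinary point-valued data. The formulas below are the classical fuzzy c-means and possibilistic c-means updates, not the interval-valued or per-feature variants proposed in this paper; the symbols ($d_{ik}$ for the distance from datum $k$ to prototype $i$, $m > 1$ for the fuzzifier, $c$ for the number of clusters, $\eta_i > 0$ for a per-cluster scale) are notation assumed here.

```latex
% Fuzzy c-means: memberships of datum k are coupled across clusters
% and must sum to one, so an outlier still receives substantial membership.
u_{ik}^{\mathrm{FCM}} = \frac{1}{\sum_{j=1}^{c} \left( \dfrac{d_{ik}}{d_{jk}} \right)^{2/(m-1)}},
\qquad \sum_{i=1}^{c} u_{ik}^{\mathrm{FCM}} = 1 .

% Possibilistic c-means: each membership depends only on the distance
% to its own prototype, so it decays towards zero for distant data.
u_{ik}^{\mathrm{PCM}} = \frac{1}{1 + \left( \dfrac{d_{ik}^{2}}{\eta_i} \right)^{1/(m-1)}} .
```

Because the possibilistic membership is not forced to sum to one over clusters, a point far from every prototype gets a low membership everywhere, which is the source of the robustness to noise and outliers mentioned above.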

On Fuzzy Non-Metric Model for Data with Tolerance and its Application to Incomplete Data Clustering

Journal of Advanced Computational Intelligence and Intelligent Informatics ◽

10.20965/jaciii.2016.p0571 ◽

2016 ◽

Vol 20 (4) ◽

pp. 571-579 ◽

Cited By ~ 1

Author(s):

Yasunori Endo ◽

Tomoyuki Suzuki ◽

Naohiko Kinoshita ◽

Yukihiro Hamasuna ◽

...

Keyword(s):

Data Clustering ◽

Incomplete Data ◽

Clustering Algorithm ◽

Uncertain Data ◽

Data Sets ◽

Membership Degree ◽

Clustering Methods ◽

Clustering Method ◽

Numerical Examples ◽

Metric Model

The fuzzy non-metric model (FNM) is a representative non-hierarchical clustering method. It is very useful because the belongingness, or membership degree, of each datum to each cluster can be calculated directly from the dissimilarities between data, without using cluster centers. However, the original FNM cannot handle data with uncertainty. In this study, we refer to data with uncertainty as “uncertain data,” e.g., incomplete data or data that have errors. Previously, a method based on the concept of a tolerance vector was proposed for handling uncertain data, and some clustering methods were constructed according to this concept, e.g., fuzzy c-means for data with tolerance. These methods can handle uncertain data in the framework of optimization. Thus, in the present study, we apply the concept to FNM. First, we propose a new clustering algorithm based on FNM using the concept of tolerance, which we refer to as the fuzzy non-metric model for data with tolerance. Second, we show that the proposed algorithm can handle incomplete data sets. Third, we verify the effectiveness of the proposed algorithm based on comparisons with conventional methods for incomplete data sets in some numerical examples.

Slowness as a Proxy for Temporal Predictability: An Empirical Comparison

Neural Computation ◽

10.1162/neco_a_01070 ◽

2018 ◽

Vol 30 (5) ◽

pp. 1151-1179 ◽

Cited By ~ 2

Author(s):

Björn Weghenkel ◽

Laurenz Wiskott

Keyword(s):

Feature Analysis ◽

Data Sets ◽

Empirical Comparison ◽

Real World Data ◽

Slow Feature Analysis ◽

Temporal Predictability ◽

Highly Correlated ◽

The Relationship ◽

Learned Features ◽

Special Case

The computational principles of slowness and predictability have been proposed to describe aspects of information processing in the visual system. From the perspective of slowness being a limited special case of predictability we investigate the relationship between these two principles empirically. On a collection of real-world data sets we compare the features extracted by slow feature analysis (SFA) to the features of three recently proposed methods for predictable feature extraction: forecastable component analysis, predictable feature analysis, and graph-based predictable feature analysis. Our experiments show that the predictability of the learned features is highly correlated, and, thus, SFA appears to effectively implement a method for extracting predictable features according to different measures of predictability.
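The slowness principle referred to here is usually stated as the optimisation problem below. This is the standard formulation of slow feature analysis, given as background rather than taken from this abstract; the notation ($\mathbf{x}(t)$ for the input signal, $g_j$ for the learned functions, $y_j$ for the output features, $\langle \cdot \rangle_t$ for the time average) is assumed here.

```latex
% Slow feature analysis: find functions g_j whose outputs
% y_j(t) = g_j(x(t)) vary as slowly as possible over time.
\min_{g_j} \; \Delta(y_j) = \left\langle \dot{y}_j^{\,2} \right\rangle_t
\quad \text{subject to} \quad
\langle y_j \rangle_t = 0, \qquad
\langle y_j^{2} \rangle_t = 1, \qquad
\langle y_i \, y_j \rangle_t = 0 \;\; \text{for all } i < j .
```

The zero-mean and unit-variance constraints exclude the trivial constant solution, and the decorrelation constraint makes successive features carry different information, ordered from slowest to fastest.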

Affinity Learning for Mixed Data Clustering

Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2017/302 ◽

2017 ◽

Cited By ~ 1

Author(s):

Nan Li ◽

Longin Jan Latecki

Keyword(s):

Data Clustering ◽

Mixed Type ◽

Original Data ◽

Mixed Data ◽

Abstract Objects ◽

Data Sets ◽

Process Data ◽

Real World Data ◽

Specific Data ◽

Data Points

In this paper, we propose a novel affinity learning based framework for mixed data clustering, which includes: how to process data with mixed-type attributes, how to learn affinities between data points, and how to exploit the learned affinities for clustering. In the proposed framework, each original data attribute is represented with several abstract objects defined according to the specific data type and values. Each attribute value is transformed into the initial affinities between the data point and the abstract objects of attribute. We refine these affinities and infer the unknown affinities between data points by taking into account the interconnections among the attribute values of all data points. The inferred affinities between data points can be exploited for clustering. Alternatively, the refined affinities between data points and the abstract objects of attributes can be transformed into new data features for clustering. Experimental results on many real world data sets demonstrate that the proposed framework is effective for mixed data clustering.

Distributed Pareto Optimization for Subset Selection

Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2018/207 ◽

2018 ◽

Cited By ~ 2

Author(s):

Chao Qian ◽

Guiying Li ◽

Chao Feng ◽

Ke Tang

Keyword(s):

Real World ◽

Large Scale ◽

State Of The Art ◽

Subset Selection ◽

Data Sets ◽

Mapreduce Framework ◽

Real World Data ◽

Real World Applications ◽

Approximation Guarantee ◽

Better Than

The subset selection problem that selects a few items from a ground set arises in many applications such as maximum coverage, influence maximization, sparse regression, etc. The recently proposed POSS algorithm is a powerful approximation solver for this problem. However, POSS requires centralized access to the full ground set, and thus is impractical for large-scale real-world applications, where the ground set is too large to be stored on one single machine. In this paper, we propose a distributed version of POSS (DPOSS) with a bounded approximation guarantee. DPOSS can be easily implemented in the MapReduce framework. Our extensive experiments using Spark, on various real-world data sets with size ranging from thousands to millions, show that DPOSS can achieve competitive performance compared with the centralized POSS, and is almost always better than the state-of-the-art distributed greedy algorithm RandGreeDi.
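To make the subset selection setting concrete, the sketch below solves a small maximum coverage instance with the classic centralized greedy rule. It is a baseline illustration only: it is not POSS, DPOSS or RandGreeDi, and the function name and data layout are assumptions made for this example.

```python
def greedy_max_coverage(sets, k):
    """Pick up to k sets, each time adding the set that covers the most
    elements not yet covered. Returns (chosen indices, covered elements)."""
    covered = set()
    chosen = []
    for _ in range(k):
        best = max(
            (i for i in range(len(sets)) if i not in chosen),
            key=lambda i: len(sets[i] - covered),
            default=None,
        )
        # Stop early when no remaining set adds anything new.
        if best is None or not (sets[best] - covered):
            break
        chosen.append(best)
        covered |= sets[best]
    return chosen, covered
```

Every greedy step scans the whole ground set, which is the kind of centralized access that becomes impractical when the ground set no longer fits on one machine.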

Motif-based spectral clustering of weighted directed networks

Applied Network Science ◽

10.1007/s41109-020-00293-z ◽

2020 ◽

Vol 5 (1) ◽

Author(s):

William G. Underwood ◽

Andrew Elliott ◽

Mihai Cucuringu

Keyword(s):

Spectral Clustering ◽

Higher Order ◽

Order Structure ◽

Data Sets ◽

Clustering Methods ◽

Directed Networks ◽

Real World Data ◽

Diverse Range ◽

Weighted Networks ◽

Adjacency Matrices

Abstract: Clustering is an essential technique for network analysis, with applications in a diverse range of fields. Although spectral clustering is a popular and effective method, it fails to consider higher-order structure and can perform poorly on directed networks. One approach is to capture and cluster higher-order structures using motif adjacency matrices. However, current formulations fail to take edge weights into account, and thus are somewhat limited when weight is a key component of the network under study. We address these shortcomings by exploring motif-based weighted spectral clustering methods. We present new and computationally useful matrix formulae for motif adjacency matrices on weighted networks, which can be used to construct efficient algorithms for any anchored or non-anchored motif on three nodes. In a very sparse regime, our proposed method can handle graphs with a million nodes and tens of millions of edges. We further use our framework to construct a motif-based approach for clustering bipartite networks. We provide comprehensive experimental results, demonstrating (i) the scalability of our approach, (ii) advantages of higher-order clustering on synthetic examples, and (iii) the effectiveness of our techniques on a variety of real-world data sets; and compare against several techniques from the literature. We conclude that motif-based spectral clustering is a valuable tool for analysis of directed and bipartite weighted networks, which is also scalable and easy to implement.
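As a minimal illustration of a motif adjacency matrix, the sketch below handles only the simplest case: the triangle motif on an unweighted, undirected graph, where entry (i, j) counts the triangles containing the edge between i and j. The weighted and directed formulae introduced in the paper are not reproduced; the function name and the dense list-of-lists representation are assumptions made for this example.

```python
def triangle_motif_adjacency(adj):
    """Triangle motif adjacency for a symmetric 0/1 adjacency matrix.

    motif[i][j] is the number of nodes k that form a triangle with the
    edge (i, j); it is zero when i and j are not adjacent. In matrix
    terms this is the elementwise product of A @ A with A.
    """
    n = len(adj)
    motif = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and adj[i][j]:
                motif[i][j] = sum(
                    1
                    for k in range(n)
                    if k != i and k != j and adj[i][k] and adj[k][j]
                )
    return motif
```

Spectral clustering would then be run on this motif matrix in place of the original adjacency matrix, so that edges taking part in no triangle carry no weight.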

Big Data Clustering And Its Applications Examination

International Journal of Recent Technology and Engineering - 2 ◽

10.35940/ijrte.b1466.0982s1119 ◽

2019 ◽

Vol 8 (2S11) ◽

pp. 3687-3693

Keyword(s):

Data Mining ◽

Big Data ◽

Data Clustering ◽

Clustering Algorithms ◽

Large Data ◽

Data Sets ◽

Clustering Methods ◽

Time Saving ◽

Data Set ◽

The Many

Clustering is a type of mining process in which a data set is categorized into various subclasses. Clustering is essential in classification, grouping, exploratory pattern analysis, image segmentation and decision making. Big data can be described as very large data sets that are examined computationally to reveal patterns and associations, including those relating to human behavior and interactions. Big data is essential for many organisations, but in some cases it is very complex to store, and it is also time saving. One way of overcoming these issues is to develop clustering methods, although these suffer from high complexity. Data mining is a technique for extracting useful information, but data mining models cannot be used for big data because of its inherent complexity. The main aim here is to give an overview of data clustering categories for big data and to explain some of the related work. This survey concentrates on research into several clustering algorithms that work on the elements of big data. A short overview is also given of clustering algorithms grouped as partitioning, hierarchical, grid-based and model-based. Clustering is a major data mining technique and is used for analyzing big data. The problems of applying clustering to big data, and the new issues we face with big data, are also discussed.

An Incremental K-means algorithm

Proceedings of the Institution of Mechanical Engineers Part C Journal of Mechanical Engineering Science ◽

10.1243/0954406041319509 ◽

2004 ◽

Vol 218 (7) ◽

pp. 783-795 ◽

Cited By ~ 42

Author(s):

D T Pham ◽

S S Dimov ◽

C D Nguyen

Keyword(s):

Data Clustering ◽

Group Technology ◽

Family Formation ◽

Data Exploration ◽

Test Results ◽

Local Minima ◽

Clustering Methods ◽

Important Data ◽

Cluster Centre ◽

Distortion Reduction

Data clustering is an important data exploration technique with many applications in engineering, including parts family formation in group technology and segmentation in image processing. One of the most popular data clustering methods is K-means clustering because of its simplicity and computational efficiency. The main problem with this clustering method is its tendency to converge at a local minimum. In this paper, the cause of this problem is explained and an existing solution involving a cluster centre jumping operation is examined. The jumping technique alleviates the problem with local minima by enabling cluster centres to move in such a radical way as to reduce the overall cluster distortion. However, the method is very sensitive to errors in estimating distortion. A clustering scheme that is also based on distortion reduction through cluster centre movement but is not so sensitive to inaccuracies in distortion estimation is proposed in this paper. The scheme, which is an incremental version of the K-means algorithm, involves adding cluster centres one by one as clusters are being formed. The paper presents test results to demonstrate the efficacy of the proposed algorithm.
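The idea of adding cluster centres one by one can be sketched as follows. The abstract does not say where each new centre is placed, so the rule used here (insert the new centre at the point farthest from the centre of the cluster with the highest distortion) is an assumption made for illustration, as are all function names; this is not the authors' algorithm.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centres):
    """Group points by their nearest centre."""
    clusters = [[] for _ in centres]
    for p in points:
        j = min(range(len(centres)), key=lambda c: sq_dist(p, centres[c]))
        clusters[j].append(p)
    return clusters


def refine(points, centres, max_iter=100):
    """Standard K-means iterations from the given centres."""
    for _ in range(max_iter):
        clusters = assign(points, centres)
        updated = [
            tuple(sum(xs) / len(m) for xs in zip(*m)) if m else centres[j]
            for j, m in enumerate(clusters)
        ]
        if updated == centres:
            break
        centres = updated
    return centres


def incremental_kmeans(points, k):
    """Grow from one centre (the data mean) to k centres, one at a time."""
    centres = [tuple(sum(xs) / len(points) for xs in zip(*points))]
    while len(centres) < k:
        clusters = assign(points, centres)
        distortions = [sum(sq_dist(p, c) for p in m) for c, m in zip(centres, clusters)]
        worst = max(range(len(centres)), key=lambda j: distortions[j])
        if distortions[worst] == 0:
            break  # every point already sits on a centre
        # Assumed insertion rule: farthest point of the most distorted cluster.
        new_centre = max(clusters[worst], key=lambda p: sq_dist(p, centres[worst]))
        centres = refine(points, centres + [tuple(new_centre)])
    return centres
```

Because each new centre is placed where distortion is currently highest, the procedure is deterministic and does not depend on a random choice of all k starting centres.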

TrustSVD: A Novel Trust-Based Matrix Factorization Model with User Trust and Item Ratings

International Journal of Advanced Research in Computer Science and Software Engineering ◽

10.23956/ijarcsse.v7i11.422 ◽

2017 ◽

Vol 7 (11) ◽

pp. 7 ◽

Cited By ~ 1

Author(s):

K Sobha Rani

Keyword(s):

Matrix Factorization ◽

Social Trust ◽

State Of The Art ◽

Data Sets ◽

Real World Data ◽

Recommendation Algorithm ◽

Active User ◽

Factorization Model ◽

The Social ◽

Matrix Factorization Technique

Collaborative filtering suffers from the problems of data sparsity and cold start, which dramatically degrade recommendation performance. To help resolve these issues, we propose TrustSVD, a trust-based matrix factorization technique. By analyzing the social trust data from four real-world data sets, we conclude that not only the explicit but also the implicit influence of both ratings and trust should be taken into consideration in a recommendation model. Hence, we build on top of a state-of-the-art recommendation algorithm, SVD++, which inherently involves the explicit and implicit influence of rated items, by further incorporating both the explicit and implicit influence of trusted users on the prediction of items for an active user. To our knowledge, the work reported is the first to extend SVD++ with social trust information. Experimental results on the four data sets demonstrate that our approach, TrustSVD, achieves better accuracy than ten other counterparts, and can better handle the concerned issues.
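For reference, the SVD++ prediction rule that this work builds on, together with one way a trust term can be added, can be written as below. The first formula is the standard SVD++ model; the second is a sketch of the kind of extension the abstract describes and should be checked against the paper itself. The notation ($\mu$ for the global mean, $b_u$ and $b_i$ for user and item biases, $p_u$ and $q_i$ for latent factors, $N(u)$ for the items rated by user $u$, $T_u$ for the users trusted by $u$, $y_j$ and $w_v$ for implicit item and trust factors) is assumed here.

```latex
% SVD++: explicit user factors p_u plus the implicit influence
% of the items the user has rated.
\hat{r}_{ui} = \mu + b_u + b_i
  + q_i^{\top} \Bigl( p_u + |N(u)|^{-1/2} \sum_{j \in N(u)} y_j \Bigr)

% Sketch of a trust-aware extension (assumed form): add the implicit
% influence of the users that u trusts.
\hat{r}_{ui} = \mu + b_u + b_i
  + q_i^{\top} \Bigl( p_u + |N(u)|^{-1/2} \sum_{j \in N(u)} y_j
  + |T_u|^{-1/2} \sum_{v \in T_u} w_v \Bigr)
```

The normalising factors $|N(u)|^{-1/2}$ and $|T_u|^{-1/2}$ keep the implicit terms on a comparable scale for users with many or few ratings and trust links.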
