scholarly journals A Comparison of Categorical Attribute Data Clustering Methods

Author(s):  
Ville Hautamäki ◽  
Antti Pöllänen ◽  
Tomi Kinnunen ◽  
Kong Aik Lee ◽  
Haizhou Li ◽  
...  
Author(s):  
Shihua Liu ◽  
Hao Zhang ◽  
Xianghua Liu

A Two-stage clustering framework and a clustering algorithm for mixed attribute data based on density peaks and Goodall distance are proposed. Firstly, the subset of numerical attributes of the dataset is clustered, and then the result is mapped into one-dimensional categorical attribute and added to the subset of categorical attribute data. Finally, the new dataset is clustered by the density peaks clustering algorithm to obtain the final result. Experiments on three commonly used UCI datasets show that this algorithm can effectively realize mixed attribute clustering and produce better clustering results than the traditional K-prototypes algorithm do. The clustering accuracy on the Acute, Heart and Credit datasets are 17%, 24%, and 21% higher on average than that of the K-prototypes, respectively.


2018 ◽  
Vol 6 (2) ◽  
Author(s):  
Elly Muningsih - AMIK BSI Yogyakarta

Abstract ~ The K-Means method is one of the clustering methods that is widely used in data clustering research. While the K-Medoids method is an efficient method used for processing small data. This study aims to compare two clustering methods by grouping customers into 3 clusters according to their characteristics, namely very potential (loyal) customers, potential customers and non potential customers. The method used in this study is the K-Means clustering method and the K-Medoids method. The data used is online sales transaction. The clustering method testing is done by using a Fuzzy RFM (Recency, Frequenty and Monetary) model where the average (mean) of the third value is taken. From the data testing is known that the K-Means method is better than the K-Medoids method with an accuracy value of 90.47%. Whereas from the data processing carried out is known that cluster 1 has 16 members (customers), cluster 2 has 11 members and cluster 3 has 15 members. Keywords : clustering, K-Means method, K-Medoids method, customer, Fuzzy RFM model. Abstrak ~ Metode K-Means merupakan salah satu metode clustering yang banyak digunakan dalam penelitian pengelompokan data. Sedangkan metode K-Medoids merupakan metode yang efisien digunakan untuk pengolahan data yang kecil. Penelitian ini bertujuan untuk membandingkan atau mengkomparasi dua metode clustering dengan cara mengelompokkan pelanggan menjadi 3 cluster sesuai dengan karakteristiknya, yaitu pelanggan sangat potensial (loyal), pelanggan potensial dan pelanggan kurang (tidak) potensial. Metode yang digunakan dalam penelitian ini adalah metode clustering K-Means dan metode K-Medoids. Data yang digunakan adalah data transaksi penjualan online. Pengujian metode clustering yang dilakukan adalah dengan menggunakan model Fuzzy RFM (Recency, Frequenty dan Monetary) dimana diambil rata-rata (mean) dari nilai ketiga tersebut. Dari pengujian data diketahui bahwa metode K-Means lebih baik dari metode K-Medoids dengan nilai akurasi 90,47%. Sedangkan dari pengolahan data yang dilakukan diketahui bahwa cluster 1 memiliki 16 anggota (pelanggan), cluster 2 memiliki 11 anggota dan cluster 3 memiliki 15 anggota. Kata kunci : clustering, metode K-Means, metode K-Medoids, pelanggan, model Fuzzy RFM.


Data clustering is an active topic of research as it has applications in various fields such as biology, management, statistics, pattern recognition, etc. Spectral Clustering (SC) has gained popularity in recent times due to its ability to handle complex data and ease of implementation. A crucial step in spectral clustering is the construction of the affinity matrix, which is based on a pairwise similarity measure. The varied characteristics of datasets affect the performance of a spectral clustering technique. In this paper, we have proposed an affinity measure based on Topological Node Features (TNFs) viz., Clustering Coefficient (CC) and Summation index (SI) to define the notion of density and local structure. It has been shown that these features improve the performance of SC in clustering the data. The experiments were conducted on synthetic datasets, UCI datasets, and the MNIST handwritten datasets. The results show that the proposed affinity metric outperforms several recent spectral clustering methods in terms of accuracy.


Author(s):  
B.K. Tripathy ◽  
Adhir Ghosh

Developing Data Clustering algorithms have been pursued by researchers since the introduction of k-means algorithm (Macqueen 1967; Lloyd 1982). These algorithms were subsequently modified to handle categorical data. In order to handle the situations where objects can have memberships in multiple clusters, fuzzy clustering and rough clustering methods were introduced (Lingras et al 2003, 2004a). There are many extensions of these initial algorithms (Lingras et al 2004b; Lingras 2007; Mitra 2004; Peters 2006, 2007). The MMR algorithm (Parmar et al 2007), its extensions (Tripathy et al 2009, 2011a, 2011b) and the MADE algorithm (Herawan et al 2010) use rough set techniques for clustering. In this chapter, the authors focus on rough set based clustering algorithms and provide a comparative study of all the fuzzy set based and rough set based clustering algorithms in terms of their efficiency. They also present problems for future studies in the direction of the topics covered.


Author(s):  
Dušan Husek ◽  
Jaroslav Pokorny ◽  
Hana Rezankova ◽  
Václav Snasel

Document and information retrieval (IR) is an important task for Web communities. In this chapter, we introduce some clustering methods and focus on their use for the clustering, classification, and retrieval of Web documents.


2020 ◽  
Vol 18 (04) ◽  
pp. 2040005
Author(s):  
Ruiyi Li ◽  
Jihong Guan ◽  
Shuigeng Zhou

Clustering analysis has been widely applied to single-cell RNA-sequencing (scRNA-seq) data to discover cell types and cell states. Algorithms developed in recent years have greatly helped the understanding of cellular heterogeneity and the underlying mechanisms of biological processes. However, these algorithms often use different techniques, were evaluated on different datasets and compared with some of their counterparts usually using different performance metrics. Consequently, there lacks an accurate and complete picture of their merits and demerits, which makes it difficult for users to select proper algorithms for analyzing their data. To fill this gap, we first do a review on the major existing scRNA-seq data clustering methods, and then conduct a comprehensive performance comparison among them from multiple perspectives. We consider 13 state of the art scRNA-seq data clustering algorithms, and collect 12 publicly available real scRNA-seq datasets from the existing works to evaluate and compare these algorithms. Our comparative study shows that the existing methods are very diverse in performance. Even the top-performance algorithms do not perform well on all datasets, especially those with complex structures. This suggests that further research is required to explore more stable, accurate, and efficient clustering algorithms for scRNA-seq data.


2013 ◽  
Vol 2013 ◽  
pp. 1-8 ◽  
Author(s):  
Amit Banerjee

Density-based clustering methods are known to be robust against outliers in data; however, they are sensitive to user-specified parameters, the selection of which is not trivial. Moreover, relational data clustering is an area that has received considerably less attention than object data clustering. In this paper, two approaches to robust density-based clustering for relational data using evolutionary computation are investigated.


Sign in / Sign up

Export Citation Format

Share Document