Clustering Categorical Data: A Survey

Author(s):  
Sami Naouali ◽  
Semeh Ben Salem ◽  
Zied Chtourou

Clustering is an unsupervised method used to group the most similar observations of a dataset within the same cluster. To guarantee high efficiency, the clustering process should ensure high accuracy and low complexity. Many clustering methods have been developed across fields, depending on the application and the type of data considered. Categorical clustering segments datasets in which the attributes are categorical and has been widely used in many real-world applications; several families of methods have been developed for it, including hard, fuzzy, and rough set-based methods. In this survey, more than 30 categorical clustering algorithms are investigated. They are grouped into hierarchical and partitional methods and compared in terms of accuracy, precision, and recall to identify the most prominent ones. Experimental results show that rough set-based clustering methods are more efficient than hard and fuzzy methods; methods based on careful initialization of the centroids also perform well.
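Since the survey ranks algorithms by precision and recall, it may help to recall how such external measures apply to clusterings, which have no one-to-one label correspondence: one common approach counts pairs of objects. A minimal sketch with made-up labels (not data from the survey):

```python
from itertools import combinations

def pairwise_precision_recall(pred, truth):
    """Pairwise external evaluation of a clustering: a pair of points
    is a true positive when the clustering and the ground truth both
    place the two points together."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(pred)), 2):
        same_pred = pred[i] == pred[j]
        same_truth = truth[i] == truth[j]
        if same_pred and same_truth:
            tp += 1
        elif same_pred:
            fp += 1
        elif same_truth:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical labels: predicted clusters vs. true classes.
pred  = [0, 0, 0, 1, 1, 1]
truth = [0, 0, 1, 1, 1, 1]
p, r = pairwise_precision_recall(pred, truth)
```

Other external measures used in such comparisons (accuracy via a best cluster-to-class matching, for instance) follow the same idea of scoring against ground-truth labels.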

Author(s):  
B.K. Tripathy ◽  
Adhir Ghosh

The development of data clustering algorithms has been pursued by researchers since the introduction of the k-means algorithm (MacQueen 1967; Lloyd 1982). These algorithms were subsequently modified to handle categorical data. To handle situations where objects can have memberships in multiple clusters, fuzzy clustering and rough clustering methods were introduced (Lingras et al 2003, 2004a), and there are many extensions of these initial algorithms (Lingras et al 2004b; Lingras 2007; Mitra 2004; Peters 2006, 2007). The MMR algorithm (Parmar et al 2007), its extensions (Tripathy et al 2009, 2011a, 2011b), and the MADE algorithm (Herawan et al 2010) use rough set techniques for clustering. In this chapter, the authors focus on rough set based clustering algorithms and provide a comparative study of the fuzzy set based and rough set based clustering algorithms in terms of their efficiency. They also present problems for future study in the direction of the topics covered.
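The MMR family mentioned above selects, at each step, the attribute whose equivalence classes are "crispest" in the rough-set sense. As a minimal sketch of the underlying quantity (the toy data, attribute names, and the averaging detail are illustrative assumptions, not taken from the chapter), the mean roughness of one categorical attribute with respect to another can be computed as:

```python
from collections import defaultdict

def partition(rows, attr):
    """Equivalence classes induced by one categorical attribute."""
    blocks = defaultdict(set)
    for idx, row in enumerate(rows):
        blocks[row[attr]].add(idx)
    return list(blocks.values())

def roughness(rows, target_attr, wrt_attr):
    """Mean rough-set roughness of target_attr w.r.t. wrt_attr.

    For each value set X of target_attr, the lower approximation under
    wrt_attr's partition is the union of blocks fully inside X, the
    upper approximation the union of blocks intersecting X; roughness
    is 1 - |lower| / |upper|, averaged over the value sets."""
    blocks = partition(rows, wrt_attr)
    scores = []
    for X in partition(rows, target_attr):
        lower = sum(len(b) for b in blocks if b <= X)
        upper = sum(len(b) for b in blocks if b & X)
        scores.append(1.0 - lower / upper)
    return sum(scores) / len(scores)

# Illustrative categorical table.
rows = [
    {"color": "red",  "size": "S"},
    {"color": "red",  "size": "S"},
    {"color": "blue", "size": "S"},
    {"color": "blue", "size": "L"},
]
r_color = roughness(rows, "color", "size")
```

MMR and its variants choose the splitting attribute by minimizing such roughness values over the remaining attributes (min-min in MMR, mean in MMeR); lower roughness means the attribute's classes are better approximated, hence a crisper split.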


2019 ◽  
Vol 8 (4) ◽  
pp. 84-100
Author(s):  
Akarsh Goyal ◽  
Rahul Chowdhury

In recent times, innumerable clustering algorithms have been developed whose main function is to group objects with similar features. The presence of categorical values poses a challenge for these algorithms, and some algorithms that can handle categorical data cannot process uncertainty in the values and therefore have stability issues. Handling categorical data together with uncertainty has thus become necessary. The MMR algorithm, developed in 2007, was based on basic rough set theory; MMeR, proposed in 2009, surpassed MMR in handling categorical data but cannot be used robustly for hybrid data. In this article, the authors generalize the MMeR algorithm with neighborhood relations, obtaining a neighborhood rough set model that the article calls MMeNR (Min Mean Neighborhood Roughness), which handles heterogeneous data. The authors also extend MMeNR to applications such as geospatial data analysis and epidemiology.
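Neighborhood relations let rough-set notions apply to hybrid data by replacing strict equivalence with closeness. One simple way to define such a relation (a sketch under assumptions; MMeNR's exact relation and thresholds may differ) is: two objects are neighbors when they are within a threshold on every numeric attribute and equal on every categorical attribute.

```python
def neighborhood(rows, idx, num_attrs, cat_attrs, delta):
    """Indices of objects in the neighborhood of rows[idx]: within
    delta on each numeric attribute and equal on each categorical
    attribute. Illustrative definition for hybrid data."""
    x = rows[idx]
    out = set()
    for j, y in enumerate(rows):
        if all(abs(x[a] - y[a]) <= delta for a in num_attrs) and \
           all(x[a] == y[a] for a in cat_attrs):
            out.add(j)
    return out

# Hypothetical hybrid records (numeric age, categorical city).
recs = [
    {"age": 30, "city": "A"},
    {"age": 32, "city": "A"},
    {"age": 30, "city": "B"},
    {"age": 45, "city": "A"},
]
ns = neighborhood(recs, 0, ["age"], ["city"], 5)
```

Lower and upper approximations, and hence roughness, can then be built from these neighborhoods instead of equivalence classes.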


2013 ◽  
Vol 2013 ◽  
pp. 1-9 ◽  
Author(s):  
Ali Seman ◽  
Zainab Abu Bakar ◽  
Mohamed Nizam Isa

The Y-chromosome short tandem repeat (Y-STR) data presented here were collected mainly for benchmarking the performance of clustering methods. There are six Y-STR dataset items, divided into two categories: Y-STR surname data and Y-haplogroup data. Y-STR data are categorical, unique, and different from other categorical data: they contain many similar and almost-similar objects. This characteristic causes problems for existing clustering algorithms when clustering them.


Author(s):  
B. K. Tripathy

Publication of data owned by various organizations for scientific research carries the danger of sensitive information about respondents being disclosed. Removing or encrypting identifiers cannot prevent the leakage of information through quasi-identifiers, so several anonymization techniques such as k-anonymity, l-diversity, and t-closeness have been proposed. However, these algorithms cannot handle uncertainty in data. One solution is to develop anonymization algorithms that use rough set based clustering algorithms such as MMR, MMeR, SDR, SSDR, and MADE at the clustering stage of existing algorithms; some of these handle both numerical and categorical data. In this chapter, the author addresses the database anonymization problem and briefly discusses k-anonymization methods. The primary focus is on algorithms dealing with l-diversity of databases having single or multiple sensitive attributes. The author also proposes algorithms to deal with the anonymization of databases involving uncertainty, and aims to draw researchers' attention to the various open problems in this direction.
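As background for the k-anonymity and l-diversity notions discussed, the following sketch checks the two properties on a toy generalized table (the attribute names and rows are illustrative, not from the chapter):

```python
from collections import Counter

def is_k_anonymous(table, quasi_ids, k):
    """True when every combination of quasi-identifier values
    occurs in at least k rows."""
    counts = Counter(tuple(r[q] for q in quasi_ids) for r in table)
    return all(c >= k for c in counts.values())

def is_l_diverse(table, quasi_ids, sensitive, l):
    """True when every quasi-identifier group contains at least
    l distinct values of the sensitive attribute."""
    groups = {}
    for r in table:
        key = tuple(r[q] for q in quasi_ids)
        groups.setdefault(key, set()).add(r[sensitive])
    return all(len(v) >= l for v in groups.values())

# Toy generalized table: zip and age are quasi-identifiers.
table = [
    {"zip": "130**", "age": "<40",  "disease": "flu"},
    {"zip": "130**", "age": "<40",  "disease": "cold"},
    {"zip": "148**", "age": ">=40", "disease": "flu"},
    {"zip": "148**", "age": ">=40", "disease": "flu"},
]
```

This table is 2-anonymous but not 2-diverse: the second group reveals that everyone in it has the flu, which is exactly the attack l-diversity guards against.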


Author(s):  
Naohiko Kinoshita ◽  
Yasunori Endo

Clustering is one of the most popular unsupervised classification methods. In this paper, we focus on rough clustering methods based on rough-set representation. Rough k-means (RKM) is one of the rough clustering methods proposed by Lingras et al. The outputs of many clustering algorithms, including RKM, depend strongly on initial values, so we must evaluate the validity of the outputs. For objective-based clustering algorithms, the objective function serves as the measure; it is difficult, however, to evaluate the output of RKM, which is not objective-based. To solve this problem, we propose new objective-based rough clustering algorithms and verify their usefulness through numerical examples.
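RKM assigns each object to the lower approximation of its nearest cluster when that cluster is clearly nearest, and to the upper approximations of all nearly equidistant clusters otherwise. A minimal 1-D sketch of the assignment step (the absolute-difference closeness test and the parameter values are simplifying assumptions; the published formulation typically uses a distance-ratio threshold):

```python
def rkm_assign(points, centers, eps):
    """One rough k-means assignment step: point i goes to the lower
    (and upper) approximation of its nearest center, unless another
    center is within eps of the nearest distance, in which case i
    goes only to the upper approximations of all close centers."""
    lower = [set() for _ in centers]
    upper = [set() for _ in centers]
    for i, x in enumerate(points):
        d = [abs(x - c) for c in centers]
        nearest = min(range(len(centers)), key=lambda j: d[j])
        close = [j for j in range(len(centers))
                 if j != nearest and d[j] - d[nearest] <= eps]
        if close:
            for j in [nearest] + close:
                upper[j].add(i)        # boundary object: upper only
        else:
            lower[nearest].add(i)      # clearly nearest: lower + upper
            upper[nearest].add(i)
    return lower, upper

# Illustrative 1-D data: the point at 3.0 is equidistant and lands
# in both upper approximations.
lower, upper = rkm_assign([1.0, 1.2, 5.0, 5.1, 3.0], [1.0, 5.0], 0.5)
```

Centroids are then recomputed as a weighted combination of lower-approximation and boundary members, which is where RKM's dependence on initial values and the lack of a single objective function come from.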


2015 ◽  
Vol 2015 ◽  
pp. 1-9 ◽  
Author(s):  
Mansooreh Mirzaie ◽  
Ahmad Barani ◽  
Naser Nematbakkhsh ◽  
Majid Mohammad-Beigi

Although most research on density-based clustering algorithms has focused on finding distinct clusters, many real-world applications (such as gene functions in a gene regulatory network) have inherently overlapping clusters. Even with overlapping features, density-based clustering methods do not define a probabilistic model of the data, so it is hard to determine how good the clustering is, how well it predicts, and how well new data fit into existing clusters. A probabilistic model for overlapping density-based clustering is therefore a critical need in large-scale data analysis. In this paper, a new Bayesian density-based method (Bayesian-OverDBC) for modeling overlapping clusters is presented. Bayesian-OverDBC can predict the formation of a new cluster as well as the overlap of a cluster with existing clusters. It has been compared with other algorithms (both non-overlapping and overlapping models), and the results show that it can perform significantly better than other methods in analyzing microarray data.
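Bayesian-OverDBC itself is not reproduced here; as background, the kind of classical density-based procedure such methods build on (DBSCAN-style) grows clusters from core points whose neighborhoods are dense. A minimal 1-D sketch:

```python
def dbscan(points, eps, min_pts):
    """Minimal 1-D DBSCAN: returns a cluster id per point, -1 for
    noise. A core point has at least min_pts points (itself included)
    within eps; clusters expand only through core points."""
    n = len(points)
    labels = [None] * n

    def neighbors(i):
        return [j for j in range(n) if abs(points[i] - points[j]) <= eps]

    cid = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = -1             # provisionally noise
            continue
        labels[i] = cid
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cid        # former noise becomes a border point
                continue
            if labels[j] is not None:
                continue
            labels[j] = cid
            jn = neighbors(j)
            if len(jn) >= min_pts:     # j is also core: keep expanding
                queue.extend(jn)
        cid += 1
    return labels

labels = dbscan([1.0, 1.1, 1.2, 5.0, 5.1, 5.2, 9.0], 0.3, 2)
```

Note that each point receives exactly one label here; permitting points to belong to several clusters, and attaching a probability model to that, is precisely the gap the paper addresses.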


2021 ◽  
Vol 20 ◽  
pp. 177-184
Author(s):  
Ozer Ozdemir ◽  
Simgenur Cerman

In data mining, one of the most commonly used techniques is clustering. Clustering can be done with different algorithms, such as hierarchical, partitioning, grid-based, density-based, and graph-based algorithms. This study first explains the concept of data mining, the aims of using it, and its areas of application, and then presents the clustering algorithms used in data mining theoretically. Finally, within the scope of this study, the "Mall Customers" dataset taken from the Kaggle database is separated into clusters according to customer features using partitional and hierarchical clustering algorithms. In the clusters obtained by partitional clustering algorithms, the similarity within a cluster is maximal and the similarity between clusters is minimal; hierarchical clustering algorithms are based on successively merging similar objects, or the reverse. The partitional clustering algorithms used are k-means and PAM; the hierarchical clustering algorithms used are AGNES and DIANA. The R statistical programming language was used to apply the algorithms. At the end of the study, the dataset was run through the clustering algorithms and the resulting analyses were interpreted.
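The study used R's implementations of k-means, PAM, AGNES, and DIANA; purely to illustrate the partitional principle described above (maximal within-cluster similarity via alternating assignment and center updates), here is a minimal 1-D k-means sketch in Python:

```python
def kmeans_1d(points, centers, iters=10):
    """Minimal 1-D k-means: repeatedly assign each point to its
    nearest center, then recompute each center as the mean of its
    cluster. An illustrative sketch, not the R implementation."""
    clusters = [[] for _ in centers]
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for x in points:
            k = min(range(len(centers)), key=lambda j: abs(x - centers[j]))
            clusters[k].append(x)
        # Keep an empty cluster's old center unchanged.
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    return centers, clusters

centers, clusters = kmeans_1d([1.0, 2.0, 9.0, 10.0], [0.0, 5.0])
```

PAM follows the same alternation but restricts centers to actual data points (medoids), which makes it less sensitive to outliers.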


Author(s):  
XICHEN SUN ◽  
QIANSHENG CHENG ◽  
JUFU FENG

A unified probabilistic framework (UPF) for partitional clustering algorithms is proposed based on penalized maximum likelihood. Besides Gaussian mixture model methods, many popular clustering methods, such as the Fuzzy c-Means Algorithm (FCM), Attribute Means Clustering (AMC), General c-Means Clustering (GCM), and Deterministic Annealing (DA) Clustering, can be explained as special cases within the UPF. Furthermore, the framework provides a general approach to designing comparatively stable and effectively regularized clustering algorithms.
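The paper's exact objective is not reproduced here; a generic penalized maximum-likelihood objective of the kind such a framework builds on (notation assumed, not taken from the abstract) has the form

    J(\Theta) = -\sum_{i=1}^{n} \log \sum_{k=1}^{c} \pi_k \, p(x_i \mid \theta_k) + \lambda \, P(\Theta),

where the first term is the negative log-likelihood of a c-component mixture and P(\Theta) is a penalty term; different penalty choices (for example, an entropy-style penalty on the memberships, as in DA clustering) recover different partitional algorithms as special cases.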


Processes ◽  
2020 ◽  
Vol 8 (10) ◽  
pp. 1326
Author(s):  
Zhenni Jiang ◽  
Xiyu Liu

In this paper, a data clustering method named consensus fuzzy k-modes clustering is proposed to improve clustering performance for categorical data. At the same time, a coupling DNA-chain-hypergraph P system is constructed to realize the clustering process. This P system can prevent the clustering algorithm from falling into a local optimum and realizes the clustering process with implicit parallelism. The consensus fuzzy k-modes algorithm combines the advantages of the fuzzy k-modes, weighted fuzzy k-modes, and genetic fuzzy k-modes algorithms. The fuzzy k-modes algorithm realizes a soft partition that is closer to reality, but treats all variables equally. The weighted fuzzy k-modes algorithm introduces a weight vector that strengthens basic k-modes clustering by associating higher weights with features useful in the analysis. These two methods only improve the k-modes algorithm itself, so the genetic k-modes algorithm was proposed, which uses genetic operations in the clustering process. In this paper, we examine these three kinds of k-modes algorithms and further introduce DNA genetic optimization operations in the final consensus step. Finally, we conduct experiments on seven UCI datasets and compare the clustering results with four other categorical clustering algorithms. The experimental and statistical test results show that our method obtains better clustering results than the compared algorithms.
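The fuzzy k-modes variants referenced above share a simple-matching dissimilarity between categorical objects and cluster modes, and a fuzzy membership update. The sketch below follows the standard fuzzy k-modes membership form (the fuzzifier m and the toy data are assumptions; the paper's weighting, genetic, consensus, and P-system machinery is not reproduced):

```python
def mismatch(x, z):
    """Simple-matching dissimilarity: number of attributes that differ."""
    return sum(a != b for a, b in zip(x, z))

def memberships(points, modes, m=2.0):
    """Standard fuzzy k-modes membership update:
    u_ik = 1 / sum_j (d_ik / d_jk)^(1/(m-1)),
    with crisp assignment when an object coincides with a mode."""
    U = []
    for x in points:
        d = [mismatch(x, z) for z in modes]
        row = []
        for k in range(len(modes)):
            if d[k] == 0:
                # Zero distance: all membership goes to this mode.
                row = [1.0 if j == k else 0.0 for j in range(len(modes))]
                break
            row.append(1.0 / sum((d[k] / d[j]) ** (1.0 / (m - 1))
                                 for j in range(len(modes)) if d[j] > 0))
        U.append(row)
    return U

# Toy categorical objects and two modes.
pts = [("a", "x"), ("a", "y"), ("b", "y")]
modes = [("a", "x"), ("b", "y")]
U = memberships(pts, modes)
```

Each mode is then updated per attribute to the value with the largest membership-weighted frequency, and the two steps alternate until the memberships stabilize.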


2001 ◽  
Vol 9 (4) ◽  
pp. 595-607 ◽  
Author(s):  
R. Krishnapuram ◽  
A. Joshi ◽  
O. Nasraoui ◽  
L. Yi
