Clustering Categorical Data: A Survey

Author(s):  
Sami Naouali ◽  
Semeh Ben Salem ◽  
Zied Chtourou

Clustering is an unsupervised method used to group the most similar observations of a dataset within the same cluster. To guarantee high efficiency, the clustering process should ensure high accuracy and low complexity. Many clustering methods have been developed across fields, depending on the application and the type of data considered. Categorical clustering segments datasets in which the attributes are categorical and has been widely used in many real-world applications; several families of methods have been developed for it, including hard, fuzzy, and rough set-based methods. In this survey, more than 30 categorical clustering algorithms are investigated. They are grouped into hierarchical and partitional methods and compared in terms of accuracy, precision, and recall to identify the most prominent ones. Experimental results show that rough set-based clustering methods are more efficient than hard and fuzzy methods; methods based on careful initialization of the centroids also perform well.
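Since the survey ranks algorithms by precision and recall, it may help to recall how such external measures apply to clusterings, which have no one-to-one label correspondence: one common approach counts pairs of objects. A minimal sketch with made-up labels (not data from the survey):

```python
from itertools import combinations

def pairwise_precision_recall(pred, truth):
    """Pairwise external evaluation of a clustering: a pair of points
    is a true positive when the clustering and the ground truth both
    place the two points together."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(pred)), 2):
        same_pred = pred[i] == pred[j]
        same_truth = truth[i] == truth[j]
        if same_pred and same_truth:
            tp += 1
        elif same_pred:
            fp += 1
        elif same_truth:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical labels: predicted clusters vs. true classes.
pred  = [0, 0, 0, 1, 1, 1]
truth = [0, 0, 1, 1, 1, 1]
p, r = pairwise_precision_recall(pred, truth)
```

Other external measures used in such comparisons (accuracy via a best cluster-to-class matching, for instance) follow the same idea of scoring against ground-truth labels.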

Author(s):  
B.K. Tripathy ◽  
Adhir Ghosh

The development of data clustering algorithms has been pursued by researchers since the introduction of the k-means algorithm (MacQueen 1967; Lloyd 1982). These algorithms were subsequently modified to handle categorical data. To handle situations where objects can have memberships in multiple clusters, fuzzy clustering and rough clustering methods were introduced (Lingras et al 2003, 2004a), and there are many extensions of these initial algorithms (Lingras et al 2004b; Lingras 2007; Mitra 2004; Peters 2006, 2007). The MMR algorithm (Parmar et al 2007), its extensions (Tripathy et al 2009, 2011a, 2011b), and the MADE algorithm (Herawan et al 2010) use rough set techniques for clustering. In this chapter, the authors focus on rough set based clustering algorithms and provide a comparative study of the fuzzy set based and rough set based clustering algorithms in terms of their efficiency. They also present problems for future study in the direction of the topics covered.
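The MMR family mentioned above selects, at each step, the attribute whose equivalence classes are "crispest" in the rough-set sense. As a minimal sketch of the underlying quantity (the toy data, attribute names, and the averaging detail are illustrative assumptions, not taken from the chapter), the mean roughness of one categorical attribute with respect to another can be computed as:

```python
from collections import defaultdict

def partition(rows, attr):
    """Equivalence classes induced by one categorical attribute."""
    blocks = defaultdict(set)
    for idx, row in enumerate(rows):
        blocks[row[attr]].add(idx)
    return list(blocks.values())

def roughness(rows, target_attr, wrt_attr):
    """Mean rough-set roughness of target_attr w.r.t. wrt_attr.

    For each value set X of target_attr, the lower approximation under
    wrt_attr's partition is the union of blocks fully inside X, the
    upper approximation the union of blocks intersecting X; roughness
    is 1 - |lower| / |upper|, averaged over the value sets."""
    blocks = partition(rows, wrt_attr)
    scores = []
    for X in partition(rows, target_attr):
        lower = sum(len(b) for b in blocks if b <= X)
        upper = sum(len(b) for b in blocks if b & X)
        scores.append(1.0 - lower / upper)
    return sum(scores) / len(scores)

# Illustrative categorical table.
rows = [
    {"color": "red",  "size": "S"},
    {"color": "red",  "size": "S"},
    {"color": "blue", "size": "S"},
    {"color": "blue", "size": "L"},
]
r_color = roughness(rows, "color", "size")
```

MMR and its variants choose the splitting attribute by minimizing such roughness values over the remaining attributes (min-min in MMR, mean in MMeR); lower roughness means the attribute's classes are better approximated, hence a crisper split.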


2019 ◽  
Vol 8 (4) ◽  
pp. 84-100
Author(s):  
Akarsh Goyal ◽  
Rahul Chowdhury

In recent times, innumerable clustering algorithms have been developed whose main function is to group objects with similar features. The presence of categorical values poses a challenge for these algorithms, and some algorithms that can handle categorical data cannot process uncertainty in the values and therefore have stability issues. Handling categorical data together with uncertainty has thus become necessary. The MMR algorithm, developed in 2007, was based on basic rough set theory; MMeR, proposed in 2009, surpassed MMR in handling categorical data but cannot be used robustly for hybrid data. In this article, the authors generalize the MMeR algorithm with neighborhood relations, obtaining a neighborhood rough set model that the article calls MMeNR (Min Mean Neighborhood Roughness), which handles heterogeneous data. The authors also extend MMeNR to applications such as geospatial data analysis and epidemiology.
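Neighborhood relations let rough-set notions apply to hybrid data by replacing strict equivalence with closeness. One simple way to define such a relation (a sketch under assumptions; MMeNR's exact relation and thresholds may differ) is: two objects are neighbors when they are within a threshold on every numeric attribute and equal on every categorical attribute.

```python
def neighborhood(rows, idx, num_attrs, cat_attrs, delta):
    """Indices of objects in the neighborhood of rows[idx]: within
    delta on each numeric attribute and equal on each categorical
    attribute. Illustrative definition for hybrid data."""
    x = rows[idx]
    out = set()
    for j, y in enumerate(rows):
        if all(abs(x[a] - y[a]) <= delta for a in num_attrs) and \
           all(x[a] == y[a] for a in cat_attrs):
            out.add(j)
    return out

# Hypothetical hybrid records (numeric age, categorical city).
recs = [
    {"age": 30, "city": "A"},
    {"age": 32, "city": "A"},
    {"age": 30, "city": "B"},
    {"age": 45, "city": "A"},
]
ns = neighborhood(recs, 0, ["age"], ["city"], 5)
```

Lower and upper approximations, and hence roughness, can then be built from these neighborhoods instead of equivalence classes.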


2013 ◽  
Vol 2013 ◽  
pp. 1-9 ◽  
Author(s):  
Ali Seman ◽  
Zainab Abu Bakar ◽  
Mohamed Nizam Isa

The Y-chromosome short tandem repeat (Y-STR) data presented here were collected mainly for benchmarking the performance of clustering methods. There are six Y-STR dataset items, divided into two categories: Y-STR surname data and Y-haplogroup data. Y-STR data are categorical, unique, and different from other categorical data: they contain many similar and almost-similar objects. This characteristic causes problems for existing clustering algorithms when clustering them.


Author(s):  
B. K. Tripathy

Publication of data owned by various organizations for scientific research carries the danger of sensitive information about respondents being disclosed. Removing or encrypting identifiers cannot prevent the leakage of information through quasi-identifiers, so several anonymization techniques such as k-anonymity, l-diversity, and t-closeness have been proposed. However, these algorithms cannot handle uncertainty in data. One solution is to develop anonymization algorithms that use rough set based clustering algorithms such as MMR, MMeR, SDR, SSDR, and MADE at the clustering stage of existing algorithms; some of these handle both numerical and categorical data. In this chapter, the author addresses the database anonymization problem and briefly discusses k-anonymization methods. The primary focus is on algorithms dealing with l-diversity of databases having single or multiple sensitive attributes. The author also proposes algorithms to deal with the anonymization of databases involving uncertainty, and aims to draw researchers' attention to the various open problems in this direction.
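As background for the k-anonymity and l-diversity notions discussed, the following sketch checks the two properties on a toy generalized table (the attribute names and rows are illustrative, not from the chapter):

```python
from collections import Counter

def is_k_anonymous(table, quasi_ids, k):
    """True when every combination of quasi-identifier values
    occurs in at least k rows."""
    counts = Counter(tuple(r[q] for q in quasi_ids) for r in table)
    return all(c >= k for c in counts.values())

def is_l_diverse(table, quasi_ids, sensitive, l):
    """True when every quasi-identifier group contains at least
    l distinct values of the sensitive attribute."""
    groups = {}
    for r in table:
        key = tuple(r[q] for q in quasi_ids)
        groups.setdefault(key, set()).add(r[sensitive])
    return all(len(v) >= l for v in groups.values())

# Toy generalized table: zip and age are quasi-identifiers.
table = [
    {"zip": "130**", "age": "<40",  "disease": "flu"},
    {"zip": "130**", "age": "<40",  "disease": "cold"},
    {"zip": "148**", "age": ">=40", "disease": "flu"},
    {"zip": "148**", "age": ">=40", "disease": "flu"},
]
```

This table is 2-anonymous but not 2-diverse: the second group reveals that everyone in it has the flu, which is exactly the attack l-diversity guards against.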


Author(s):  
Naohiko Kinoshita ◽  
Yasunori Endo

Clustering is one of the most popular unsupervised classification methods. In this paper, we focus on rough clustering methods based on rough-set representation. Rough k-means (RKM) is one of the rough clustering methods proposed by Lingras et al. The outputs of many clustering algorithms, including RKM, depend strongly on initial values, so we must evaluate the validity of the outputs. For objective-based clustering algorithms, the objective function serves as the measure; it is difficult, however, to evaluate the output of RKM, which is not objective-based. To solve this problem, we propose new objective-based rough clustering algorithms and verify their usefulness through numerical examples.
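RKM assigns each object to the lower approximation of its nearest cluster when that cluster is clearly nearest, and to the upper approximations of all nearly equidistant clusters otherwise. A minimal 1-D sketch of the assignment step (the absolute-difference closeness test and the parameter values are simplifying assumptions; the published formulation typically uses a distance-ratio threshold):

```python
def rkm_assign(points, centers, eps):
    """One rough k-means assignment step: point i goes to the lower
    (and upper) approximation of its nearest center, unless another
    center is within eps of the nearest distance, in which case i
    goes only to the upper approximations of all close centers."""
    lower = [set() for _ in centers]
    upper = [set() for _ in centers]
    for i, x in enumerate(points):
        d = [abs(x - c) for c in centers]
        nearest = min(range(len(centers)), key=lambda j: d[j])
        close = [j for j in range(len(centers))
                 if j != nearest and d[j] - d[nearest] <= eps]
        if close:
            for j in [nearest] + close:
                upper[j].add(i)        # boundary object: upper only
        else:
            lower[nearest].add(i)      # clearly nearest: lower + upper
            upper[nearest].add(i)
    return lower, upper

# Illustrative 1-D data: the point at 3.0 is equidistant and lands
# in both upper approximations.
lower, upper = rkm_assign([1.0, 1.2, 5.0, 5.1, 3.0], [1.0, 5.0], 0.5)
```

Centroids are then recomputed as a weighted combination of lower-approximation and boundary members, which is where RKM's dependence on initial values and the lack of a single objective function come from.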


2015 ◽  
Vol 2015 ◽  
pp. 1-9 ◽  
Author(s):  
Mansooreh Mirzaie ◽  
Ahmad Barani ◽  
Naser Nematbakkhsh ◽  
Majid Mohammad-Beigi

Although most research on density-based clustering algorithms has focused on finding distinct clusters, many real-world applications (such as gene functions in a gene regulatory network) have inherently overlapping clusters. Even with overlapping features, density-based clustering methods do not define a probabilistic model of the data, so it is hard to determine how good the clustering is, how well it predicts, and how well new data fit into existing clusters. A probabilistic model for overlapping density-based clustering is therefore a critical need in large-scale data analysis. In this paper, a new Bayesian density-based method (Bayesian-OverDBC) for modeling overlapping clusters is presented. Bayesian-OverDBC can predict the formation of a new cluster as well as the overlap of a cluster with existing clusters. It has been compared with other algorithms (both non-overlapping and overlapping models), and the results show that it can perform significantly better than other methods in analyzing microarray data.
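Bayesian-OverDBC itself is not reproduced here; as background, the kind of classical density-based procedure such methods build on (DBSCAN-style) grows clusters from core points whose neighborhoods are dense. A minimal 1-D sketch:

```python
def dbscan(points, eps, min_pts):
    """Minimal 1-D DBSCAN: returns a cluster id per point, -1 for
    noise. A core point has at least min_pts points (itself included)
    within eps; clusters expand only through core points."""
    n = len(points)
    labels = [None] * n

    def neighbors(i):
        return [j for j in range(n) if abs(points[i] - points[j]) <= eps]

    cid = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = -1             # provisionally noise
            continue
        labels[i] = cid
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cid        # former noise becomes a border point
                continue
            if labels[j] is not None:
                continue
            labels[j] = cid
            jn = neighbors(j)
            if len(jn) >= min_pts:     # j is also core: keep expanding
                queue.extend(jn)
        cid += 1
    return labels

labels = dbscan([1.0, 1.1, 1.2, 5.0, 5.1, 5.2, 9.0], 0.3, 2)
```

Note that each point receives exactly one label here; permitting points to belong to several clusters, and attaching a probability model to that, is precisely the gap the paper addresses.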


2021 ◽  
Vol 20 ◽  
pp. 177-184
Author(s):  
Ozer Ozdemir ◽  
Simgenur Cerman

In data mining, one of the most commonly used techniques is clustering. Clustering can be done with different algorithms, such as hierarchical, partitioning, grid-based, density-based, and graph-based algorithms. This study first explains the concept of data mining, the aims of using it, and its areas of application, and then presents the clustering algorithms used in data mining theoretically. Finally, within the scope of this study, the "Mall Customers" dataset taken from the Kaggle database is separated into clusters according to customer features using partitional and hierarchical clustering algorithms. In the clusters obtained by partitional clustering algorithms, the similarity within a cluster is maximal and the similarity between clusters is minimal; hierarchical clustering algorithms are based on successively merging similar objects, or the reverse. The partitional clustering algorithms used are k-means and PAM; the hierarchical clustering algorithms used are AGNES and DIANA. The R statistical programming language was used to apply the algorithms. At the end of the study, the dataset was run through the clustering algorithms and the resulting analyses were interpreted.
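The study used R's implementations of k-means, PAM, AGNES, and DIANA; purely to illustrate the partitional principle described above (maximal within-cluster similarity via alternating assignment and center updates), here is a minimal 1-D k-means sketch in Python:

```python
def kmeans_1d(points, centers, iters=10):
    """Minimal 1-D k-means: repeatedly assign each point to its
    nearest center, then recompute each center as the mean of its
    cluster. An illustrative sketch, not the R implementation."""
    clusters = [[] for _ in centers]
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for x in points:
            k = min(range(len(centers)), key=lambda j: abs(x - centers[j]))
            clusters[k].append(x)
        # Keep an empty cluster's old center unchanged.
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    return centers, clusters

centers, clusters = kmeans_1d([1.0, 2.0, 9.0, 10.0], [0.0, 5.0])
```

PAM follows the same alternation but restricts centers to actual data points (medoids), which makes it less sensitive to outliers.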


Author(s):  
XICHEN SUN ◽  
QIANSHENG CHENG ◽  
JUFU FENG

A unified probabilistic framework (UPF) for partitional clustering algorithms is proposed based on penalized maximum likelihood. Besides Gaussian mixture model methods, many popular clustering methods, such as the Fuzzy c-Means Algorithm (FCM), Attribute Means Clustering (AMC), General c-Means Clustering (GCM), and Deterministic Annealing (DA) Clustering, can be explained as special cases within the UPF. Furthermore, the framework provides a general approach to designing comparatively stable and effectively regularized clustering algorithms.
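The paper's exact objective is not reproduced here; a generic penalized maximum-likelihood objective of the kind such a framework builds on (notation assumed, not taken from the abstract) has the form

    J(\Theta) = -\sum_{i=1}^{n} \log \sum_{k=1}^{c} \pi_k \, p(x_i \mid \theta_k) + \lambda \, P(\Theta),

where the first term is the negative log-likelihood of a c-component mixture and P(\Theta) is a penalty term; different penalty choices (for example, an entropy-style penalty on the memberships, as in DA clustering) recover different partitional algorithms as special cases.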


Processes ◽  
2020 ◽  
Vol 8 (10) ◽  
pp. 1326
Author(s):  
Zhenni Jiang ◽  
Xiyu Liu

In this paper, a data clustering method named consensus fuzzy k-modes clustering is proposed to improve clustering performance for categorical data. At the same time, a coupling DNA-chain-hypergraph P system is constructed to realize the clustering process. This P system can prevent the clustering algorithm from falling into a local optimum and realizes the clustering process with implicit parallelism. The consensus fuzzy k-modes algorithm combines the advantages of the fuzzy k-modes, weighted fuzzy k-modes, and genetic fuzzy k-modes algorithms. The fuzzy k-modes algorithm realizes a soft partition that is closer to reality, but treats all variables equally. The weighted fuzzy k-modes algorithm introduces a weight vector that strengthens basic k-modes clustering by associating higher weights with features useful in the analysis. These two methods only improve the k-modes algorithm itself, so the genetic k-modes algorithm was proposed, which uses genetic operations in the clustering process. In this paper, we examine these three kinds of k-modes algorithms and further introduce DNA genetic optimization operations in the final consensus step. Finally, we conduct experiments on seven UCI datasets and compare the clustering results with four other categorical clustering algorithms. The experimental and statistical test results show that our method obtains better clustering results than the compared algorithms.
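The fuzzy k-modes variants referenced above share a simple-matching dissimilarity between categorical objects and cluster modes, and a fuzzy membership update. The sketch below follows the standard fuzzy k-modes membership form (the fuzzifier m and the toy data are assumptions; the paper's weighting, genetic, consensus, and P-system machinery is not reproduced):

```python
def mismatch(x, z):
    """Simple-matching dissimilarity: number of attributes that differ."""
    return sum(a != b for a, b in zip(x, z))

def memberships(points, modes, m=2.0):
    """Standard fuzzy k-modes membership update:
    u_ik = 1 / sum_j (d_ik / d_jk)^(1/(m-1)),
    with crisp assignment when an object coincides with a mode."""
    U = []
    for x in points:
        d = [mismatch(x, z) for z in modes]
        row = []
        for k in range(len(modes)):
            if d[k] == 0:
                # Zero distance: all membership goes to this mode.
                row = [1.0 if j == k else 0.0 for j in range(len(modes))]
                break
            row.append(1.0 / sum((d[k] / d[j]) ** (1.0 / (m - 1))
                                 for j in range(len(modes)) if d[j] > 0))
        U.append(row)
    return U

# Toy categorical objects and two modes.
pts = [("a", "x"), ("a", "y"), ("b", "y")]
modes = [("a", "x"), ("b", "y")]
U = memberships(pts, modes)
```

Each mode is then updated per attribute to the value with the largest membership-weighted frequency, and the two steps alternate until the memberships stabilize.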


2001 ◽  
Vol 9 (4) ◽  
pp. 595-607 ◽  
Author(s):  
R. Krishnapuram ◽  
A. Joshi ◽  
O. Nasraoui ◽  
L. Yi
