A Modified Overlapping Partitioning Clustering Algorithm for Categorical Data Clustering

Clustering is one of the important approaches for Clustering enables the grouping of unlabeled data by partitioning data into clusters with similar patterns. Over the past decades, many clustering algorithms have been developed for various clustering problems. An overlapping partitioning clustering (OPC) algorithm can only handle numerical data. Hence, novel clustering algorithms have been studied extensively to overcome this issue. By increasing the number of objects belonging to one cluster and distance between cluster centers, the study aimed to cluster the textual data type without losing the main functions. The proposed study herein included over twenty newsgroup dataset, which consisted of approximately 20000 textual documents. By introducing some modifications to the traditional algorithm, an acceptable level of homogeneity and completeness of clusters were generated. Modifications were performed on the pre-processing phase and data representation, along with the number methods which influence the primary function of the algorithm. Subsequently, the results were evaluated and compared with the k-means algorithm of the training and test datasets. The results indicated that the modified algorithm could successfully handle the categorical data and produce satisfactory clusters.

Download Full-text

The Performance of BERT as Data Representation of Text Clustering

10.21203/rs.3.rs-940164/v1 ◽

2021 ◽

Author(s):

Alvin Subakti ◽

Hendri Murfi ◽

Nora Hariadi

Keyword(s):

Feature Extraction ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Text Clustering ◽

Data Representation ◽

Text Representation ◽

Inverse Document Frequency ◽

Document Frequency ◽

Normalization Methods ◽

Textual Data

Abstract Text clustering is the task of grouping a set of texts so that text in the same group will be more similar than those from a different group. The process of grouping text manually requires a significant amount of time and labor. Therefore, automation utilizing machine learning is necessary. The standard method used to represent textual data is Term Frequency Inverse Document Frequency (TFIDF). However, TFIDF cannot consider the position and context of a word in a sentence. Bidirectional Encoder Representation from Transformers (BERT) model can produce text representation that incorporates the position and context of a word in a sentence. This research analyzed the performance of the BERT model as data representation for text. Moreover, various feature extraction and normalization methods are also applied for the data representation of the BERT model. To examine the performances of BERT, we use four clustering algorithms, i.e., k-means clustering, eigenspace-based fuzzy c-means, deep embedded clustering, and improved deep embedded clustering. Our simulations show that BERT outperforms the standard TFIDF method in 28 out of 36 metrics. Furthermore, different feature extraction and normalization produced varied performances. The usage of these feature extraction and normalization must be altered depending on the text clustering algorithm used.

Download Full-text

A Similarity Measurement with Entropy-Based Weighting for Clustering Mixed Numerical and Categorical Datasets

Algorithms ◽

10.3390/a14060184 ◽

2021 ◽

Vol 14 (6) ◽

pp. 184

Author(s):

Xia Que ◽

Siyuan Jiang ◽

Jiaoyun Yang ◽

Ning An

Keyword(s):

Categorical Data ◽

Clustering Algorithm ◽

Numerical Data ◽

Similarity Measurement ◽

Amount Of Information ◽

Automatic Categorization ◽

Categorical Attributes ◽

Weighting Strategy

Many mixed datasets with both numerical and categorical attributes have been collected in various fields, including medicine, biology, etc. Designing appropriate similarity measurements plays an important role in clustering these datasets. Many traditional measurements treat various attributes equally when measuring the similarity. However, different attributes may contribute differently as the amount of information they contained could vary a lot. In this paper, we propose a similarity measurement with entropy-based weighting for clustering mixed datasets. The numerical data are first transformed into categorical data by an automatic categorization technique. Then, an entropy-based weighting strategy is applied to denote the different importances of various attributes. We incorporate the proposed measurement into an iterative clustering algorithm, and extensive experiments show that this algorithm outperforms OCIL and K-Prototype methods with 2.13% and 4.28% improvements, respectively, in terms of accuracy on six mixed datasets from UCI.

Download Full-text

A Novel Cosine Similarity Like Data Clustering Method for Effective Data Classification in Data Mining

International Journal of Innovative Technology and Exploring Engineering - Special Issue ◽

10.35940/ijitee.h6417.069820 ◽

2020 ◽

Vol 9 (8) ◽

pp. 340-346

Keyword(s):

Data Mining ◽

Similarity Measure ◽

Categorical Data ◽

Data Clustering ◽

Similarity Measures ◽

Numerical Data ◽

Data Classification ◽

Fundamental Goal ◽

Learning Technique ◽

Categorical Data Clustering

In data mining ample techniques use distance based measures for data clustering. Improving clustering performance is the fundamental goal in cluster domain related tasks. Many techniques are available for clustering numerical data as well as categorical data. Clustering is an unsupervised learning technique and objects are grouped or clustered based on similarity among the objects. A new cluster similarity finding measure, which is cosine like cluster similarity measure (CLCSM), is proposed in this paper. The proposed cluster similarity measure is used for data classification. Extensive experiments are conducted by taking UCI machine learning datasets. The experimental results have shown that the proposed cosinelike cluster similarity measure is superior to many of the existing cluster similarity measures for data classification.

Download Full-text

Improved minimum-minimum roughness algorithm for clustering categorical data

International Journal of ADVANCED AND APPLIED SCIENCES ◽

10.21833/ijaas.2021.10.006 ◽

2021 ◽

Vol 8 (10) ◽

pp. 43-50

Author(s):

Truong et al. ◽

Keyword(s):

Machine Learning ◽

Data Mining ◽

Hierarchical Clustering ◽

Categorical Data ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Experimental Results ◽

Data Sets ◽

Top Down ◽

Hierarchical Clustering Algorithm

Clustering is a fundamental technique in data mining and machine learning. Recently, many researchers are interested in the problem of clustering categorical data and several new approaches have been proposed. One of the successful and pioneering clustering algorithms is the Minimum-Minimum Roughness algorithm (MMR) which is a top-down hierarchical clustering algorithm and can handle the uncertainty in clustering categorical data. However, MMR tends to choose the category with less value leaf node with more objects, leading to undesirable clustering results. To overcome such shortcomings, this paper proposes an improved version of the MMR algorithm for clustering categorical data, called IMMR (Improved Minimum-Minimum Roughness). Experimental results on actual data sets taken from UCI show that the IMMR algorithm outperforms MMR in clustering categorical data.

Download Full-text

Weighted k-Prototypes Clustering Algorithm Based on the Hybrid Dissimilarity Coefficient

Mathematical Problems in Engineering ◽

10.1155/2020/5143797 ◽

2020 ◽

Vol 2020 ◽

pp. 1-13

Author(s):

Ziqi Jia ◽

Ling Song

Keyword(s):

Categorical Data ◽

Clustering Algorithm ◽

Numerical Data ◽

Experimental Results ◽

Cluster Center ◽

Real Dataset ◽

Dissimilarity Coefficient ◽

Initial Cluster ◽

Data Objects ◽

Selection Of

The k-prototypes algorithm is a hybrid clustering algorithm that can process Categorical Data and Numerical Data. In this study, the method of initial Cluster Center selection was improved and a new Hybrid Dissimilarity Coefficient was proposed. Based on the proposed Hybrid Dissimilarity Coefficient, a weighted k-prototype clustering algorithm based on the hybrid dissimilarity coefficient was proposed (WKPCA). The proposed WKPCA algorithm not only improves the selection of initial Cluster Centers, but also puts a new method to calculate the dissimilarity between data objects and Cluster Centers. The real dataset of UCI was used to test the WKPCA algorithm. Experimental results show that WKPCA algorithm is more efficient and robust than other k-prototypes algorithms.

Download Full-text

Understanding and Enhancement of Internal Clustering Validation Indexes for Categorical Data

Algorithms ◽

10.3390/a11110177 ◽

2018 ◽

Vol 11 (11) ◽

pp. 177 ◽

Cited By ~ 2

Author(s):

Xuedong Gao ◽

Minghan Yang

Keyword(s):

Machine Learning ◽

Categorical Data ◽

Data Clustering ◽

Information Gain ◽

Clustering Algorithms ◽

Number Of Clusters ◽

Cluster Compactness ◽

Clustering Validation ◽

Categorical Data Clustering

Clustering is one of the main tasks of machine learning. Internal clustering validation indexes (CVIs) are used to measure the quality of several clustered partitions to determine the local optimal clustering results in an unsupervised manner, and can act as the objective function of clustering algorithms. In this paper, we first studied several well-known internal CVIs for categorical data clustering, and proved the ineffectiveness of evaluating the partitions of different numbers of clusters without any inter-cluster separation measures or assumptions; the accurateness of separation, along with its coordination with the intra-cluster compactness measures, can notably affect performance. Then, aiming to enhance the internal clustering validation measurement, we proposed a new internal CVI—clustering utility based on the averaged information gain of isolating each cluster (CUBAGE)—which measures both the compactness and the separation of the partition. The experimental results supported our findings with regard to the existing internal CVIs, and showed that the proposed CUBAGE outperforms other internal CVIs with or without a pre-known number of clusters.

Download Full-text

A Dynamic Genetic Algorithm for Clustering Problems

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.411-414.1884 ◽

2013 ◽

Vol 411-414 ◽

pp. 1884-1893

Author(s):

Yong Chun Cao ◽

Ya Bin Shao ◽

Shuang Liang Tian ◽

Zheng Qi Cai

Keyword(s):

Genetic Algorithm ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Real Life ◽

Search Space ◽

Adaptive Mutation ◽

Data Sets ◽

Data Set ◽

Local Optima ◽

Clustering Problems

Due to many of the clustering algorithms based on GAs suffer from degeneracy and are easy to fall in local optima, a novel dynamic genetic algorithm for clustering problems (DGA) is proposed. The algorithm adopted the variable length coding to represent individuals and processed the parallel crossover operation in the subpopulation with individuals of the same length, which allows the DGA algorithm clustering to explore the search space more effectively and can automatically obtain the proper number of clusters and the proper partition from a given data set; the algorithm used the dynamic crossover probability and adaptive mutation probability, which prevented the dynamic clustering algorithm from getting stuck at a local optimal solution. The clustering results in the experiments on three artificial data sets and two real-life data sets show that the DGA algorithm derives better performance and higher accuracy on clustering problems.

Download Full-text

EvolveCluster: an evolutionary clustering algorithm for streaming data

Evolving Systems ◽

10.1007/s12530-021-09408-y ◽

2021 ◽

Author(s):

Christian Nordahl ◽

Veselka Boeva ◽

Håkan Grahn ◽

Marie Persson Netz

Keyword(s):

Data Streams ◽

Data Stream ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Streaming Data ◽

Evolutionary Clustering ◽

Stream Clustering ◽

The Past ◽

Data Stream Clustering ◽

Evolving Data

AbstractData has become an integral part of our society in the past years, arriving faster and in larger quantities than before. Traditional clustering algorithms rely on the availability of entire datasets to model them correctly and efficiently. Such requirements are not possible in the data stream clustering scenario, where data arrives and needs to be analyzed continuously. This paper proposes a novel evolutionary clustering algorithm, entitled EvolveCluster, capable of modeling evolving data streams. We compare EvolveCluster against two other evolutionary clustering algorithms, PivotBiCluster and Split-Merge Evolutionary Clustering, by conducting experiments on three different datasets. Furthermore, we perform additional experiments on EvolveCluster to further evaluate its capabilities on clustering evolving data streams. Our results show that EvolveCluster manages to capture evolving data stream behaviors and adapts accordingly.

Download Full-text

Improving categorical data clustering algorithm by weighting uncommon attribute value matches

Computer Science and Information Systems ◽

10.2298/csis0601023h ◽

2006 ◽

Vol 3 (1) ◽

pp. 23-32 ◽

Cited By ~ 2

Author(s):

Zengyou He ◽

Xiaofei Xu ◽

Shenchun Deng

Keyword(s):

Categorical Data ◽

Data Clustering ◽

Clustering Algorithm ◽

Real Life ◽

Experimental Results ◽

Categorical Data Clustering ◽

Modified Algorithm ◽

Attribute Value

This paper presents an improved Squeezer algorithm for categorical data clustering by giving greater weight to uncommon attribute value matches in similarity computations. Experimental results on real life datasets show that, the modified algorithm is superior to the original Squeezer algorithm and other clustering algorithm with respect to clustering accuracy.

Download Full-text

A Novel Approach for Clustering Big Data based on MapReduce

International Journal of Electrical and Computer Engineering (IJECE) ◽

10.11591/ijece.v8i3.pp1711-1719 ◽

2018 ◽

Vol 8 (3) ◽

pp. 1711 ◽

Cited By ~ 1

Author(s):

Gourav Bathla ◽

Himanshu Aggarwal ◽

Rinkle Rani

Keyword(s):

Big Data ◽

Categorical Data ◽

Large Scale ◽

Clustering Algorithms ◽

Numerical Data ◽

Large Data ◽

Data Sets ◽

Single Node ◽

Novel Approach ◽

Network Analytics

Clustering is one of the most important applications of data mining. It has attracted attention of researchers in statistics and machine learning. It is used in many applications like information retrieval, image processing and social network analytics etc. It helps the user to understand the similarity and dissimilarity between objects. Cluster analysis makes the users understand complex and large data sets more clearly. There are different types of clustering algorithms analyzed by various researchers. Kmeans is the most popular partitioning based algorithm as it provides good results because of accurate calculation on numerical data. But Kmeans give good results for numerical data only. Big data is combination of numerical and categorical data. Kprototype algorithm is used to deal with numerical as well as categorical data. Kprototype combines the distance calculated from numeric and categorical data. With the growth of data due to social networking websites, business transactions, scientific calculation etc., there is vast collection of structured, semi-structured and unstructured data. So, there is need of optimization of Kprototype so that these varieties of data can be analyzed efficiently.In this work, Kprototype algorithm is implemented on MapReduce in this paper. Experiments have proved that Kprototype implemented on Mapreduce gives better performance gain on multiple nodes as compared to single node. CPU execution time and speedup are used as evaluation metrics for comparison.Intellegent splitter is proposed in this paper which splits mixed big data into numerical and categorical data. Comparison with traditional algorithms proves that proposed algorithm works better for large scale of data.

Download Full-text