A Fast K-prototypes Algorithm Using Partial Distance Computation

Author(s):  
Byoungwook KIM

The k-means algorithm is one of the most popular and widely used clustering algorithms, but it is limited to numeric data. The k-prototypes algorithm is a well-known algorithm for handling both numeric and categorical data; however, no prior studies have attempted to accelerate it. In this paper, we propose a new fast k-prototypes algorithm that gives the same answer as the original k-prototypes algorithm. The proposed algorithm avoids unnecessary distance computations by means of partial distance computation: it finds the minimum distance between an object and a cluster center without computing the distance over all attributes, which reduces the running time. Partial distance computation exploits the fact that the maximum difference between two categorical attribute values is 1, so if data objects have m categorical attributes, the maximum categorical contribution to the distance between an object and a cluster center is m. Our algorithm first computes distances using only the numeric attributes. If the difference between the smallest and second-smallest numeric-only distances exceeds m, the nearest cluster center can be identified without computing the categorical distances. Experimental results show that the proposed k-prototypes algorithm achieves better computational performance than the original k-prototypes algorithm on our datasets.
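The pruning step described in this abstract can be sketched as follows. All function and variable names here are illustrative (not from the paper), and the categorical weight gamma of k-prototypes is assumed to be 1, matching the abstract's bound of m on the categorical contribution.

```python
import numpy as np

def assign_cluster_partial(x_num, x_cat, centers_num, centers_cat, gamma=1.0):
    """Assign an object to its nearest prototype using partial distance
    computation.  The categorical part of the k-prototypes distance is
    gamma times the number of mismatched categorical attributes, so it
    lies in [0, gamma * m] where m is the number of categorical
    attributes.  If the numeric-only distances to the two closest
    centers already differ by more than gamma * m, the categorical part
    cannot change the winner and can be skipped entirely."""
    m = len(x_cat)
    # Step 1: numeric-only squared Euclidean distances to every center.
    d_num = np.sum((centers_num - x_num) ** 2, axis=1)
    order = np.argsort(d_num)
    best, second = order[0], order[1]
    # Step 2: prune if the gap already exceeds the maximum possible
    # categorical contribution.
    if d_num[second] - d_num[best] > gamma * m:
        return int(best)
    # Step 3: otherwise fall back to the full k-prototypes distance.
    d_cat = gamma * np.sum(centers_cat != x_cat, axis=1)
    return int(np.argmin(d_num + d_cat))
```

When the numeric gap is large the function returns after step 2, saving the m categorical comparisons per center that the original algorithm always performs.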

2020 ◽  
Vol 2020 ◽  
pp. 1-13
Author(s):  
Ziqi Jia ◽  
Ling Song

The k-prototypes algorithm is a hybrid clustering algorithm that can process both categorical and numerical data. In this study, the method of initial cluster center selection was improved and a new hybrid dissimilarity coefficient was proposed. Based on this coefficient, a weighted k-prototypes clustering algorithm, WKPCA, was developed. The proposed WKPCA algorithm not only improves the selection of initial cluster centers, but also introduces a new method to calculate the dissimilarity between data objects and cluster centers. Real datasets from the UCI repository were used to test the WKPCA algorithm. Experimental results show that WKPCA is more efficient and robust than other k-prototypes algorithms.
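One common way to improve initial cluster center selection, of the kind this abstract refers to, is farthest-first seeding: start from one point and repeatedly pick the point farthest from its nearest chosen center. This is only a generic illustration under an assumed distance function; WKPCA's own selection rule and dissimilarity coefficient are defined in the paper and may differ.

```python
import random

def farthest_first_centers(points, k, dist, seed=0):
    """Farthest-first initial center selection.  Starting from a random
    point, each subsequent center is the point whose distance to its
    nearest already-chosen center is largest, which spreads the initial
    centers across the data instead of leaving them clumped together."""
    rng = random.Random(seed)
    centers = [points[rng.randrange(len(points))]]
    while len(centers) < k:
        # Pick the point maximizing the distance to its nearest center.
        far = max(points, key=lambda p: min(dist(p, c) for c in centers))
        centers.append(far)
    return centers
```

With a mixed-data dissimilarity passed in as `dist`, the same routine seeds a k-prototypes-style algorithm.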


Author(s):  
Joshua Zhexue Huang

Much of the data in real-world databases is categorical. For example, the gender, profession, position, and hobby of customers are usually defined as categorical attributes in the CUSTOMER table. Each categorical attribute is represented by a small set of unique categorical values, such as {Female, Male} for the gender attribute. Unlike numeric data, categorical values are discrete and unordered; therefore, clustering algorithms for numeric data cannot be used to cluster the categorical data that exist in many real-world applications. In data mining research, much effort has been put into developing new techniques for clustering categorical data (Huang, 1997b; Huang, 1998; Gibson, Kleinberg, & Raghavan, 1998; Ganti, Gehrke, & Ramakrishnan, 1999; Guha, Rastogi, & Shim, 1999; Chaturvedi, Green, Carroll, & Foods, 2001; Barbara, Li, & Couto, 2002; Andritsos, Tsaparas, Miller, & Sevcik, 2003; Li, Ma, & Ogihara, 2004; Chen, & Liu, 2005; Parmar, Wu, & Blackhurst, 2007). The k-modes clustering algorithm (Huang, 1997b; Huang, 1998) is one of the first algorithms for clustering large categorical datasets. In the past decade, this algorithm has been well studied and widely used in various applications. It has also been adopted in commercial software (e.g., Daylight Chemical Information Systems, Inc., http://www.daylight.com/).
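The two ingredients that distinguish k-modes from k-means are its simple matching dissimilarity and its mode-based center update, both of which can be sketched compactly (the helper names below are ours):

```python
from collections import Counter

def matching_dissimilarity(x, mode):
    """Simple matching dissimilarity used by k-modes: the number of
    attributes on which the object and the cluster mode disagree.
    This replaces the squared Euclidean distance of k-means, which is
    undefined for unordered categorical values."""
    return sum(a != b for a, b in zip(x, mode))

def cluster_mode(objects):
    """Center update step: the new mode takes, for each attribute, the
    most frequent categorical value among the cluster's objects, in the
    same way k-means takes the per-attribute mean."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*objects))
```

Iterating assignment with `matching_dissimilarity` and update with `cluster_mode` yields the k-modes procedure.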


Symmetry ◽  
2019 ◽  
Vol 11 (2) ◽  
pp. 163
Author(s):  
Baobin Duan ◽  
Lixin Han ◽  
Zhinan Gou ◽  
Yi Yang ◽  
Shuangshuang Chen

Given the ubiquity of mixed data with numerical and categorical attributes in the real world, a variety of clustering algorithms have been developed to discover the potential information hidden in such data. Most existing clustering algorithms compute the distances or similarities between data objects directly on the original data, which can make clustering results unstable in the presence of noise. In this paper, a clustering framework is proposed to explore the grouping structure of mixed data. First, the categorical attributes, transformed by one-hot encoding, together with the normalized numerical attributes, are input to a stacked denoising autoencoder to learn internal feature representations. Secondly, based on these feature representations, the distances between data objects in feature space are calculated, and the local density and relative distance of each data object are computed. Thirdly, an improved density peaks clustering algorithm is employed to allocate all the data objects to clusters. Finally, experiments conducted on several UCI datasets demonstrate that the proposed algorithm for clustering mixed data outperforms three baseline algorithms in terms of clustering accuracy and the Rand index.
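The first step of the framework, preparing mixed data for the autoencoder, can be sketched without the network itself. This is a minimal illustration assuming min-max normalization for the numerical part; the paper's exact normalization choice is not stated in the abstract.

```python
import numpy as np

def preprocess_mixed(num_rows, cat_rows):
    """Min-max normalize the numerical attributes and one-hot encode the
    categorical ones, producing the dense vectors that would be fed to a
    stacked denoising autoencoder.  `num_rows` is a list of numerical
    attribute tuples, `cat_rows` a parallel list of categorical tuples."""
    num = np.asarray(num_rows, dtype=float)
    lo, hi = num.min(axis=0), num.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid division by zero
    num_scaled = (num - lo) / span
    onehot_parts = []
    for col in zip(*cat_rows):  # iterate over categorical attributes
        values = sorted(set(col))
        block = np.array([[1.0 if v == u else 0.0 for u in values] for v in col])
        onehot_parts.append(block)
    return np.hstack([num_scaled] + onehot_parts)
```

The resulting matrix has one row per object, with each categorical attribute expanded into one column per distinct value.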


2017 ◽  
Vol 2017 ◽  
pp. 1-15 ◽  
Author(s):  
Chen Jinyin ◽  
He Huihao ◽  
Chen Jungan ◽  
Yu Shanqing ◽  
Shi Zhaoxia

Data objects with mixed numerical and categorical attributes are common in the real world. Most existing algorithms have limitations such as low clustering quality, difficulty in determining cluster centers, and sensitivity to initial parameters. A fast density clustering algorithm (FDCA) is put forward that requires only a single scan of the data, with cluster centers determined automatically by a center set algorithm (CSA). A novel similarity metric is designed for clustering data with both numerical and categorical attributes. CSA chooses cluster centers from the data objects automatically, which overcomes the difficulty of setting cluster centers that affects most clustering algorithms. The performance of the proposed method is verified through a series of experiments on ten mixed datasets, in comparison with several other clustering algorithms, in terms of clustering purity, efficiency, and time complexity.


2020 ◽  
Vol 16 (4) ◽  
pp. 63-83
Author(s):  
Vandana P. Janeja ◽  
Josephine M. Namayanja ◽  
Yelena Yesha ◽  
Anuja Kench ◽  
Vasundhara Misal

The analysis of both continuous and categorical attributes, which together form a heterogeneous mix of attributes, poses challenges for data clustering. Traditional clustering techniques like k-means work well when applied to small homogeneous datasets. However, as the data size grows, it becomes increasingly difficult to find meaningful and well-formed clusters. In this paper, the authors propose an approach that utilizes a combined similarity function, which measures similarity across numeric and categorical features, and employs this function in a clustering algorithm to identify the similarity between data objects. The findings indicate that the proposed approach handles heterogeneous data better by forming well-separated clusters.
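A combined similarity function of the kind described above can be sketched as follows. The specific combination rule here (a Gaussian kernel on the numeric part blended with the categorical match fraction, weighted by attribute counts) is an assumption for illustration, not the paper's actual function.

```python
import math

def combined_similarity(a, b, numeric_keys, sigma=1.0):
    """Similarity in [0, 1] between two objects stored as dicts.
    Numeric part: Gaussian kernel on the squared difference of the
    numeric attributes.  Categorical part: fraction of matching values.
    The two parts are averaged, weighted by how many attributes of each
    kind the objects have."""
    num_d2 = sum((a[k] - b[k]) ** 2 for k in numeric_keys)
    s_num = math.exp(-num_d2 / (2 * sigma ** 2))
    cat_keys = [k for k in a if k not in numeric_keys]
    s_cat = sum(a[k] == b[k] for k in cat_keys) / len(cat_keys) if cat_keys else 1.0
    n_num, n_cat = len(numeric_keys), len(cat_keys)
    return (n_num * s_num + n_cat * s_cat) / (n_num + n_cat)
```

Identical objects score 1.0, and each categorical mismatch or numeric gap pulls the score toward 0.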


Algorithms ◽  
2021 ◽  
Vol 14 (6) ◽  
pp. 184
Author(s):  
Xia Que ◽  
Siyuan Jiang ◽  
Jiaoyun Yang ◽  
Ning An

Many mixed datasets with both numerical and categorical attributes have been collected in various fields, including medicine and biology. Designing appropriate similarity measurements plays an important role in clustering these datasets. Many traditional measurements treat all attributes equally when measuring similarity. However, different attributes may contribute differently, as the amount of information they contain can vary considerably. In this paper, we propose a similarity measurement with entropy-based weighting for clustering mixed datasets. The numerical data are first transformed into categorical data by an automatic categorization technique. Then, an entropy-based weighting strategy is applied to reflect the differing importance of the attributes. We incorporate the proposed measurement into an iterative clustering algorithm, and extensive experiments show that this algorithm outperforms the OCIL and K-Prototype methods by 2.13% and 4.28%, respectively, in terms of accuracy on six mixed datasets from UCI.
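Entropy-based attribute weighting of the general kind described above can be sketched as follows. One common convention (the entropy weight method) assigns higher weight to attributes with lower entropy; the paper's own weighting formula may differ, so treat this as an illustrative assumption.

```python
import math
from collections import Counter

def entropy_weights(columns):
    """Compute one weight per attribute from the Shannon entropy of its
    value distribution, normalized to sum to 1.  `columns` is a list of
    attribute columns, each a list of (categorical) values."""
    def entropy(col):
        n = len(col)
        return -sum((c / n) * math.log(c / n) for c in Counter(col).values())
    ents = [entropy(c) for c in columns]
    max_e = max(ents) if max(ents) > 0 else 1.0
    # Lower entropy -> larger raw score; the epsilon keeps every weight
    # strictly positive.
    raw = [1.0 - e / max_e + 1e-9 for e in ents]
    total = sum(raw)
    return [r / total for r in raw]
```

The weights then scale each attribute's contribution inside the similarity measurement, so attributes carrying more information dominate the comparison.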


Author(s):  
R. R. Gharieb ◽  
G. Gendy ◽  
H. Selim

In this paper, the standard hard C-means (HCM) clustering approach to image segmentation is modified by incorporating a weighted membership Kullback–Leibler (KL) divergence and local data information into the HCM objective function. The membership KL divergence, used for fuzzification, measures the proximity between each cluster membership function of a pixel and the locally smoothed value of the membership in the pixel's vicinity. The fuzzification weight is a function of the pixel-to-cluster-center distances. The pixel-to-cluster-center distance used is composed of the original pixel data distance plus a fraction of the distance generated from the locally smoothed pixel data. It is shown that the resulting membership function of a pixel is proportional to the locally smoothed membership function of that pixel multiplied by an exponentially distributed function of the negative pixel distance relative to the minimum distance provided by the nearest cluster center. Therefore, by incorporating the locally smoothed membership and data information in addition to the relative distance, which is more tolerant of additive noise than the absolute distance, the proposed algorithm has a threefold noise-handling process. The presented algorithm, named local data and membership KL divergence based fuzzy C-means (LDMKLFCM), is tested on synthetic and real-world noisy images, and its results are compared with those of several FCM-based clustering algorithms.


2013 ◽  
Vol 765-767 ◽  
pp. 670-673
Author(s):  
Li Bo Hou

The fuzzy C-means (FCM) clustering algorithm is one of the most widely applied algorithms in unsupervised pattern recognition. However, the iterative process of the FCM algorithm requires a large number of calculations, especially when the feature vectors are high-dimensional; applying the clustering algorithm directly is then not only inefficient but may also suffer from the "curse of dimensionality." This paper analyzes the behavior of fuzzy C-means clustering on high-dimensional features, where determining the cluster centers is an NP-hard problem. To improve the effectiveness and real-time performance of fuzzy C-means clustering in high-dimensional feature analysis, an improved algorithm, FCM-LI, is proposed by combining FCM with the landmark isometric mapping (L-ISOMAP) algorithm. The samples are first analyzed; then, using the preliminary clustering results and the correlation of the sample data, L-ISOMAP is applied to reduce the dimensionality, and the final results are obtained by further analysis. Experimental results demonstrate the effectiveness and real-time performance of the FCM-LI algorithm in high-dimensional feature analysis.


2013 ◽  
Vol 462-463 ◽  
pp. 247-250
Author(s):  
Sa Li ◽  
Liang Shan Shao

Multiple data stream clustering aims to cluster multiple data streams according to their similarity while tracking their changes over time. This paper proposes the M_SCCStream algorithm, based on the cloud model. The algorithm introduces a data cloud node structure with hierarchical characteristics to represent data sequences at different granularities, and uses entropy to indicate the degree of data change. It finds the micro-clusters with the minimum distance and then obtains the clustering result for multiple data streams by calculating the correlation degrees of the micro-clusters. Experiments show that the algorithm achieves higher quality and stability.

