Weighted k-Prototypes Clustering Algorithm Based on the Hybrid Dissimilarity Coefficient

The k-prototypes algorithm is a hybrid clustering algorithm that can process Categorical Data and Numerical Data. In this study, the method of initial Cluster Center selection was improved and a new Hybrid Dissimilarity Coefficient was proposed. Based on the proposed Hybrid Dissimilarity Coefficient, a weighted k-prototype clustering algorithm based on the hybrid dissimilarity coefficient was proposed (WKPCA). The proposed WKPCA algorithm not only improves the selection of initial Cluster Centers, but also puts a new method to calculate the dissimilarity between data objects and Cluster Centers. The real dataset of UCI was used to test the WKPCA algorithm. Experimental results show that WKPCA algorithm is more efficient and robust than other k-prototypes algorithms.

Download Full-text

Research on Improved K-Means Clustering Algorithm

Advanced Materials Research ◽

10.4028/www.scientific.net/amr.403-408.1977 ◽

2011 ◽

Vol 403-408 ◽

pp. 1977-1980

Author(s):

Yin Sheng Zhang ◽

Hui Lin Shan ◽

Jia Qiang Li ◽

Jie Zhou

Keyword(s):

Clustering Algorithm ◽

Cluster Center ◽

Local Optimum ◽

Java Language ◽

Initial Cluster ◽

Feature Extraction And Selection ◽

Clustering Quality ◽

Fuzzy Neural ◽

Hierarchical Clustering Algorithm ◽

Selection Of

The traditional K-means clustering algorithm prematurely plunges into a local optimum because of sensitive selection of the initial cluster center. Hierarchical clustering algorithm can be used to generate the initial cluster center of K-means clustering algorithm. The geometric features of input data can achieve a good distribution by means of pretreatment and feature extraction and selection. In the learning of fuzzy neural network, Java language is used to write source code of the algorithm. The experimental results show that new algorithm has improved the clustering quality effectively.

Download Full-text

The Discretization of Continuous Attributes Based on Improved SOM Clustering

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.701-702.88 ◽

2014 ◽

Vol 701-702 ◽

pp. 88-93 ◽

Cited By ~ 1

Author(s):

Gang Tao ◽

Yong Gang Yan ◽

Jiao Zou ◽

Jun Liu

Keyword(s):

Hierarchical Clustering ◽

Clustering Algorithm ◽

Nearest Neighbors ◽

Experimental Results ◽

Cluster Center ◽

Initial Cluster ◽

Hierarchical Clustering Algorithm ◽

Som Clustering ◽

Continuous Attribute

In order to solve the problem of continuous attribute discretization, a new improved SOM clustering algorithm was proposed. The algorithm uses the SOM to achieve the initial cluster and get the clustering up limit, then treats the initial cluster centers as samples and use the BIRCH hierarchical clustering algorithm to get secondary clustering, then solves the problems of inflated clusters and identifies discrete breakpoints set. Finally, find the nearest neighbors of each cluster center among these any samples of Breakpoints sets which belong to its attribute, and use it as a basis of discrete trimming. The experimental results show that the proposed algorithm outperforms the conventional discrete SOM clustering algorithm in the breakpoints set (contour factor to enhance 75%) and discrete accuracy (incompatible degrees closer to 0) aspects.

Download Full-text

A Similarity Measurement with Entropy-Based Weighting for Clustering Mixed Numerical and Categorical Datasets

Algorithms ◽

10.3390/a14060184 ◽

2021 ◽

Vol 14 (6) ◽

pp. 184

Author(s):

Xia Que ◽

Siyuan Jiang ◽

Jiaoyun Yang ◽

Ning An

Keyword(s):

Categorical Data ◽

Clustering Algorithm ◽

Numerical Data ◽

Similarity Measurement ◽

Amount Of Information ◽

Automatic Categorization ◽

Categorical Attributes ◽

Weighting Strategy

Many mixed datasets with both numerical and categorical attributes have been collected in various fields, including medicine, biology, etc. Designing appropriate similarity measurements plays an important role in clustering these datasets. Many traditional measurements treat various attributes equally when measuring the similarity. However, different attributes may contribute differently as the amount of information they contained could vary a lot. In this paper, we propose a similarity measurement with entropy-based weighting for clustering mixed datasets. The numerical data are first transformed into categorical data by an automatic categorization technique. Then, an entropy-based weighting strategy is applied to denote the different importances of various attributes. We incorporate the proposed measurement into an iterative clustering algorithm, and extensive experiments show that this algorithm outperforms OCIL and K-Prototype methods with 2.13% and 4.28% improvements, respectively, in terms of accuracy on six mixed datasets from UCI.

Download Full-text

Improved minimum-minimum roughness algorithm for clustering categorical data

International Journal of ADVANCED AND APPLIED SCIENCES ◽

10.21833/ijaas.2021.10.006 ◽

2021 ◽

Vol 8 (10) ◽

pp. 43-50

Author(s):

Truong et al. ◽

Keyword(s):

Machine Learning ◽

Data Mining ◽

Hierarchical Clustering ◽

Categorical Data ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Experimental Results ◽

Data Sets ◽

Top Down ◽

Hierarchical Clustering Algorithm

Clustering is a fundamental technique in data mining and machine learning. Recently, many researchers are interested in the problem of clustering categorical data and several new approaches have been proposed. One of the successful and pioneering clustering algorithms is the Minimum-Minimum Roughness algorithm (MMR) which is a top-down hierarchical clustering algorithm and can handle the uncertainty in clustering categorical data. However, MMR tends to choose the category with less value leaf node with more objects, leading to undesirable clustering results. To overcome such shortcomings, this paper proposes an improved version of the MMR algorithm for clustering categorical data, called IMMR (Improved Minimum-Minimum Roughness). Experimental results on actual data sets taken from UCI show that the IMMR algorithm outperforms MMR in clustering categorical data.

Download Full-text

Density Clustering Algorithm Based on the Dynamic Selection of Cluster Center

2019 International Conference on Cyber-Enabled Distributed Computing and Knowledge Discovery (CyberC) ◽

10.1109/cyberc.2019.00050 ◽

2019 ◽

Author(s):

Lulu Sun ◽

Ruilin Zhang

Keyword(s):

Clustering Algorithm ◽

Cluster Center ◽

Dynamic Selection ◽

Density Clustering ◽

Selection Of

Download Full-text

A Modified Overlapping Partitioning Clustering Algorithm for Categorical Data Clustering

Bulletin of Electrical Engineering and Informatics ◽

10.11591/eei.v7i1.896 ◽

2018 ◽

Vol 7 (1) ◽

pp. 55-62

Author(s):

Mohammad Alaqtash ◽

Moayad A.Fadhil ◽

Ali F. Al-Azzawi

Keyword(s):

Categorical Data ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Numerical Data ◽

Data Representation ◽

The Past ◽

Textual Data ◽

Traditional Algorithm ◽

Clustering Problems ◽

Categorical Data Clustering

Clustering is one of the important approaches for Clustering enables the grouping of unlabeled data by partitioning data into clusters with similar patterns. Over the past decades, many clustering algorithms have been developed for various clustering problems. An overlapping partitioning clustering (OPC) algorithm can only handle numerical data. Hence, novel clustering algorithms have been studied extensively to overcome this issue. By increasing the number of objects belonging to one cluster and distance between cluster centers, the study aimed to cluster the textual data type without losing the main functions. The proposed study herein included over twenty newsgroup dataset, which consisted of approximately 20000 textual documents. By introducing some modifications to the traditional algorithm, an acceptable level of homogeneity and completeness of clusters were generated. Modifications were performed on the pre-processing phase and data representation, along with the number methods which influence the primary function of the algorithm. Subsequently, the results were evaluated and compared with the k-means algorithm of the training and test datasets. The results indicated that the modified algorithm could successfully handle the categorical data and produce satisfactory clusters.

Download Full-text

A Fast K-prototypes Algorithm Using Partial Distance Computation

10.20944/preprints201704.0099.v1 ◽

2017 ◽

Author(s):

Byoungwook KIM

Keyword(s):

Minimum Distance ◽

Clustering Algorithm ◽

Cluster Center ◽

Maximum Difference ◽

Distance Computation ◽

Computational Performance ◽

Categorical Attributes ◽

Data Objects ◽

Numeric Data ◽

Numeric Attributes

The k-means is one of the most popular and widely used clustering algorithm, however, it is limited to only numeric data. The k-prototypes algorithm is one of the famous algorithms for dealing with both numeric and categorical data. However, there have been no studies to accelerate k-prototypes algorithm. In this paper, we propose a new fast k-prototypes algorithm that gives the same answer as original k-prototypes. The proposed algorithm avoids distance computations using partial distance computation. Our k-prototypes algorithm finds minimum distance without distance computations of all attributes between an object and a cluster center, which allows it to reduce time complexity. A partial distance computation uses a fact that a value of the maximum difference between two categorical attributes is 1 during distance computations. If data objects have m categorical attributes, maximum difference of categorical attributes between an object and a cluster center is m. Our algorithm first computes distance with only numeric attributes. If a difference of the minimum distance and the second smallest with numeric attributes is higher than m, we can find minimum distance between an object and a cluster center without distance computations of categorical attributes. The experimental shows proposed k-prototypes algorithm improves computational performance than original k-prototypes algorithm in our dataset.

Download Full-text

Ensemble Hybrid K- Means and DBSCAN Clustering Algorithm – HDKA for Cancer Dataset

International Journal of Recent Technology and Engineering - 2 ◽

10.35940/ijrte.d8257.118419 ◽

2019 ◽

Vol 8 (4) ◽

pp. 6036-6040

Keyword(s):

Machine Learning ◽

Data Mining ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Experimental Results ◽

Cancer Dataset ◽

Dbscan Clustering ◽

Selection Of

Data Mining is the foremost vital space of analysis and is pragmatically utilized in totally different domains, It becomes a highly demanding field because huge amounts of data have been collected in various applications. The database can be clustered in more number of ways depending on the clustering algorithm used, parameter settings and other factors. Multiple clustering algorithms can be combined to get the final partitioning of data which provides better clustering results. In this paper, Ensemble hybrid KMeans and DBSCAN (HDKA) algorithm has been proposed to overcome the drawbacks of DBSCAN and KMeans clustering algorithms. The performance of the proposed algorithm improves the selection of centroid points through the centroid selection strategy.For experimental results we have used two dataset Colon and Leukemia from UCI machine learning repository.

Download Full-text

A Fast Density Peak Clustering Method with Autoselect Cluster Centers

Mobile Information Systems ◽

10.1155/2022/4176101 ◽

2022 ◽

Vol 2022 ◽

pp. 1-13

Author(s):

Zhihe Wang ◽

Yongbiao Li ◽

Hui Du ◽

Xiaofen Wei

Keyword(s):

Experimental Results ◽

Cluster Center ◽

Clustering Method ◽

Density Peak ◽

Density Peaks ◽

Density Peaks Clustering ◽

Manual Selection ◽

Density Peak Clustering ◽

Selection Of

Aiming at density peaks clustering needs to manually select cluster centers, this paper proposes a fast new clustering method with auto-select cluster centers. Firstly, our method groups the data and marks each group as core or boundary groups according to its density. Secondly, it determines clusters by iteratively merging two core groups whose distance is less than the threshold and selects the cluster centers at the densest position in each cluster. Finally, it assigns boundary groups to the cluster corresponding to the nearest cluster center. Our method eliminates the need for the manual selection of cluster centers and improves clustering efficiency with the experimental results.

Download Full-text

Nonuniform Sparse Data Clustering Cascade Algorithm Based on Dynamic Cumulative Entropy

Mathematical Problems in Engineering ◽

10.1155/2016/5707692 ◽

2016 ◽

Vol 2016 ◽

pp. 1-10

Author(s):

Ning Li ◽

Yunxia Gu ◽

Zhongliang Deng

Keyword(s):

Initial Data ◽

Prior Knowledge ◽

Clustering Algorithm ◽

Sparse Data ◽

Cluster Center ◽

Control Factor ◽

Cascade Algorithm ◽

Initial Cluster ◽

Data Clusters ◽

Cumulative Entropy

A small amount of prior knowledge and randomly chosen initial cluster centers have a direct impact on the accuracy of the performance of iterative clustering algorithm. In this paper we propose a new algorithm to compute initial cluster centers for k-means clustering and the best number of the clusters with little prior knowledge and optimize clustering result. It constructs the Euclidean distance control factor based on aggregation density sparse degree to select the initial cluster center of nonuniform sparse data and obtains initial data clusters by multidimensional diffusion density distribution. Multiobjective clustering approach based on dynamic cumulative entropy is adopted to optimize the initial data clusters and the best number of the clusters. The experimental results show that the newly proposed algorithm has good performance to obtain the initial cluster centers for the k-means algorithm and it effectively improves the clustering accuracy of nonuniform sparse data by about 5%.

Download Full-text