Discovering Similarity Across Heterogeneous Features

2020 ◽  
Vol 16 (4) ◽  
pp. 63-83
Author(s):  
Vandana P. Janeja ◽  
Josephine M. Namayanja ◽  
Yelena Yesha ◽  
Anuja Kench ◽  
Vasundhara Misal

Analyzing data with a heterogeneous mix of continuous and categorical attributes poses challenges for clustering. Traditional clustering techniques such as k-means work well on small, homogeneous datasets; however, as the data size grows, it becomes increasingly difficult to find meaningful, well-formed clusters. In this paper, the authors propose an approach that uses a combined similarity function, which measures similarity across numeric and categorical features, and employs this function in a clustering algorithm to identify similarity between data objects. The findings indicate that the proposed approach handles heterogeneous data better by forming well-separated clusters.
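
A minimal sketch of what such a combined similarity could look like, assuming a range-normalised numeric part, a simple-matching categorical part, and equal weighting (the function and its weights below are illustrative assumptions, not the authors' exact formulation):

```python
import numpy as np

def combined_similarity(x, y, numeric_idx, categorical_idx, numeric_ranges):
    """Illustrative similarity mixing numeric and categorical attributes.

    Numeric attributes contribute a range-normalised difference turned into a
    similarity; categorical attributes contribute simple matching.
    """
    # Numeric part: 1 - normalised absolute difference, averaged over attributes
    num_sim = np.mean([
        1.0 - abs(x[i] - y[i]) / numeric_ranges[i] for i in numeric_idx
    ]) if numeric_idx else 0.0

    # Categorical part: fraction of attributes with matching values
    cat_sim = np.mean([
        1.0 if x[i] == y[i] else 0.0 for i in categorical_idx
    ]) if categorical_idx else 0.0

    # Equal weighting of the two parts (an assumption, not the paper's choice)
    return 0.5 * num_sim + 0.5 * cat_sim
```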

2019 ◽  
Vol 8 (4) ◽  
pp. 1657-1664

Big data is a rapidly evolving area of information technology, and it brings tremendous challenges for extracting valuable hidden knowledge. Data mining techniques can be applied to big data to extract knowledge for decision making. Big data exhibits high heterogeneity because it consists of various inter-related kinds of objects, such as audio, text, and images, and these inter-related objects carry different information. In this paper, clustering techniques are therefore used to separate the objects into several clusters, which also reduces the computational complexity of classifiers. The Possibilistic c-Means (PCM) algorithm was previously introduced to group objects in big data: it effectively models the degree to which each object belongs to different clusters and is able to resist the corruption of noise during clustering. However, PCM is not efficient enough for big data and cannot capture the complex correlations across the multiple modalities of heterogeneous data objects. A Parallel Semi-supervised Multi-Ant Colonies Clustering (PSMACC) algorithm was therefore introduced for big data clustering. PSMACC first splits the data into a number of partitions, and each partition is processed by a mapper. Each mapper generates a diverse collection of three clustering components using the semi-supervised ant colony clustering algorithm with different moving speeds; a hypergraph model is then used to combine the three clustering components, and two constraints, Must-Link (ML) and Cannot-Link (CL), are incorporated to form a consensus clustering. The intermediate results of the mappers are finally combined in the reducer. However, the iteration overhead of PSMACC is substantial and degrades its performance. A Parallel Semi-supervised Multi-Imperialist Competitive Algorithm (PSMICA) is therefore proposed to cluster big data. In PSMICA, each mapper runs the Imperialist Competitive Algorithm (ICA), in which the members of the initial population are called countries. The best countries in the population are chosen as imperialists, and the remaining countries form the colonies of these imperialists; the colonies move towards their imperialists based on the distance between them. The intermediate results of the mappers are combined in the reducer to obtain the final clustering result.
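
A minimal sketch of the colony-assimilation step that such an ICA-based mapper would perform, following the standard ICA description (the `beta` step-size parameter and the vectorised form are illustrative assumptions, not necessarily the exact PSMICA variant):

```python
import numpy as np

def move_colonies(colonies, imperialist, beta=2.0):
    """One assimilation step of the Imperialist Competitive Algorithm:
    each colony moves a random fraction of the way toward its imperialist.

    `colonies` is an (n, d) array of candidate solutions and `imperialist`
    is the (d,) best solution of the empire.
    """
    direction = imperialist - colonies                      # vectors toward the imperialist
    step = np.random.uniform(0.0, beta, size=(colonies.shape[0], 1))
    return colonies + step * direction                      # moved colonies
```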


Author(s):  
Amolkumar Narayan Jadhav ◽  
Gomathi N.

The widespread application of clustering in various fields has led to the development of many clustering techniques for partitioning multidimensional data into separable clusters. Although various clustering approaches appear in the literature, optimized clustering techniques with multi-objective consideration are rare. This paper proposes a novel data clustering algorithm, Enhanced Kernel-based Exponential Grey Wolf Optimization (EKEGWO), that handles two objectives. EKEGWO, an extension of KEGWO, adopts weight exponential functions to improve the search process of clustering. Moreover, the fitness function of the algorithm includes both intra-cluster distance and inter-cluster distance as objectives, providing an optimal selection of cluster centroids. The performance of the proposed technique is evaluated by comparison with the existing approaches PSC, mPSC, GWO, and EGWO on two datasets: banknote authentication and iris. Four metrics, Mean Square Error (MSE), F-measure, Rand coefficient, and Jaccard coefficient, estimate the clustering efficiency of the algorithm. The proposed EKEGWO algorithm attains an MSE of 837, an F-measure of 0.9657, a Rand coefficient of 0.8472, and a Jaccard coefficient of 0.7812 on the banknote dataset.
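
A minimal sketch of a two-objective fitness of this kind, combining intra-cluster compactness and inter-cluster separation as a plain weighted sum (the weights and the combination are illustrative assumptions; the kernel and exponential weighting of EKEGWO are not reproduced):

```python
import numpy as np

def clustering_fitness(data, labels, centroids, w_intra=0.5, w_inter=0.5):
    """Illustrative fitness: reward small intra-cluster distances and large
    inter-cluster separation. Assumes at least two centroids. Lower is better.
    """
    # Intra-cluster: mean distance of each point to its assigned centroid
    intra = np.mean([np.linalg.norm(x - centroids[k]) for x, k in zip(data, labels)])

    # Inter-cluster: mean pairwise distance between centroids
    pairs = [(i, j) for i in range(len(centroids)) for j in range(i + 1, len(centroids))]
    inter = np.mean([np.linalg.norm(centroids[i] - centroids[j]) for i, j in pairs])

    return w_intra * intra - w_inter * inter
```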


Author(s):  
Byoungwook KIM

k-means is one of the most popular and widely used clustering algorithms; however, it is limited to numeric data. The k-prototypes algorithm is a well-known algorithm for dealing with both numeric and categorical data, yet there have been no studies on accelerating it. In this paper, we propose a new fast k-prototypes algorithm that gives the same answer as the original k-prototypes. The proposed algorithm avoids unnecessary distance computations using partial distance computation: it finds the minimum distance without computing distances over all attributes between an object and a cluster center, which reduces the time complexity. Partial distance computation exploits the fact that the maximum difference between two values of a categorical attribute is 1, so if data objects have m categorical attributes, the maximum categorical distance between an object and a cluster center is m. Our algorithm first computes distances using only the numeric attributes. If the difference between the minimum and the second-smallest numeric distance is greater than m, the nearest cluster center can be determined without computing the categorical distances. Experiments show that the proposed k-prototypes algorithm improves computational performance over the original k-prototypes algorithm on our datasets.
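
A minimal sketch of this partial-distance assignment step, assuming squared Euclidean distance for numeric attributes and simple matching for categorical attributes (the gamma weight of standard k-prototypes is omitted for clarity):

```python
import numpy as np

def assign_cluster_partial(x_num, x_cat, centers_num, centers_cat):
    """Pick the nearest cluster using the partial-distance idea described above.

    Numeric distances to all centers are computed first; the categorical part
    (at most m = number of categorical attributes) is only computed when the
    numeric margin cannot already decide the winner. Assumes >= 2 centers and
    numpy arrays for all inputs.
    """
    m = len(x_cat)
    # Numeric part of the distance to every center
    d_num = np.array([np.sum((x_num - c) ** 2) for c in centers_num])

    order = np.argsort(d_num)
    best, second = order[0], order[1]

    # The categorical part adds at most m to any distance, so if the runner-up's
    # numeric distance already exceeds the winner's by more than m, we are done.
    if d_num[second] - d_num[best] > m:
        return int(best)

    # Otherwise fall back to full distances (numeric + categorical mismatches)
    d_full = d_num + np.array([np.sum(x_cat != c) for c in centers_cat])
    return int(np.argmin(d_full))
```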


2018 ◽  
Vol 27 (3) ◽  
pp. 317-329 ◽  
Author(s):  
Satish Chander ◽  
P. Vijaya ◽  
Praveen Dhyani

The growth of databases in fields such as medicine, business, education, and marketing has been colossal because of developments in information technology, and discovering knowledge concealed in such bulk databases is a tedious task. Data mining is one of the promising solutions, and clustering is one of its applications. Clustering groups data objects that are related to each other into the same cluster and dissimilar objects into other clusters. The literature presents many clustering algorithms for data clustering; optimisation-based clustering is a recently developed approach that discovers the optimal clusters based on an objective function. In our previous work, the decisive operative fractional lion (DOFL) optimisation algorithm was proposed for data clustering. In this paper, we design a new clustering algorithm called the adaptive decisive operative fractional lion (ADOFL) optimisation algorithm, based on a multi-kernel function. Moreover, a new fitness function called the multi-kernel WL index is proposed for selecting the best centroid point for clustering. The proposed ADOFL algorithm is evaluated on two benchmark datasets, Iris and Wine, and its performance is validated against existing clustering algorithms such as the particle swarm clustering (PSC) algorithm, the modified PSC algorithm, the lion algorithm, the fractional lion algorithm, and DOFL. The results show that the proposed method obtains a maximum clustering accuracy of 79.51 in data clustering.
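
As a rough illustration of a multi-kernel measure driving centroid selection, the sketch below combines a Gaussian and a polynomial kernel with fixed weights; it is a generic stand-in under stated assumptions, not the paper's multi-kernel WL index:

```python
import numpy as np

def multi_kernel_similarity(x, c, sigma=1.0, degree=2, w=(0.5, 0.5)):
    """Illustrative multi-kernel similarity between a point and a centroid:
    a weighted sum of a Gaussian kernel and a polynomial kernel.
    """
    gaussian = np.exp(-np.sum((x - c) ** 2) / (2.0 * sigma ** 2))
    polynomial = (np.dot(x, c) + 1.0) ** degree
    return w[0] * gaussian + w[1] * polynomial

def centroid_fitness(data, labels, centroids):
    """Fitness to maximise: average multi-kernel similarity of each point to
    its assigned centroid (an illustrative objective for centroid selection).
    """
    return np.mean([multi_kernel_similarity(x, centroids[k]) for x, k in zip(data, labels)])
```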


2021 ◽  
Vol 15 ◽  
Author(s):  
An Yan ◽  
Wei Wang ◽  
Yi Ren ◽  
HongWei Geng

Data abnormalities and missing data hinder traditional clustering of multi-modal heterogeneous big data. To address this, a multi-view heterogeneous big data clustering algorithm based on improved K-means is established in this paper. First, for big data involving heterogeneous data, we propose an improved K-means algorithm built on a multi-view heterogeneous framework, using multi-view data analysis to determine the similarity detection metrics. Then, a BP neural network is used to predict missing attribute values, complete the missing data, and restore the structure of the heterogeneous big data. Finally, we propose a data denoising algorithm to remove the abnormal data. Based on these methods, we construct a framework named BPK-means to resolve the problems of data abnormalities and missing data. Our approach is evaluated through a rigorous performance study; both theoretical verification and experimental results show that the accuracy of the proposed method is greatly improved compared with the original algorithm.
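
A minimal sketch of the impute-then-cluster idea, using scikit-learn's MLPRegressor as a stand-in backpropagation network to fill one incomplete attribute before running k-means (the function name and pipeline below are illustrative assumptions; the multi-view similarity and denoising steps of BPK-means are not reproduced):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.cluster import KMeans

def impute_then_cluster(X, target_col, n_clusters=3):
    """Train a backpropagation network on complete rows to predict a column
    with missing values (NaN), fill them in, then cluster with k-means.
    Assumes the remaining attributes are numeric and complete.
    """
    missing = np.isnan(X[:, target_col])
    features = np.delete(X, target_col, axis=1)

    # Fit the network on rows where the target attribute is observed
    net = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0)
    net.fit(features[~missing], X[~missing, target_col])

    # Predict and fill the missing attribute values
    X_filled = X.copy()
    X_filled[missing, target_col] = net.predict(features[missing])

    # Cluster the completed data with k-means
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X_filled)
    return X_filled, labels
```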


Symmetry ◽  
2019 ◽  
Vol 11 (2) ◽  
pp. 163
Author(s):  
Baobin Duan ◽  
Lixin Han ◽  
Zhinan Gou ◽  
Yi Yang ◽  
Shuangshuang Chen

With the widespread presence of mixed data containing both numerical and categorical attributes in the real world, a variety of clustering algorithms have been developed to discover the potential information hidden in such data. Most existing clustering algorithms compute the distances or similarities between data objects on the original data, which may make clustering results unstable in the presence of noise. In this paper, a clustering framework is proposed to explore the grouping structure of mixed data. First, categorical attributes transformed by one-hot encoding and normalized numerical attributes are fed into a stacked denoising autoencoder to learn internal feature representations. Second, based on these feature representations, the distances between data objects in the feature space are calculated, and the local density and relative distance of each data object are computed. Third, an improved density peaks clustering algorithm is employed to allocate the data objects to different clusters. Finally, experiments conducted on several UCI datasets demonstrate that the proposed algorithm for clustering mixed data outperforms three baseline algorithms in terms of clustering accuracy and Rand index.
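
A minimal sketch of the two density-peaks quantities (local density and relative distance) computed on learned feature vectors, using the standard cutoff-kernel definition; the autoencoder that produces the features and the improved allocation step are omitted:

```python
import numpy as np

def density_peaks_quantities(Z, d_c):
    """Compute local density rho (number of neighbours within cutoff d_c) and
    relative distance delta (distance to the nearest point of higher density)
    for feature vectors Z of shape (n, d).
    """
    n = Z.shape[0]
    D = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=2)   # pairwise distances

    rho = (D < d_c).sum(axis=1) - 1          # exclude the point itself
    delta = np.full(n, D.max())              # convention for the densest point
    for i in range(n):
        higher = np.where(rho > rho[i])[0]   # points with strictly higher density
        if higher.size:
            delta[i] = D[i, higher].min()
    return rho, delta
```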


2018 ◽  
Vol 6 (2) ◽  
pp. 176-183
Author(s):  
Purnendu Das ◽  
Bishwa Ranjan Roy ◽  
Saptarshi Paul ◽  
...  

Algorithms ◽  
2021 ◽  
Vol 14 (6) ◽  
pp. 184
Author(s):  
Xia Que ◽  
Siyuan Jiang ◽  
Jiaoyun Yang ◽  
Ning An

Many mixed datasets with both numerical and categorical attributes have been collected in various fields, including medicine and biology. Designing appropriate similarity measurements plays an important role in clustering these datasets. Many traditional measurements treat all attributes equally when measuring similarity; however, different attributes may contribute differently, since the amount of information they contain can vary considerably. In this paper, we propose a similarity measurement with entropy-based weighting for clustering mixed datasets. The numerical data are first transformed into categorical data by an automatic categorization technique. Then, an entropy-based weighting strategy is applied to capture the differing importance of the attributes. We incorporate the proposed measurement into an iterative clustering algorithm, and extensive experiments show that this algorithm outperforms the OCIL and K-Prototype methods with 2.13% and 4.28% improvements, respectively, in terms of accuracy on six mixed datasets from UCI.
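
A minimal sketch of entropy-based attribute weighting on categorical data (numeric attributes assumed already discretised); the normalisation below, inverse entropy rescaled to sum to 1, is one simple convention and not necessarily the paper's exact formulation:

```python
import numpy as np
from collections import Counter

def entropy_weights(X_cat):
    """Compute one weight per categorical attribute from its Shannon entropy.
    X_cat is an (n, d) array of categorical values.
    """
    n, d = X_cat.shape
    entropies = np.empty(d)
    for j in range(d):
        counts = np.array(list(Counter(X_cat[:, j]).values()), dtype=float)
        p = counts / n
        entropies[j] = -np.sum(p * np.log2(p))   # Shannon entropy of attribute j

    # Lower entropy -> more concentrated attribute -> higher weight (one convention)
    inv = 1.0 / (entropies + 1e-12)
    return inv / inv.sum()
```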


2014 ◽  
Vol 543-547 ◽  
pp. 1934-1938
Author(s):  
Ming Xiao

For clustering two-dimensional spatial data, the Adaptive Resonance Theory (ART) network not only suffers from pattern drift and the loss of vector-magnitude information, but also adapts poorly to spatial data with irregular distributions. A Tree-ART2 network model is proposed to address these shortcomings. It retains the memory of the old model and maintains the spatial-distance constraint by learning and adjusting the LTM pattern and the amplitude information of the vector. Meanwhile, introducing a tree structure into the model reduces the subjective choice of the vigilance parameter and decreases the occurrence of pattern mixing. Comparative experiments show that the TART2 network has higher plasticity and adaptability.
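
As background for the vigilance mechanism mentioned above, the sketch below shows a generic ART-style resonance step (match a prototype, test against the vigilance parameter, adapt or create a category); it is a simplified illustration under stated assumptions, not the TART2 model itself:

```python
import numpy as np

def art_step(x, prototypes, vigilance=0.8, lr=0.2):
    """Generic ART-style step: find the best-matching prototype; if the match
    passes the vigilance test, move it toward the input, otherwise commit the
    input as a new category. `prototypes` is a list of unit-length vectors.
    """
    x = x / (np.linalg.norm(x) + 1e-12)                      # normalise the input
    if not prototypes:
        return [x], 0                                        # first category

    sims = [float(np.dot(x, p) / (np.linalg.norm(p) + 1e-12)) for p in prototypes]
    best = int(np.argmax(sims))

    if sims[best] >= vigilance:                              # resonance: adapt the winner
        prototypes[best] = (1 - lr) * prototypes[best] + lr * x
        return prototypes, best
    prototypes.append(x)                                     # mismatch: new category
    return prototypes, len(prototypes) - 1
```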

