Discovering Similarity Across Heterogeneous Features

2020 ◽  
Vol 16 (4) ◽  
pp. 63-83
Author(s):  
Vandana P. Janeja ◽  
Josephine M. Namayanja ◽  
Yelena Yesha ◽  
Anuja Kench ◽  
Vasundhara Misal

Analyzing data with a heterogeneous mix of continuous and categorical attributes poses challenges for clustering. Traditional clustering techniques such as k-means work well on small, homogeneous datasets; however, as the data size grows, it becomes increasingly difficult to find meaningful, well-formed clusters. In this paper, the authors propose an approach that uses a combined similarity function, which measures similarity across numeric and categorical features, and employs this function in a clustering algorithm to identify similarity between data objects. The findings indicate that the proposed approach handles heterogeneous data better by forming well-separated clusters.
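
A minimal sketch of what such a combined similarity could look like, assuming a range-normalised numeric part, a simple-matching categorical part, and equal weighting (the function and its weights below are illustrative assumptions, not the authors' exact formulation):

```python
import numpy as np

def combined_similarity(x, y, numeric_idx, categorical_idx, numeric_ranges):
    """Illustrative similarity mixing numeric and categorical attributes.

    Numeric attributes contribute a range-normalised difference turned into a
    similarity; categorical attributes contribute simple matching.
    """
    # Numeric part: 1 - normalised absolute difference, averaged over attributes
    num_sim = np.mean([
        1.0 - abs(x[i] - y[i]) / numeric_ranges[i] for i in numeric_idx
    ]) if numeric_idx else 0.0

    # Categorical part: fraction of attributes with matching values
    cat_sim = np.mean([
        1.0 if x[i] == y[i] else 0.0 for i in categorical_idx
    ]) if categorical_idx else 0.0

    # Equal weighting of the two parts (an assumption, not the paper's choice)
    return 0.5 * num_sim + 0.5 * cat_sim
```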

2019 ◽  
Vol 8 (4) ◽  
pp. 1657-1664

Big data is a rapidly evolving area of information technology, and it brings tremendous challenges for extracting valuable hidden knowledge. Data mining techniques can be applied to big data to extract knowledge for decision making. Big data exhibits high heterogeneity because it consists of various inter-related kinds of objects, such as audio, text, and images, and these inter-related objects carry different information. In this paper, clustering techniques are therefore used to separate the objects into several clusters, which also reduces the computational complexity of classifiers. The Possibilistic c-Means (PCM) algorithm was previously introduced to group objects in big data: it effectively models the degree to which each object belongs to different clusters and is able to resist the corruption of noise during clustering. However, PCM is not efficient enough for big data and cannot capture the complex correlations across the multiple modalities of heterogeneous data objects. A Parallel Semi-supervised Multi-Ant Colonies Clustering (PSMACC) algorithm was therefore introduced for big data clustering. PSMACC first splits the data into a number of partitions, and each partition is processed by a mapper. Each mapper generates a diverse collection of three clustering components using the semi-supervised ant colony clustering algorithm with different moving speeds; a hypergraph model is then used to combine the three clustering components, and two constraints, Must-Link (ML) and Cannot-Link (CL), are incorporated to form a consensus clustering. The intermediate results of the mappers are finally combined in the reducer. However, the iteration overhead of PSMACC is substantial and degrades its performance. A Parallel Semi-supervised Multi-Imperialist Competitive Algorithm (PSMICA) is therefore proposed to cluster big data. In PSMICA, each mapper runs the Imperialist Competitive Algorithm (ICA), in which the members of the initial population are called countries. The best countries in the population are chosen as imperialists, and the remaining countries form the colonies of these imperialists; the colonies move towards their imperialists based on the distance between them. The intermediate results of the mappers are combined in the reducer to obtain the final clustering result.
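
A minimal sketch of the colony-assimilation step that such an ICA-based mapper would perform, following the standard ICA description (the `beta` step-size parameter and the vectorised form are illustrative assumptions, not necessarily the exact PSMICA variant):

```python
import numpy as np

def move_colonies(colonies, imperialist, beta=2.0):
    """One assimilation step of the Imperialist Competitive Algorithm:
    each colony moves a random fraction of the way toward its imperialist.

    `colonies` is an (n, d) array of candidate solutions and `imperialist`
    is the (d,) best solution of the empire.
    """
    direction = imperialist - colonies                      # vectors toward the imperialist
    step = np.random.uniform(0.0, beta, size=(colonies.shape[0], 1))
    return colonies + step * direction                      # moved colonies
```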


Author(s):  
Amolkumar Narayan Jadhav ◽  
Gomathi N.

The widespread application of clustering in various fields has led to the development of many clustering techniques for partitioning multidimensional data into separable clusters. Although various clustering approaches appear in the literature, optimized clustering techniques with multi-objective consideration are rare. This paper proposes a novel data clustering algorithm, Enhanced Kernel-based Exponential Grey Wolf Optimization (EKEGWO), that handles two objectives. EKEGWO, an extension of KEGWO, adopts weight exponential functions to improve the search process of clustering. Moreover, the fitness function of the algorithm includes both intra-cluster distance and inter-cluster distance as objectives, providing an optimal selection of cluster centroids. The performance of the proposed technique is evaluated by comparison with the existing approaches PSC, mPSC, GWO, and EGWO on two datasets: banknote authentication and iris. Four metrics, Mean Square Error (MSE), F-measure, Rand coefficient, and Jaccard coefficient, estimate the clustering efficiency of the algorithm. The proposed EKEGWO algorithm attains an MSE of 837, an F-measure of 0.9657, a Rand coefficient of 0.8472, and a Jaccard coefficient of 0.7812 on the banknote dataset.
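
A minimal sketch of a two-objective fitness of this kind, combining intra-cluster compactness and inter-cluster separation as a plain weighted sum (the weights and the combination are illustrative assumptions; the kernel and exponential weighting of EKEGWO are not reproduced):

```python
import numpy as np

def clustering_fitness(data, labels, centroids, w_intra=0.5, w_inter=0.5):
    """Illustrative fitness: reward small intra-cluster distances and large
    inter-cluster separation. Assumes at least two centroids. Lower is better.
    """
    # Intra-cluster: mean distance of each point to its assigned centroid
    intra = np.mean([np.linalg.norm(x - centroids[k]) for x, k in zip(data, labels)])

    # Inter-cluster: mean pairwise distance between centroids
    pairs = [(i, j) for i in range(len(centroids)) for j in range(i + 1, len(centroids))]
    inter = np.mean([np.linalg.norm(centroids[i] - centroids[j]) for i, j in pairs])

    return w_intra * intra - w_inter * inter
```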


Author(s):  
Byoungwook KIM

k-means is one of the most popular and widely used clustering algorithms; however, it is limited to numeric data. The k-prototypes algorithm is a well-known algorithm for dealing with both numeric and categorical data, yet there have been no studies on accelerating it. In this paper, we propose a new fast k-prototypes algorithm that gives the same answer as the original k-prototypes. The proposed algorithm avoids unnecessary distance computations using partial distance computation: it finds the minimum distance without computing distances over all attributes between an object and a cluster center, which reduces the time complexity. Partial distance computation exploits the fact that the maximum difference between two values of a categorical attribute is 1, so if data objects have m categorical attributes, the maximum categorical distance between an object and a cluster center is m. Our algorithm first computes distances using only the numeric attributes. If the difference between the minimum and the second-smallest numeric distance is greater than m, the nearest cluster center can be determined without computing the categorical distances. Experiments show that the proposed k-prototypes algorithm improves computational performance over the original k-prototypes algorithm on our datasets.
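
A minimal sketch of this partial-distance assignment step, assuming squared Euclidean distance for numeric attributes and simple matching for categorical attributes (the gamma weight of standard k-prototypes is omitted for clarity):

```python
import numpy as np

def assign_cluster_partial(x_num, x_cat, centers_num, centers_cat):
    """Pick the nearest cluster using the partial-distance idea described above.

    Numeric distances to all centers are computed first; the categorical part
    (at most m = number of categorical attributes) is only computed when the
    numeric margin cannot already decide the winner. Assumes >= 2 centers and
    numpy arrays for all inputs.
    """
    m = len(x_cat)
    # Numeric part of the distance to every center
    d_num = np.array([np.sum((x_num - c) ** 2) for c in centers_num])

    order = np.argsort(d_num)
    best, second = order[0], order[1]

    # The categorical part adds at most m to any distance, so if the runner-up's
    # numeric distance already exceeds the winner's by more than m, we are done.
    if d_num[second] - d_num[best] > m:
        return int(best)

    # Otherwise fall back to full distances (numeric + categorical mismatches)
    d_full = d_num + np.array([np.sum(x_cat != c) for c in centers_cat])
    return int(np.argmin(d_full))
```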


2018 ◽  
Vol 27 (3) ◽  
pp. 317-329 ◽  
Author(s):  
Satish Chander ◽  
P. Vijaya ◽  
Praveen Dhyani

The growth of databases in fields such as medicine, business, education, and marketing has been colossal because of developments in information technology, and discovering knowledge concealed in such bulk databases is a tedious task. Data mining is one of the promising solutions, and clustering is one of its applications. Clustering groups data objects that are related to each other into the same cluster and dissimilar objects into other clusters. The literature presents many clustering algorithms for data clustering; optimisation-based clustering is a recently developed approach that discovers the optimal clusters based on an objective function. In our previous work, the decisive operative fractional lion (DOFL) optimisation algorithm was proposed for data clustering. In this paper, we design a new clustering algorithm called the adaptive decisive operative fractional lion (ADOFL) optimisation algorithm, based on a multi-kernel function. Moreover, a new fitness function called the multi-kernel WL index is proposed for selecting the best centroid point for clustering. The proposed ADOFL algorithm is evaluated on two benchmark datasets, Iris and Wine, and its performance is validated against existing clustering algorithms such as the particle swarm clustering (PSC) algorithm, the modified PSC algorithm, the lion algorithm, the fractional lion algorithm, and DOFL. The results show that the proposed method obtains a maximum clustering accuracy of 79.51 in data clustering.
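
As a rough illustration of a multi-kernel measure driving centroid selection, the sketch below combines a Gaussian and a polynomial kernel with fixed weights; it is a generic stand-in under stated assumptions, not the paper's multi-kernel WL index:

```python
import numpy as np

def multi_kernel_similarity(x, c, sigma=1.0, degree=2, w=(0.5, 0.5)):
    """Illustrative multi-kernel similarity between a point and a centroid:
    a weighted sum of a Gaussian kernel and a polynomial kernel.
    """
    gaussian = np.exp(-np.sum((x - c) ** 2) / (2.0 * sigma ** 2))
    polynomial = (np.dot(x, c) + 1.0) ** degree
    return w[0] * gaussian + w[1] * polynomial

def centroid_fitness(data, labels, centroids):
    """Fitness to maximise: average multi-kernel similarity of each point to
    its assigned centroid (an illustrative objective for centroid selection).
    """
    return np.mean([multi_kernel_similarity(x, centroids[k]) for x, k in zip(data, labels)])
```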


2021 ◽  
Vol 15 ◽  
Author(s):  
An Yan ◽  
Wei Wang ◽  
Yi Ren ◽  
HongWei Geng

Data abnormalities and missing data hinder traditional clustering of multi-modal heterogeneous big data. To address this, a multi-view heterogeneous big data clustering algorithm based on improved K-means is established in this paper. First, for big data involving heterogeneous data, we propose an improved K-means algorithm built on a multi-view heterogeneous framework, using multi-view data analysis to determine the similarity detection metrics. Then, a BP neural network is used to predict missing attribute values, complete the missing data, and restore the structure of the heterogeneous big data. Finally, we propose a data denoising algorithm to remove the abnormal data. Based on these methods, we construct a framework named BPK-means to resolve the problems of data abnormalities and missing data. Our approach is evaluated through a rigorous performance study; both theoretical verification and experimental results show that the accuracy of the proposed method is greatly improved compared with the original algorithm.
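
A minimal sketch of the impute-then-cluster idea, using scikit-learn's MLPRegressor as a stand-in backpropagation network to fill one incomplete attribute before running k-means (the function name and pipeline below are illustrative assumptions; the multi-view similarity and denoising steps of BPK-means are not reproduced):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.cluster import KMeans

def impute_then_cluster(X, target_col, n_clusters=3):
    """Train a backpropagation network on complete rows to predict a column
    with missing values (NaN), fill them in, then cluster with k-means.
    Assumes the remaining attributes are numeric and complete.
    """
    missing = np.isnan(X[:, target_col])
    features = np.delete(X, target_col, axis=1)

    # Fit the network on rows where the target attribute is observed
    net = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0)
    net.fit(features[~missing], X[~missing, target_col])

    # Predict and fill the missing attribute values
    X_filled = X.copy()
    X_filled[missing, target_col] = net.predict(features[missing])

    # Cluster the completed data with k-means
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X_filled)
    return X_filled, labels
```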


Symmetry ◽  
2019 ◽  
Vol 11 (2) ◽  
pp. 163
Author(s):  
Baobin Duan ◽  
Lixin Han ◽  
Zhinan Gou ◽  
Yi Yang ◽  
Shuangshuang Chen

With the widespread presence of mixed data containing both numerical and categorical attributes in the real world, a variety of clustering algorithms have been developed to discover the potential information hidden in such data. Most existing clustering algorithms compute the distances or similarities between data objects on the original data, which may make clustering results unstable in the presence of noise. In this paper, a clustering framework is proposed to explore the grouping structure of mixed data. First, categorical attributes transformed by one-hot encoding and normalized numerical attributes are fed into a stacked denoising autoencoder to learn internal feature representations. Second, based on these feature representations, the distances between data objects in the feature space are calculated, and the local density and relative distance of each data object are computed. Third, an improved density peaks clustering algorithm is employed to allocate the data objects to different clusters. Finally, experiments conducted on several UCI datasets demonstrate that the proposed algorithm for clustering mixed data outperforms three baseline algorithms in terms of clustering accuracy and Rand index.
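
A minimal sketch of the two density-peaks quantities (local density and relative distance) computed on learned feature vectors, using the standard cutoff-kernel definition; the autoencoder that produces the features and the improved allocation step are omitted:

```python
import numpy as np

def density_peaks_quantities(Z, d_c):
    """Compute local density rho (number of neighbours within cutoff d_c) and
    relative distance delta (distance to the nearest point of higher density)
    for feature vectors Z of shape (n, d).
    """
    n = Z.shape[0]
    D = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=2)   # pairwise distances

    rho = (D < d_c).sum(axis=1) - 1          # exclude the point itself
    delta = np.full(n, D.max())              # convention for the densest point
    for i in range(n):
        higher = np.where(rho > rho[i])[0]   # points with strictly higher density
        if higher.size:
            delta[i] = D[i, higher].min()
    return rho, delta
```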


2018 ◽  
Vol 6 (2) ◽  
pp. 176-183
Author(s):  
Purnendu Das ◽  
Bishwa Ranjan Roy ◽  
Saptarshi Paul ◽  
...  

Algorithms ◽  
2021 ◽  
Vol 14 (6) ◽  
pp. 184
Author(s):  
Xia Que ◽  
Siyuan Jiang ◽  
Jiaoyun Yang ◽  
Ning An

Many mixed datasets with both numerical and categorical attributes have been collected in various fields, including medicine and biology. Designing appropriate similarity measurements plays an important role in clustering these datasets. Many traditional measurements treat all attributes equally when measuring similarity; however, different attributes may contribute differently, since the amount of information they contain can vary considerably. In this paper, we propose a similarity measurement with entropy-based weighting for clustering mixed datasets. The numerical data are first transformed into categorical data by an automatic categorization technique. Then, an entropy-based weighting strategy is applied to capture the differing importance of the attributes. We incorporate the proposed measurement into an iterative clustering algorithm, and extensive experiments show that this algorithm outperforms the OCIL and K-Prototype methods with 2.13% and 4.28% improvements, respectively, in terms of accuracy on six mixed datasets from UCI.
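
A minimal sketch of entropy-based attribute weighting on categorical data (numeric attributes assumed already discretised); the normalisation below, inverse entropy rescaled to sum to 1, is one simple convention and not necessarily the paper's exact formulation:

```python
import numpy as np
from collections import Counter

def entropy_weights(X_cat):
    """Compute one weight per categorical attribute from its Shannon entropy.
    X_cat is an (n, d) array of categorical values.
    """
    n, d = X_cat.shape
    entropies = np.empty(d)
    for j in range(d):
        counts = np.array(list(Counter(X_cat[:, j]).values()), dtype=float)
        p = counts / n
        entropies[j] = -np.sum(p * np.log2(p))   # Shannon entropy of attribute j

    # Lower entropy -> more concentrated attribute -> higher weight (one convention)
    inv = 1.0 / (entropies + 1e-12)
    return inv / inv.sum()
```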


2014 ◽  
Vol 543-547 ◽  
pp. 1934-1938
Author(s):  
Ming Xiao

For clustering two-dimensional spatial data, the Adaptive Resonance Theory (ART) network not only suffers from pattern drift and the loss of vector-magnitude information, but also adapts poorly to spatial data with irregular distributions. A Tree-ART2 network model is proposed to address these shortcomings. It retains the memory of the old model and maintains the spatial-distance constraint by learning and adjusting the LTM pattern and the amplitude information of the vector. Meanwhile, introducing a tree structure into the model reduces the subjective choice of the vigilance parameter and decreases the occurrence of pattern mixing. Comparative experiments show that the TART2 network has higher plasticity and adaptability.
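
As background for the vigilance mechanism mentioned above, the sketch below shows a generic ART-style resonance step (match a prototype, test against the vigilance parameter, adapt or create a category); it is a simplified illustration under stated assumptions, not the TART2 model itself:

```python
import numpy as np

def art_step(x, prototypes, vigilance=0.8, lr=0.2):
    """Generic ART-style step: find the best-matching prototype; if the match
    passes the vigilance test, move it toward the input, otherwise commit the
    input as a new category. `prototypes` is a list of unit-length vectors.
    """
    x = x / (np.linalg.norm(x) + 1e-12)                      # normalise the input
    if not prototypes:
        return [x], 0                                        # first category

    sims = [float(np.dot(x, p) / (np.linalg.norm(p) + 1e-12)) for p in prototypes]
    best = int(np.argmax(sims))

    if sims[best] >= vigilance:                              # resonance: adapt the winner
        prototypes[best] = (1 - lr) * prototypes[best] + lr * x
        return prototypes, best
    prototypes.append(x)                                     # mismatch: new category
    return prototypes, len(prototypes) - 1
```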

