Impact of Distance Measures on the Performance of Clustering Algorithms

Author(s):  
Vijay Kumar ◽  
Jitender Kumar Chhabra ◽  
Dinesh Kumar
2013 ◽  
Vol 12 (5) ◽  
pp. 3443-3451
Author(s):  
Rajesh Pasupuleti ◽  
Narsimha Gugulothu

Clustering analysis has opened a new direction in data mining, with major impact in domains including machine learning, pattern recognition, image processing, information retrieval and bioinformatics. Current clustering techniques do not adequately address some requirements and have failed to standardize clustering algorithms that support all real applications. Many clustering methods depend on user-specified parameters, and the initial cluster seeds are selected randomly by the user. In this paper, we propose a new clustering method based on linear approximation of a function: we first obtain an overall idea of the behavior of the clustering function, then pick the initial cluster seeds as points on the linear approximation line and perform the clustering operations. This differs from traditional clustering methods, which group data objects into clusters using distance measures, similarity measures and statistical distributions. Experimental results, illustrated with an example on business data, show that clustering based on linear approximation yields good results in practice. The paper also explains privacy-preserving clustering of sensitive data objects.


2019 ◽  
Vol 37 (2) ◽  
pp. 172-179 ◽  
Author(s):  
Gisely Paula Gomes ◽  
Viviane Yumi Baba ◽  
Odair P dos Santos ◽  
Cláudia P Sudré ◽  
Cintia dos S Bento ◽  
...  

ABSTRACT Characterization and evaluation of genotypes conserved in the germplasm banks have become of great importance due to gradual loss of genetic variability and search for more adapted and productive genotypes. This can be obtained through several ways, generating quantitative and qualitative data. Joint analysis of those variables may be considered a strategy for an accurate germplasm characterization. In this study we aimed to evaluate different clustering techniques for characterization and evaluation of Capsicum spp. accessions using combinations of specific measures for quantitative and qualitative variables. A collection of 56 Capsicum spp. accessions was characterized based on 25 morphoagronomic descriptors. Six quantitative distances were used [A1) average of the range-standardized absolute difference (Gower), A2) Pearson correlation, A3) Kulczynski, A4) Canberra, A5) Bray-Curtis, and A6) Morisita] combined with distance for qualitative data [Simple Coincidence (B1)]. Clustering analyses were performed using agglomerative hierarchical methods (Ward, the nearest neighbor, the farthest neighbor, UPGMA and WPGMA). All combined distances were highly correlated. UPGMA clustering was the most efficient through cophenetic correlation and 2-norm analyses, showing a concordance between the two methods. Six clusters were considered an ideal number by UPGMA clustering, in which Gower distance showed a better adjustment for clustering. Most combined distances using UPGMA clustering allowed the separation of the accessions in relation to species, using both quantitative and qualitative data, which could be an alternative for simultaneous joint analysis, aiming to compare different clusters.
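Three of the measures named in this abstract have short textbook definitions, sketched here for illustration. The function names and the sample vectors are hypothetical, and the abstract does not say how the quantitative and qualitative distances were combined, so no combination step is shown.

```python
def canberra(x, y):
    """Canberra distance: sum of |xi - yi| / (|xi| + |yi|), skipping 0/0 terms."""
    total = 0.0
    for a, b in zip(x, y):
        denom = abs(a) + abs(b)
        if denom > 0:
            total += abs(a - b) / denom
    return total


def bray_curtis(x, y):
    """Bray-Curtis dissimilarity: sum|xi - yi| / sum|xi + yi|."""
    num = sum(abs(a - b) for a, b in zip(x, y))
    den = sum(abs(a + b) for a, b in zip(x, y))
    return num / den if den > 0 else 0.0


def simple_matching_distance(x, y):
    """Fraction of qualitative descriptors on which two accessions disagree."""
    return sum(a != b for a, b in zip(x, y)) / len(x)


# Made-up quantitative and qualitative descriptors for two accessions.
quant_a, quant_b = [1.0, 2.0, 3.0], [1.0, 4.0, 1.0]
qual_a, qual_b = ["red", "round"], ["red", "elongated"]

print(round(canberra(quant_a, quant_b), 4))        # 0.8333
print(round(bray_curtis(quant_a, quant_b), 4))     # 0.3333
print(simple_matching_distance(qual_a, qual_b))    # 0.5
```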


Clustering is a data mining task devoted to the automatic grouping of data based on mutual similarity. Clustering in high-dimensional spaces is a recurrent problem in many domains. High dimensionality affects the time complexity, space complexity, scalability and accuracy of clustering methods. High-dimensional non-linear data usually live in different low-dimensional subspaces hidden in the original space. As high-dimensional objects appear almost alike, new approaches for clustering are required. This research has focused on developing mathematical models, techniques and clustering algorithms specifically for high-dimensional data. With the expansion of the fields of communication and technology, there is tremendous growth in high-dimensional data spaces. As the number of dimensions of high-dimensional non-linear data increases, many clustering techniques begin to suffer from the curse of dimensionality, degrading the quality of the results. In high-dimensional non-linear data, the data become very sparse and distance measures become increasingly meaningless. The principal challenge for clustering high-dimensional data is to overcome the “curse of dimensionality”. This research work concentrates on devising an enhanced algorithm for clustering high-dimensional non-linear data.
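The claim that distance measures lose meaning in high dimensions can be illustrated with a small experiment. This is a minimal sketch, assuming uniformly random points and Euclidean distance; the function name, sample size and seed are arbitrary choices and are not taken from the research described above.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from a random query point.

    Values near zero mean the nearest and farthest neighbours are almost
    equally far away, so distance carries little information.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)


for dim in (2, 20, 500):
    print(dim, relative_contrast(dim))
```

In this setup the contrast is expected to shrink as the dimension grows, which is one common way of stating the curse of dimensionality.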


Author(s):  
T. Gayathri ◽  
D. Lalitha Bhaskari

“Big data”, as the name suggests, is a collection of large and complicated data sets that are usually hard to process with on-hand data management tools or other conventional processing applications. This article presents a scalable signature-based subspace clustering approach that avoids the identification of redundant clusters. Various distance measures are used in experiments that validate the performance of the proposed algorithm. Also, for the same purpose of validation, the synthetic data sets that are chosen have different dimensions, and their size will be distributed when opened with Weka. The F1 quality measure and the runtime on these synthetic data sets are computed. The performance of the proposed algorithm is compared with existing clustering algorithms such as CLIQUE, INSCY and SUBCLU.


Author(s):  
B.Hari Babu ◽  
N.Subash Chandra ◽  
T. Venu Gopal

Clustering is the most prominent data mining technique used for grouping data into clusters based on distance measures. With the growth of high-dimensional data such as microarray gene expression data, grouping such data into clusters runs into the problem that similarity between objects in the full-dimensional space is often invalid, because the space contains different types of data. Grouping high-dimensional data into clusters is not accurate, and perhaps not up to the level of expectation, when the dimension of the dataset is high. The problem is now attracting tremendous attention in research and development. To address the performance issues of clustering high-dimensional data, issues such as dimensionality reduction, redundancy elimination, subspace clustering, co-clustering and data labeling for clusters need to be analyzed and improved. In this paper, we present a brief comparison of existing algorithms that focus mainly on clustering high-dimensional data.


Author(s):  
Dhayanithi Jaganathan ◽  
Akilandeswari Jeyapal

Researchers have recently been studying the clustering of data that are heterogeneous in nature. The data generated in many real-world applications, such as data from IoT environments and big data domains, are heterogeneous. Most available clustering algorithms deal with homogeneous data, and few algorithms discussed in the literature deal with data of mixed numeric and categorical nature. Applying a clustering algorithm designed for homogeneous data to heterogeneous data leads to information loss. This chapter proposes a new genetically modified k-medoid clustering algorithm (GMODKMD) that takes as input a fused distance matrix, obtained by applying an individual distance measure to each attribute based on its characteristics. GMODKMD is a modified algorithm in which the Davies-Bouldin index is applied in the iteration phase. The proposed algorithm is compared with existing techniques based on accuracy. The experimental results show that the modified algorithm with the fused distance matrix outperforms the existing clustering technique.
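The idea of fusing per-attribute distances into one matrix can be sketched as follows. This is an illustrative Gower-style construction, not the chapter's actual method: the abstract does not specify which measure is applied to each attribute type or how the results are fused, so the range-normalised absolute difference, the 0/1 mismatch and the plain average used here are all assumptions, as are the function name and the sample records.

```python
def fused_distance_matrix(rows, kinds):
    """Average one distance per attribute into a single symmetric matrix.

    kinds[j] is "num" (range-normalised absolute difference) or
    "cat" (0 if the values are equal, 1 otherwise).
    """
    n = len(rows)
    # Range of every numeric attribute; fall back to 1.0 for constant columns.
    ranges = {}
    for j, kind in enumerate(kinds):
        if kind == "num":
            column = [row[j] for row in rows]
            ranges[j] = (max(column) - min(column)) or 1.0

    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(i + 1, n):
            total = 0.0
            for j, kind in enumerate(kinds):
                if kind == "num":
                    total += abs(rows[i][j] - rows[k][j]) / ranges[j]
                else:
                    total += 0.0 if rows[i][j] == rows[k][j] else 1.0
            matrix[i][k] = matrix[k][i] = total / len(kinds)
    return matrix


# Hypothetical sensor records: (temperature, radio type).
records = [(20.0, "wifi"), (30.0, "wifi"), (40.0, "lora")]
D = fused_distance_matrix(records, ["num", "cat"])
print(D[0][1], D[0][2], D[1][2])  # 0.25 1.0 0.75
```

A matrix of this kind can be passed to any clustering algorithm that accepts precomputed distances, such as a k-medoid variant.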


2015 ◽  
Vol 2015 ◽  
pp. 1-17 ◽  
Author(s):  
Arindam Chaudhuri

Intuitionistic fuzzy sets (IFSs) provide a mathematical framework based on fuzzy sets to describe vagueness in data. They find interesting and promising applications in different domains. Here, we develop an intuitionistic fuzzy possibilistic C means (IFPCM) algorithm to cluster IFSs by hybridizing concepts of FPCM, IFSs, and distance measures. IFPCM resolves inherent problems encountered with information regarding membership values of objects to each cluster by generalizing membership and nonmembership with hesitancy degree. The algorithm is extended for clustering interval valued intuitionistic fuzzy sets (IVIFSs), leading to interval valued intuitionistic fuzzy possibilistic C means (IVIFPCM). The clustering algorithm has membership and nonmembership degrees as intervals. Information regarding membership and typicality degrees of samples to all clusters is given by the algorithm. The experiments are performed on both real and simulated datasets. The algorithm generates valuable information and produces overlapped clusters with different membership degrees. It takes into account inherent uncertainty in information captured by IFSs. Some advantages of the algorithm are simplicity, flexibility, and low computational complexity. The algorithm is evaluated through cluster validity measures. Its clustering accuracy is investigated on classification datasets with labeled patterns. The algorithm maintains appreciable performance compared to other methods in terms of pureness ratio.


2020 ◽  
Vol 6 (10) ◽  
Author(s):  
James Robertson ◽  
Kyrylo Bessonov ◽  
Justin Schonfeld ◽  
John H. E. Nash

Bacterial plasmids play a large role in allowing bacteria to adapt to changing environments and can pose a significant risk to human health if they confer virulence and antimicrobial resistance (AMR). Plasmids differ significantly in the taxonomic breadth of host bacteria in which they can successfully replicate; this is commonly referred to as ‘host range’ and is usually described in qualitative terms of ‘narrow’ or ‘broad’. Understanding the host range potential of plasmids is of great interest due to their ability to disseminate traits such as AMR through bacterial populations and into human pathogens. We developed the MOB-suite to facilitate characterization of plasmids and introduced a whole-sequence-based classification system based on clustering complete plasmid sequences using Mash distances (https://github.com/phac-nml/mob-suite). We updated the MOB-suite database from 12 091 to 23 671 complete sequences, representing 17 779 unique plasmids. With advances in new algorithms for rapidly calculating average nucleotide identity (ANI), we compared clustering characteristics using two different distance measures – Mash and ANI – and three clustering algorithms on the unique set of plasmids. The plasmid nomenclature is designed to group highly similar plasmids together that are unlikely to have multiple representatives within a single cell. Based on our results, we determined that clusters generated using Mash and complete-linkage clustering at a Mash distance of 0.06 resulted in highly homogeneous clusters while maintaining cluster size. The taxonomic distribution of plasmid biomarker sequences for replication and relaxase typing, in combination with MOB-suite whole-sequence-based clusters, has been examined in detail for all high-quality publicly available plasmid sequences.
We have incorporated prediction of plasmid replication host range into the MOB-suite based on observed distributions of these sequence features in combination with known plasmid hosts from the literature. Host range is reported as the highest taxonomic rank that covers all of the plasmids which share replicon or relaxase biomarkers or belong to the same MOB-suite cluster code. Reporting host range based on these criteria allows for comparisons of host range between studies and provides information for plasmid surveillance.
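Complete-linkage clustering at a fixed distance cut-off, as mentioned in this abstract, can be illustrated with a naive pure-Python sketch. This is not the MOB-suite implementation: the function name is hypothetical and the 4x4 distance matrix holds made-up values rather than real Mash distances. Only the 0.06 threshold comes from the abstract.

```python
def complete_linkage(dist, threshold):
    """Greedy complete-linkage clustering on a precomputed distance matrix.

    Two clusters are merged only if the LARGEST distance between any pair of
    their members is <= threshold, so every cluster stays tight.
    """
    clusters = [[i] for i in range(len(dist))]
    while True:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = max(dist[i][j] for i in clusters[a] for j in clusters[b])
                if d <= threshold and (best is None or d < best[0]):
                    best = (d, a, b)
        if best is None:
            return clusters
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]


# Made-up pairwise distances between four sequences.
toy = [
    [0.00, 0.02, 0.05, 0.30],
    [0.02, 0.00, 0.08, 0.28],
    [0.05, 0.08, 0.00, 0.25],
    [0.30, 0.28, 0.25, 0.00],
]
print(complete_linkage(toy, 0.06))  # [[0, 1], [2], [3]]
```

Item 2 lies within 0.06 of item 0 but not of item 1, so complete linkage keeps it separate. Single linkage would have merged it, which is why complete linkage tends to give more homogeneous clusters.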


2021 ◽  
Vol 11 (8) ◽  
pp. 3693
Author(s):  
Alberto Blazquez-Herranz ◽  
Juan-Ignacio Caballero-Garzon ◽  
Albert Zilverberg ◽  
Christian Wolff ◽  
Alejandro Rodríguez-Gonzalez ◽  
...  

Mobile devices equipped with sensors are generating large amounts of geospatial data that, properly analyzed, can be used for future applications. In particular, being able to identify similar trajectories is crucial for analyzing events at common points in the trajectories. CROSS-CPP is a European project whose main aim is to provide tools to store data in a data market and a toolbox to analyze the data. As part of these analytic tools, a set of functionalities has been developed to cluster trajectories. Based on previous work on clustering algorithms, we present in this paper an adaptation of the QuickBundles algorithm to trajectory clustering. Experiments using different distance measures show that QuickBundles outperforms spectral clustering, with the WGS84 geodesic distance being the one that provides the best results.
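A distance between two GPS fixes is the building block of any trajectory distance. The sketch below uses the haversine formula, which treats the Earth as a sphere. It is a simpler stand-in for a true geodesic on the WGS84 ellipsoid, not the measure used in the paper, and the function name and radius constant are choices made for this example.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius (spherical approximation)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    # Clamp to 1.0 to guard against rounding error near antipodal points.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


# One degree of longitude along the equator is roughly 111 km.
print(haversine_km(0.0, 0.0, 0.0, 1.0))
```

The spherical result can differ from an ellipsoidal geodesic by a fraction of a percent, so a dedicated geodesy library is the safer choice when that matters.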


2021 ◽  
Author(s):  
Preethy Sambamoorthy

In most current research on Quality of Service (QoS) based web service selection, searching is usually the dominant way to find the desired services. This approach comes with the potential problem of framing search queries properly, owing to the requestor's lack of knowledge of, or vague requirements for, QoS attribute values. In this thesis, we propose an interactive QoS browsing mechanism that uses the concept of clustering to present the QoS value distribution to requestors, followed by finer views of service quality. By analyzing various QoS attributes, we believe that symbolic interval data is a more appropriate representation than single-valued numerical data. Therefore, we use interval data clustering algorithms to implement our browsing system. We conducted experiments on simulated QoS datasets to compare the performance of different distance measures and to show the effectiveness of the interval data clustering algorithm used. The results of the experiments show that the proposed approach provides an effective, user-guided, QoS-based service selection approach that can conceivably overcome the problems with current approaches.
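One commonly used distance for interval-valued data is the Hausdorff distance, which for two closed intervals reduces to the larger of the two endpoint differences. The thesis abstract does not name the measures it compared, so the sketch below is only an illustration. The function names, the summation across attributes and the QoS values are assumptions.

```python
def hausdorff_interval(a, b):
    """Hausdorff distance between closed intervals a = (lo, hi) and b = (lo, hi)."""
    return max(abs(a[0] - b[0]), abs(a[1] - b[1]))


def interval_record_distance(x, y):
    """Sum of per-attribute Hausdorff distances between two interval-valued records."""
    return sum(hausdorff_interval(a, b) for a, b in zip(x, y))


# Hypothetical services described by (response time in ms, availability) intervals.
service_a = [(100, 200), (0.95, 0.99)]
service_b = [(150, 300), (0.90, 0.99)]

print(hausdorff_interval(service_a[0], service_b[0]))  # 100
print(interval_record_distance(service_a, service_b))
```

Because the attributes sit on very different scales, each one would normally be normalised before the distances are summed.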

