Comparison of Distance Measures in Cluster Analysis with Dichotomous Data

Although distance measures are used in many machine learning algorithms, the literature on the context-independent selection and evaluation of distance measures is limited in the sense that prior knowledge is used. In cluster analysis, current studies evaluate the choice of distance measure after applying unsupervised methods based on error probabilities, implicitly setting the goal of reproducing predefined partitions in data. Such studies use clusters of data that are often based on the context of the data as well as the custom goal of the specific study. Depending on the data context, different properties of distance distributions are judged to be relevant for appropriate distance selection. However, if cluster analysis is based on the task of finding similar partitions of data, then the intrapartition distances should be smaller than the interpartition distances. By systematically investigating this specification using distribution analysis through the mirrored-density plot (MD plot), it is shown that multimodal distance distributions are preferable in cluster analysis. As a consequence, it is advantageous to model distance distributions with Gaussian mixtures prior to the evaluation phase of unsupervised methods. Experiments are performed on several artificial and natural datasets for the task of clustering.
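The requirement that intrapartition distances be smaller than interpartition distances can be checked directly on a toy example. The following is a minimal sketch, not the authors' procedure: the helper `split_distances`, the six sample points, and the choice of Euclidean distance are all assumptions made for illustration.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def split_distances(points, labels):
    """Split all pairwise distances into intra- and interpartition lists."""
    intra, inter = [], []
    for i, j in combinations(range(len(points)), 2):
        d = dist(points[i], points[j])
        # Same label: the pair lies within one partition; otherwise between two.
        (intra if labels[i] == labels[j] else inter).append(d)
    return intra, inter


# Hypothetical toy data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels = [0, 0, 0, 1, 1, 1]

intra, inter = split_distances(points, labels)
print("largest intrapartition distance:", max(intra))
print("smallest interpartition distance:", min(inter))
```

On data like this, the pooled distance distribution has two clearly separated modes, which is the kind of multimodality the abstract describes as preferable.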

A Comparative Fuzzy Cluster Analysis of the Binder’s Performance Grades Using Fuzzy Equivalence Relation via Different Distance Measures

Communications in Computer and Information Science - Advanced Informatics for Computing Research ◽

10.1007/978-981-13-3140-4_11 ◽

2018 ◽

pp. 108-118

Author(s):

Rajesh Kumar Chandrawat ◽

Rakesh Kumar ◽

Varinda Makkar ◽

Manisha Yadav ◽

Pratibha Kumari

Keyword(s):

Cluster Analysis ◽

Equivalence Relation ◽

Fuzzy Cluster ◽

Distance Measures ◽

Fuzzy Cluster Analysis

Signal-to-noise sensitivity of distance measures in hierarchical cluster analysis for Raman spectral imaging

10.1117/12.2615730 ◽

2021 ◽

Author(s):

Ann-Kathrin Kniggendorf ◽

Regina Nogueira ◽

Bernhard Roth

Keyword(s):

Cluster Analysis ◽

Hierarchical Cluster Analysis ◽

Spectral Imaging ◽

Hierarchical Cluster ◽

Distance Measures ◽

Noise Sensitivity ◽

Signal To Noise ◽

Raman Spectral

Microsatellite Markers Developed from `Bluecrop' Reveal Polymorphisms in the Genus Vaccinium and Are Suitable for Cultivar Fingerprinting

HortScience ◽

10.21273/hortsci.40.4.1122b ◽

2005 ◽

Vol 40 (4) ◽

pp. 1122B-1122 ◽

Cited By ~ 1

Author(s):

Peter Boches ◽

Lisa J. Rowland ◽

Kim Hummer ◽

Nahla V. Bassil

Keyword(s):

Genetic Diversity ◽

Cluster Analysis ◽

Genetic Distance ◽

Microsatellite Markers ◽

Genetic Relationships ◽

Distance Measures ◽

Hybrid Origin ◽

Pedigree Information ◽

Species Relationships ◽

Ssr Loci

Microsatellite markers for blueberry (Vaccinium L.) were created from a preexisting blueberry expressed sequence tag (EST) library of 1305 sequences and a microsatellite-enriched genomic library of 136 clones. Microsatellite primers for 65 EST-containing simple sequence repeats (SSRs) and 29 genomic SSR were initially tested for amplification and polymorphism on agarose gels. Potential usefulness of these SSRs for estimating species relationships in the genus was assessed through cross-species transference of 45 SSR loci and cluster analysis using genetic distance values from five highly polymorphic EST-SSR loci. Cross-species amplification for 45 SSR loci ranged from 17% to 100%, and was 83% on average in nine sections. Cluster analysis of 59 Vaccinium species based on genetic distance measures obtained from 5 EST-SSR loci supported the concept of V. elliotii Chapm. as a genetically distinct diploid highbush species and indicated that V. ashei Reade is of hybrid origin. Twenty EST-SSR and 10 genomic microsatellite loci were used to determine genetic diversity in 72 tetraploid V. corymbosum L. accessions consisting mostly of common cultivars. Unique fingerprints were obtained for all accessions analyzed. Genetic relationships, based on microsatellites, corresponded well with known pedigree information. Most modern cultivars clustered closely together, but southern highbush and northern highbush cultivars were sufficiently differentiated to form distinct clusters. Future use of microsatellites in Vaccinium will help resolve species relationships in the genus, estimate genetic diversity in the National Clonal Germplasm Repository (NCGR) collection, and confirm the identity of clonal germplasm accessions.

Comparison of Cluster Validity Index and Distance Measures Using Integrated Cluster Analysis and Structural Equation Modeling the Warp-PLS Approach

Journal of Southwest Jiaotong University ◽

10.35741/issn.0258-2724.56.3.13 ◽

2021 ◽

Vol 56 (3) ◽

pp. 157-168

Author(s):

Adji Achmad Rinaldo Fernandes ◽

Solimun ◽

Nurjannah ◽

Usfi Al Imama Billah ◽

Ni Made Ayu Astari Badung

Keyword(s):

Cluster Analysis ◽

Service Quality ◽

Willingness To Pay ◽

Distance Measures ◽

Primary Data ◽

Equation Modeling ◽

Average Score ◽

Cluster Validity ◽

Cluster Validity Indices ◽

Validity Indices

This study compares the Integrated Cluster Analysis and SEM model of the Warp-PLS approach across various cluster validity indices and distance measures, applied to Service Quality, Environment, Fashion, Willingness to Pay, and Compliant Paying Behavior of Bank X customers. The study uses primary data obtained through a questionnaire with a Likert scale; each variable is measured as the average score of its items. The sampling technique was purposive sampling, and the observations are 100 customer respondents. The data were analyzed quantitatively, beginning with a descriptive analysis. The Integrated Cluster Analysis and SEM analysis of the Warp-PLS approach was then carried out with the average linkage method on various cluster validity indices and three distance measures. The integrated cluster and SEM model of the Warp-PLS approach performs better with the Gap Index, C Index, Global Silhouette, and Goodman-Kruskal under the Manhattan distance than with the same indices under the Euclidean and Minkowski distances. The novelty of this research is the application of Integrated Cluster Analysis and SEM of the Warp-PLS approach to compare four cluster validity indices (Gap Index, C Index, Global Silhouette, and Goodman-Kruskal) and three distance measures (Euclidean, Manhattan, and Minkowski) simultaneously.
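The three distance measures named in this abstract are related: Manhattan and Euclidean distance are the Minkowski distance with order p = 1 and p = 2. The sketch below illustrates that relationship only; the function name `minkowski` and the sample vectors are assumptions, and the abstract does not say which order p the study used for its Minkowski distance.

```python
def minkowski(x, y, p):
    """Minkowski distance of order p between two equal-length vectors."""
    if p < 1:
        raise ValueError("p must be at least 1 for a valid metric")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


# Hypothetical pair of observations, for illustration only.
x, y = (0.0, 0.0), (3.0, 4.0)

manhattan = minkowski(x, y, 1)  # p = 1: sum of absolute differences
euclidean = minkowski(x, y, 2)  # p = 2: straight-line distance
order3 = minkowski(x, y, 3)     # p = 3 is an arbitrary choice for comparison

print(manhattan, euclidean, order3)
```

For a fixed pair of points the distance does not increase as p grows, so rankings of pairs can differ between measures, which is one reason the choice can change a clustering result.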

Optimized consortium formation through cluster analysis

Problems and Perspectives in Management ◽

10.21511/ppm.14(1).2016.13 ◽

2016 ◽

Vol 14 (1) ◽

pp. 117-126 ◽

Cited By ~ 1

Author(s):

Kgwadi M. Mampana ◽

Solly M. Seeletse ◽

Enoch M. Sithole

Keyword(s):

Cluster Analysis ◽

Optimal Solution ◽

Fixed Number ◽

Distance Measures ◽

Single Procedure

Some problems cannot be solved optimally and compromises become necessary. In some cases, obtaining an optimal solution may require combining algorithms and iterations. This often occurs when the problem is complex and a single procedure does not reach optimality. This paper shows a conglomerate of algorithms iterated in tasks to form an optimal consortium using cluster analysis. Hierarchical methods and distance measures lead the process. Few companies are desirable in optimal consortium formation. However, this study shows that optimization cannot be predetermined based on a specific fixed number of companies. The experiential exercise forms an optimal consortium of four companies from six shortlisted competitors.

Associations among similarity and distance measures for binary data in cluster analysis

Advances in Methodology and Statistics ◽

10.51936/yelx5179 ◽

2020 ◽

Vol 17 (1) ◽

Author(s):

Jana Cibulková ◽

Zdenek Šulc ◽

Hana Řezanková ◽

Sergej Sirota

Keyword(s):

Cluster Analysis ◽

Hierarchical Clustering ◽

Hierarchical Cluster Analysis ◽

Binary Data ◽

Evaluation Criteria ◽

Similarity Measures ◽

Hierarchical Cluster ◽

Distance Measures ◽

External Evaluation ◽

Insight Into

The paper focuses on similarity and distance measures for binary data and their application in cluster analysis. It analyzes 66 measures for binary data in order to provide comprehensive insight into the topic and a well-arranged overview of the measures. For this purpose, the formulas by which they are defined are studied. In the next part of the research, the results of object clustering on generated datasets are compared, and the ability of the measures to create similar or identical clustering solutions is evaluated. This is done by using chosen internal and external evaluation criteria and by comparing the assignments of objects into clusters in the process of hierarchical clustering. The paper shows which similarity measures and distance measures for binary data lead to similar or even identical results in hierarchical cluster analysis.
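Similarity measures for binary data are usually defined from the four counts of a 2x2 contingency table for a pair of objects. As a minimal sketch, the code below computes two widely used measures, the Jaccard coefficient and the simple matching coefficient. That these two are among the 66 measures analyzed is an assumption, and the function names and sample vectors are hypothetical.

```python
def contingency(x, y):
    """Return (a, b, c, d): counts of 1/1, 1/0, 0/1 and 0/0 positions."""
    a = sum(1 for p, q in zip(x, y) if p == 1 and q == 1)
    b = sum(1 for p, q in zip(x, y) if p == 1 and q == 0)
    c = sum(1 for p, q in zip(x, y) if p == 0 and q == 1)
    d = sum(1 for p, q in zip(x, y) if p == 0 and q == 0)
    return a, b, c, d


def jaccard(x, y):
    """Jaccard similarity: ignores joint absences (d)."""
    a, b, c, _ = contingency(x, y)
    return a / (a + b + c) if (a + b + c) else 1.0


def simple_matching(x, y):
    """Simple matching similarity: counts joint absences as agreement."""
    a, b, c, d = contingency(x, y)
    return (a + d) / (a + b + c + d)


# Hypothetical binary objects, for illustration only.
x = [1, 1, 0, 0, 1, 0]
y = [1, 0, 0, 1, 1, 0]

print(contingency(x, y), jaccard(x, y), simple_matching(x, y))
```

A dissimilarity for hierarchical clustering can be obtained as one minus the similarity. The two measures differ only in how they treat joint absences, which is the kind of formula-level relationship the paper examines.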

k-Medoids Clustering Method with Silhouette Index and C-Index Validation (Case Study: Number of Crimes in the Regencies/Cities of Central Java in 2018)

Jurnal Gaussian ◽

10.14710/j.gauss.v8i2.26640 ◽

2019 ◽

Vol 8 (2) ◽

pp. 161-170

Author(s):

Milla Alifatun Nahdliyah ◽

Tatik Widiharih ◽

Alan Prahutama

Keyword(s):

Cluster Analysis ◽

Euclidean Distance ◽

Clustering Algorithm ◽

Distance Measures ◽

Cluster Center ◽

Silhouette Width ◽

Central Java ◽

Silhouette Index ◽

Optimal Cluster

The k-medoids method is a non-hierarchical clustering method that classifies n objects into k clusters with shared characteristics. This clustering algorithm uses the medoid as its cluster center. The medoid is the most centrally located object in a cluster, so it is robust to outliers. In cluster analysis, objects are grouped by similarity. Similarity can be measured with distance measures such as Euclidean distance and cityblock distance. The distance used in cluster analysis can affect the clustering results. The quality of the clustering results can then be assessed with internal criteria, namely the silhouette width and the C-index. In this research, the k-medoids method is used to classify regencies/cities in Central Java based on the type and number of crimes. The optimal clustering is obtained at k = 4 with Euclidean distance, where the silhouette index = 0.3862593 and the C-index = 0.043893. Keywords: Clustering, k-Medoids, Euclidean distance, Cityblock distance, Silhouette index, C-index, Crime
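Two ideas from this abstract, the medoid as a cluster center and the silhouette width as an internal criterion, can be illustrated on a fixed partition. This is a minimal sketch rather than a full k-medoids implementation or the authors' analysis: the helper names, the toy clusters, and the convention of scoring singleton clusters as 0 are assumptions.

```python
from math import dist  # Euclidean distance, Python 3.8+


def cityblock(p, q):
    """Cityblock (Manhattan) distance between two points."""
    return sum(abs(a - b) for a, b in zip(p, q))


def medoid(cluster, metric):
    """Object with the smallest total distance to the other cluster members."""
    return min(cluster, key=lambda c: sum(metric(c, o) for o in cluster))


def mean_silhouette(clusters, metric):
    """Average silhouette width over all objects of a given partition."""
    widths = []
    for k, cluster in enumerate(clusters):
        for p in cluster:
            if len(cluster) == 1:
                widths.append(0.0)  # assumed convention for singletons
                continue
            # a: mean distance to the other members of the same cluster.
            a = sum(metric(p, o) for o in cluster) / (len(cluster) - 1)
            # b: mean distance to the nearest other cluster.
            b = min(sum(metric(p, o) for o in other) / len(other)
                    for m, other in enumerate(clusters) if m != k)
            widths.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(widths) / len(widths)


# Hypothetical partition; (8, 1) acts as an outlier in the first cluster.
clusters = [[(1, 1), (1, 2), (2, 1), (8, 1)],
            [(9, 9), (9, 10), (10, 9)]]

center = medoid(clusters[0], dist)
sil_euclidean = mean_silhouette(clusters, dist)
sil_cityblock = mean_silhouette(clusters, cityblock)
print(center, sil_euclidean, sil_cityblock)
```

The medoid of the first cluster stays among the three nearby points despite the outlier, and the two metrics give different silhouette values for the same partition, which mirrors the abstract's remark that the distance used can affect the results.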

Clusteranalysis as a swine farm qualifying method

Acta Agraria Debreceniensis ◽

10.34101/actaagrar/27/3121 ◽

2007 ◽

pp. 165-174

Author(s):

Sándor Kovács ◽

Péter Balogh

Keyword(s):

Cluster Analysis ◽

Statistical Methods ◽

Theoretical Background ◽

Distance Measures ◽

Swine Farm ◽

Hierarchical Method

Cluster Analysis is one of the most popular multivariate statistical methods and is in fact a special type of aggregation method. Observations are clustered according to their variables. Our purpose is to create clusters whose elements are as similar as possible within a cluster and as different as possible between clusters. For example, these clusters could be the qualitative classifications of farms. Cluster Analysis offers several methods as well as numerous distance measures. In this article, we study these methods and measures. After presenting the theoretical background, we apply the method in a case study to check the qualitative classifications made by experts. In this study, we use both the hierarchical and the non-hierarchical method and compare them. We would like to draw attention to the fact that the most important problem of the analysis is determining the optimal number of clusters.
