Comparison of Distance Measures in Cluster Analysis with Dichotomous Data

2021 ◽  
Vol 3 (1) ◽  
pp. 85-100
Author(s):  
Holmes Finch
Author(s):  
Michael C. Thrun

Although distance measures are used in many machine learning algorithms, the literature on the context-independent selection and evaluation of distance measures is limited, in the sense that prior knowledge is used. In cluster analysis, current studies evaluate the choice of distance measure after applying unsupervised methods, based on error probabilities, which implicitly sets the goal of reproducing predefined partitions in data. Such studies use clusters of data that are often based on the context of the data as well as the custom goal of the specific study. Depending on the data context, different properties of distance distributions are judged to be relevant for appropriate distance selection. However, if cluster analysis is based on the task of finding similar partitions of data, then the intrapartition distances should be smaller than the interpartition distances. By systematically investigating this specification using distribution analysis through the mirrored-density plot (MD plot), it is shown that multimodal distance distributions are preferable in cluster analysis. As a consequence, it is advantageous to model distance distributions with Gaussian mixtures prior to the evaluation phase of unsupervised methods. Experiments are performed on several artificial and natural datasets for the task of clustering.
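The specification that intrapartition distances should be smaller than interpartition distances can be illustrated with a minimal sketch. The toy points, the labels, and the helper name `split_distances` are hypothetical and chosen for illustration; the sketch uses Euclidean distance only and does not reproduce the MD plot or the Gaussian mixture modelling described in the abstract.

```python
from itertools import combinations
from math import dist  # Euclidean distance (Python 3.8+)
from statistics import mean


def split_distances(points, labels):
    """Split all pairwise Euclidean distances into intra- and interpartition lists."""
    intra, inter = [], []
    for i, j in combinations(range(len(points)), 2):
        d = dist(points[i], points[j])
        # Same label: the pair lies within one partition; otherwise it spans two.
        (intra if labels[i] == labels[j] else inter).append(d)
    return intra, inter


# Hypothetical toy data: two well-separated groups in the plane.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
labels = [0, 0, 0, 1, 1, 1]

intra, inter = split_distances(points, labels)
print("mean intrapartition distance:", mean(intra))
print("mean interpartition distance:", mean(inter))
```

For data like this, the pooled distances form two clearly separated groups of values, which is the kind of multimodal distance distribution the abstract argues is preferable.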


HortScience ◽  
2005 ◽  
Vol 40 (4) ◽  
pp. 1122B-1122 ◽  
Author(s):  
Peter Boches ◽  
Lisa J. Rowland ◽  
Kim Hummer ◽  
Nahla V. Bassil

Microsatellite markers for blueberry (Vaccinium L.) were created from a preexisting blueberry expressed sequence tag (EST) library of 1305 sequences and a microsatellite-enriched genomic library of 136 clones. Microsatellite primers for 65 EST-containing simple sequence repeats (SSRs) and 29 genomic SSRs were initially tested for amplification and polymorphism on agarose gels. Potential usefulness of these SSRs for estimating species relationships in the genus was assessed through cross-species transference of 45 SSR loci and cluster analysis using genetic distance values from five highly polymorphic EST-SSR loci. Cross-species amplification for 45 SSR loci ranged from 17% to 100%, and was 83% on average in nine sections. Cluster analysis of 59 Vaccinium species based on genetic distance measures obtained from five EST-SSR loci supported the concept of V. elliottii Chapm. as a genetically distinct diploid highbush species and indicated that V. ashei Reade is of hybrid origin. Twenty EST-SSR and 10 genomic microsatellite loci were used to determine genetic diversity in 72 tetraploid V. corymbosum L. accessions consisting mostly of common cultivars. Unique fingerprints were obtained for all accessions analyzed. Genetic relationships, based on microsatellites, corresponded well with known pedigree information. Most modern cultivars clustered closely together, but southern highbush and northern highbush cultivars were sufficiently differentiated to form distinct clusters. Future use of microsatellites in Vaccinium will help resolve species relationships in the genus, estimate genetic diversity in the National Clonal Germplasm Repository (NCGR) collection, and confirm the identity of clonal germplasm accessions.


2021 ◽  
Vol 56 (3) ◽  
pp. 157-168
Author(s):  
Adji Achmad Rinaldo Fernandes ◽  
Solimun ◽  
Nurjannah ◽  
Usfi Al Imama Billah ◽  
Ni Made Ayu Astari Badung

This study compares the integrated cluster analysis and SEM model of the Warp-PLS approach across various cluster validity indices and distance measures, applied to Service Quality, Environment, Fashion, Willingness to Pay, and Compliant Paying Behavior of Bank X customers. The study uses primary data obtained through a Likert-scale questionnaire, with each variable measured by the average score of its items. Respondents were selected by purposive sampling, giving a sample of 100 customers. The data were analyzed quantitatively, beginning with a descriptive analysis. The integrated cluster analysis and SEM analysis of the Warp-PLS approach was then carried out with the average linkage method on various cluster validity indices and three distance measures. The integrated cluster and SEM model performed better with the Gap Index, C Index, Global Silhouette, and Goodman-Kruskal index under the Manhattan distance than with the same indices under the Euclidean and Minkowski distances. The novelty of this research is the application of integrated cluster analysis and SEM of the Warp-PLS approach to compare, simultaneously, four cluster validity indices (Gap Index, C Index, Global Silhouette, and Goodman-Kruskal) and three distance measures (Euclidean, Manhattan, and Minkowski).


2016 ◽  
Vol 14 (1) ◽  
pp. 117-126 ◽  
Author(s):  
Kgwadi M. Mampana ◽  
Solly M. Seeletse ◽  
Enoch M. Sithole

Some problems cannot be solved optimally, and compromises become necessary. In some cases, obtaining an optimal solution may require combining algorithms and iterations. This often occurs when the problem is complex and a single procedure does not reach optimality. This paper shows a conglomerate of algorithms iterated in tasks to form an optimal consortium using cluster analysis. Hierarchical methods and distance measures lead the process. Few companies are desirable in optimal consortium formation. However, this study shows that optimization cannot be predetermined based on a specific fixed number of companies. The experiential exercise forms an optimal consortium of four companies from six shortlisted competitors.


2020 ◽  
Vol 17 (1) ◽  
Author(s):  
Jana Cibulková ◽  
Zdenek Šulc ◽  
Hana Řezanková ◽  
Sergej Sirota

The paper focuses on similarity and distance measures for binary data and their application in cluster analysis. It analyzes 66 measures for binary data in order to provide comprehensive insight into the topic and a well-arranged overview of the measures. For this purpose, the formulas by which the measures are defined are studied. In the next part of the research, the results of object clustering on generated datasets are compared, and the ability of the measures to create similar or identical clustering solutions is evaluated. This is done by using selected internal and external evaluation criteria and by comparing the assignments of objects to clusters in the process of hierarchical clustering. The paper shows which similarity and distance measures for binary data lead to similar or even identical results in hierarchical cluster analysis.
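As a rough illustration of how such measures are defined, the sketch below computes two widely known coefficients for binary vectors, simple matching and Jaccard, from the usual 2×2 counts. The abstract does not name the 66 measures, so the choice of these two, the function names, and the example vectors are assumptions made for illustration.

```python
def contingency(x, y):
    """Return the counts (a, b, c, d) for two equal-length binary vectors.

    a: both 1, b: only x is 1, c: only y is 1, d: both 0.
    """
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)
    d = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 0)
    return a, b, c, d


def simple_matching(x, y):
    """Share of positions on which the two vectors agree (1-1 or 0-0)."""
    a, b, c, d = contingency(x, y)
    return (a + d) / (a + b + c + d)


def jaccard(x, y):
    """Share of 1-1 matches among positions where at least one vector is 1."""
    a, b, c, _ = contingency(x, y)
    # Assumed convention: two all-zero vectors are treated as identical.
    return a / (a + b + c) if (a + b + c) else 1.0


# Hypothetical binary objects.
x = [1, 1, 0, 0, 1, 0]
y = [1, 0, 0, 0, 1, 1]

print(simple_matching(x, y), 1 - simple_matching(x, y))  # similarity, distance
print(jaccard(x, y), 1 - jaccard(x, y))                  # similarity, distance
```

The two coefficients differ only in whether joint absences (`d`) count as agreement, which is one reason different binary measures can produce different clustering solutions.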


2019 ◽  
Vol 8 (2) ◽  
pp. 161-170
Author(s):  
Milla Alifatun Nahdliyah ◽  
Tatik Widiharih ◽  
Alan Prahutama

The k-medoids method is a non-hierarchical clustering method that classifies n objects into k clusters whose members share similar characteristics. The algorithm uses the medoid as the cluster center. The medoid is the most centrally located object in a cluster, which makes the method robust to outliers. In cluster analysis, objects are grouped by similarity, which can be measured with distance measures such as the Euclidean distance and the cityblock distance. The distance used in cluster analysis can affect the clustering results. The quality of the clustering results can be assessed with internal criteria, namely the silhouette width and the C-index. In this research, the k-medoids method is used to classify the regencies/cities of Central Java based on the type and number of crimes. The optimal clustering is obtained at k = 4 with the Euclidean distance, where the silhouette index = 0.3862593 and the C-index = 0.043893. Keywords: Clustering, k-Medoids, Euclidean distance, Cityblock distance, Silhouette index, C-index, Crime
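The idea of clustering around medoids with a choice of distance can be sketched as follows. This is a simple alternating assign-and-update heuristic, not necessarily the algorithm the authors used (the abstract does not say). The function names, the first-k initialization, and the toy data are assumptions for illustration.

```python
from math import dist


def euclidean(p, q):
    return dist(p, q)


def cityblock(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))


def assign(points, medoids, distance):
    """Label each point with the index of its nearest medoid."""
    return [
        min(range(len(medoids)), key=lambda c: distance(p, points[medoids[c]]))
        for p in points
    ]


def k_medoids(points, k, distance, max_iter=100):
    """Alternate between assigning points and re-choosing each cluster's medoid."""
    medoids = list(range(k))  # assumption: the first k objects start as medoids
    for _ in range(max_iter):
        labels = assign(points, medoids, distance)
        new_medoids = []
        for c in range(k):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:  # keep the old medoid if a cluster ends up empty
                new_medoids.append(medoids[c])
                continue
            # The medoid is the member with the smallest total distance to the rest.
            new_medoids.append(
                min(members, key=lambda i: sum(distance(points[i], points[j]) for j in members))
            )
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(points, medoids, distance)


# Hypothetical toy data: two compact groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
for name, fn in (("euclidean", euclidean), ("cityblock", cityblock)):
    print(name, k_medoids(points, 2, fn))
```

On this toy data both distances yield the same partition; on real data they may not, which is why the choice of distance is evaluated with internal criteria such as the silhouette width.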


2007 ◽  
pp. 165-174
Author(s):  
Sándor Kovács ◽  
Péter Balogh

Cluster analysis is one of the most popular multivariate statistical methods and is in effect a special type of aggregating method. Observations are clustered according to the variables that belong to them. Our purpose is to create clusters whose elements are as similar as possible within each cluster and differ as much as possible between clusters. For example, these clusters could be the qualitative classifications of farms. Cluster analysis offers several methods as well as numerous distance measures that can be used. In this article, we study all of these methods and measures. After presenting the theoretical background, we apply the method in a case study to check the qualitative classifications made by experts. In this study, we use both the hierarchical and the non-hierarchical method and compare them. We would like to draw attention to the fact that the most important problem of the analysis is determining the optimal number of clusters.

