scholarly journals Statistical properties of the single linkage hierarchical clustering estimator

2017 ◽  
Vol 185 ◽  
pp. 15-28 ◽  
Author(s):  
Dekang Zhu ◽  
Dan P. Guralnik ◽  
Xuezhi Wang ◽  
Xiang Li ◽  
Bill Moran
2017 ◽  
Vol 14 (1) ◽  
Author(s):  
Zdeněk Šulc ◽  
Martin Matějka ◽  
Jiří Procházka ◽  
Hana Řezanková

This paper thoroughly examines three recently introduced modifications of the Gower coefficient, which were determined for data with mixed-type variables in hierarchical clustering. On the contrary to the original Gower coefficient, which only recognizes if two categories match or not in the case of nominal variables, the examined modifications offer three different approaches to measuring the similarity between categories. The examined dissimilarity measures are compared and evaluated regarding the quality of their clusters measured by three internal indices (Dunn, silhouette, McClain) and regarding their classification abilities measured by the Rand index. The comparison is performed on 810 generated datasets. In the analysis, the performance of the similarity measures is evaluated by different data characteristics (the number of variables, the number of categories, the distance of clusters, etc.) and by different hierarchical clustering methods (average, complete, McQuitty and single linkage methods). As a result, two modifications are recommended for the use in practice.


2021 ◽  
Vol 15 (2) ◽  
pp. 63
Author(s):  
Desy Exasanti ◽  
Arief Jananto

Abstrak−Klasterisasi merupakan metode pengelompokan dari data yang sudah diketahui label kelasnya untuk menemukan klaster baru dari hasil observasi. Dalam klasterisasi banyak metode yaitu metode terpusat, hirarki, kepadatan dan berbasis kisi, namun dalam penelitian yang dilakukan ini dipilih metode berbasis hirarki. Metode hirarki ini bekerja melakukan pengelompokan objek dengan membentuk hirarki klaster namun bukan berarti selalu digambarkan dengan hirarki dalam organsasi. Dipilihnya Agglomerative Hierarchical Clustering dimana merupakan jenis dari bawah ke atas atau biasa disebut (bottom-up) dalam metode ini objek yang akan diuji dianggap sebagai objek tunggal sebagai klaster dan lalu dilakukan iterasi untuk menemukan klaster-klaster yang lebih besar. Data yang akan digunakan adalah data non-kebakaran pada Dinas Pemadam Kebakaran Kota Semarang ynng mana akan dilakukan pengelompokan wilayah penanganan non-kebakaran. Dinas Pemadam Kebakaran melakukan penanganan bukan hanya kebakaran saja namun ada banyak hal yang sebenarnya dapat ditangani oleh petugas pemadam kebakaran, kejadian non-kebakaran ada beberapa seperti evakuasi reptil, evakuasi kucing, penyelamatan korban kecelakaan dan lain sebagainya. Dari data non-kebakaran dari 16 kecamatan di Kota Semarang pada tahun 2019 akan dilakukan uji menggunakan tiga algoritma yaitu Single Lingkage, Average Linkage dan Complete Linkage . Adapun dari algoritma Single Linkage dilakukan prosedur pemusatan dari jarak terkecil antar objek data, algoritma Average Linkage dilakukan prosedur dari jarak rata-rata objek data, sedangkan jika algoritma Complete Linkage dilakukan prosedur pemusatan dari jarak yang terbesar. Implementasi dan visualiasi dari data uji coba yang dilakukan di penilitian ini menggunakan tools WEKA 3.8.4, Wakaito Environment Analysis for Knowledge atau yang biasa dikenal dengan WEKA ini merupakan software yang menggunakan bahasa pemrograman java. Dari dataset 380 data diambil sampel 100 data untuk diuji mengunakan WEKA menggunakan metode perhtungan jarak Manhattan Distance dengan 3 cluster. Hasil dari data uji coba dapat divisualisasikan dengan visualisasi dendogram pada fitur visualize tree  dan jika dilakukan visualisasi dalam bentuk grafik dapat dilakukan menggunakan fitur visualize clusters assignment.


Author(s):  
Satoshi Takumi ◽  
◽  
Sadaaki Miyamoto

The aim of this paper is to study methods of twofold membership clustering using the nearest prototype and nearest neighbor. The former uses theK-means, whereas the latter extends the single linkage in agglomerative hierarchical clustering. The concept of inductive clustering is moreover used for the both methods, which means that natural classification rules are derived as the results of clustering, a typical example of which is the Voronoi regions inK-means clustering. When the rule of nearest prototype allocation inK-means is replaced by nearest neighbor classification, we have inductive clustering related to the single linkage in agglomerative hierarchical clustering. The former method usesK-means or fuzzyc-means with noise clusters, whereby twofold memberships are derived; the latter method also derives two memberships in a different manner. Theoretical properties of the both methods are studied. Illustrative examples show implications and significances of this concept.


2016 ◽  
Vol 33 (1) ◽  
pp. 118-140 ◽  
Author(s):  
Alvaro Martínez-Pérez

2020 ◽  
Vol 6 (8) ◽  
Author(s):  
Martinique Frentrup ◽  
Zhemin Zhou ◽  
Matthias Steglich ◽  
Jan P. Meier-Kolthoff ◽  
Markus Göker ◽  
...  

Clostridioides difficile is the primary infectious cause of antibiotic-associated diarrhea. Local transmissions and international outbreaks of this pathogen have been previously elucidated by bacterial whole-genome sequencing, but comparative genomic analyses at the global scale were hampered by the lack of specific bioinformatic tools. Here we introduce a publicly accessible database within EnteroBase (http://enterobase.warwick.ac.uk) that automatically retrieves and assembles C. difficile short-reads from the public domain, and calls alleles for core-genome multilocus sequence typing (cgMLST). We demonstrate that comparable levels of resolution and precision are attained by EnteroBase cgMLST and single-nucleotide polymorphism analysis. EnteroBase currently contains 18 254 quality-controlled C. difficile genomes, which have been assigned to hierarchical sets of single-linkage clusters by cgMLST distances. This hierarchical clustering is used to identify and name populations of C. difficile at all epidemiological levels, from recent transmission chains through to epidemic and endemic strains. Moreover, it puts newly collected isolates into phylogenetic and epidemiological context by identifying related strains among all previously published genome data. For example, HC2 clusters (i.e. chains of genomes with pairwise distances of up to two cgMLST alleles) were statistically associated with specific hospitals (P<10−4) or single wards (P=0.01) within hospitals, indicating they represented local transmission clusters. We also detected several HC2 clusters spanning more than one hospital that by retrospective epidemiological analysis were confirmed to be associated with inter-hospital patient transfers. In contrast, clustering at level HC150 correlated with k-mer-based classification and was largely compatible with PCR ribotyping, thus enabling comparisons to earlier surveillance data. EnteroBase enables contextual interpretation of a growing collection of assembled, quality-controlled C. difficile genome sequences and their associated metadata. Hierarchical clustering rapidly identifies database entries that are related at multiple levels of genetic distance, facilitating communication among researchers, clinicians and public-health officials who are combatting disease caused by C. difficile .


Author(s):  
Annisa Annisa ◽  
Yuda Munarko ◽  
Yufis Azhar

Trending topic is a feature provided by twitter that informs something widely discussed by users in a particular time. The form of a trending topic is a hashtag and can be selected by clicking. However, the number of tweets for each trending topics can be very large, so it will be difficult if we want to know all the contents. So, in order to make easy when reading the topic, a small number of tweets can be selected as the main idea of the topic. In this study, we applied the Agglomerative Single Linkage Hierarchical Clustering by calculating the TF-IDF value for each word in advance. We used 100 trending topics, where each topic consists of 50 tweets in Indonesian. For testing, we provided 30 trending topics which consist of 2 until 9 sub-topics. The result is that each trending topics can be summarized into shorter text contains 2 until 9 tweets. We were able to summarize 1 trending topics exactly same as the topic summarized by human expert. However, the rest of topics corresponded partially with human expert. 


Sign in / Sign up

Export Citation Format

Share Document