CluSim: a Python package for the comparison of clusterings and dendrograms

2018 ◽  
Author(s):  
Alexander J. Gates ◽  
Yong-Yeol Ahn

Summary: Quantifying the similarity of clusterings is a fundamental step in data analysis. Clustering similarity is the basis for method evaluation, consensus clustering, and tracking the temporal evolution of clusters, among many other tasks. Here we provide CluSim, a comprehensive Python package for the comparison of partitions, overlapping clusterings, and hierarchical clusterings (dendrograms) with more than 20 similarity measures. The CluSim package provides both analytic and empirical methods for assessing the similarity of clusterings in the context of a random model, and provides the novel element-centric approaches to clustering similarity that we introduced recently. We illustrate the use of the package through two examples: an evaluation of the clustering of gene expression data in the context of different random models, and a detailed analysis of model incongruence using element-centric comparisons between a set of phylogenetic trees (dendrograms). Availability and implementation: The CluSim Python package and accompanying Jupyter notebook are available at https://github.com/Hoosier-Clusters/clusim with the MIT open source license.


2017 ◽  
Author(s):  
Alexander J. Gates ◽  
Yong-Yeol Ahn

Abstract: Clustering is a central approach for unsupervised learning. After clustering is applied, the most fundamental analysis is to quantitatively compare clusterings. Such comparisons are crucial for the evaluation of clustering methods as well as other tasks such as consensus clustering. It is often argued that, in order to establish a baseline, clustering similarity should be assessed in the context of a random ensemble of clusterings. The prevailing assumption for the random clustering ensemble is the permutation model, in which the number and sizes of clusters are fixed. However, this assumption does not necessarily hold in practice; for example, multiple runs of K-means clustering return clusterings with a fixed number of clusters, while the cluster size distribution varies greatly. Here, we derive corrected variants of two clustering similarity measures (the Rand index and Mutual Information) in the context of two random clustering ensembles in which the number and sizes of clusters vary. In addition, we study the impact of one-sided comparisons in the scenario with a reference clustering. The consequences of different random models are illustrated using synthetic examples, handwriting recognition, and gene expression data. We demonstrate that the choice of random model can have a drastic impact on the ranking of similar clustering pairs and on the evaluation of a clustering method with respect to a random baseline; thus, the choice of random clustering model should be carefully justified.
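
To make the baseline idea concrete, here is a minimal sketch of the Rand index and its standard correction for chance under the permutation model (the adjusted Rand index), the prevailing assumption that the abstract questions. It does not implement the corrected variants derived in the paper, and it is not CluSim code; the function names and the toy label vectors are illustrative assumptions.

```python
from collections import Counter
from math import comb


def pair_counts(labels_a, labels_b):
    """Count element pairs for two clusterings given as label sequences.

    Returns (n_pairs, sum_ab, sum_a, sum_b): all pairs, pairs co-clustered
    in both clusterings, pairs co-clustered in A, pairs co-clustered in B.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("clusterings must cover the same elements")
    if len(labels_a) < 2:
        raise ValueError("need at least two elements to form a pair")
    n_pairs = comb(len(labels_a), 2)
    # Each contingency-table cell holds the elements sharing both labels.
    sum_ab = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    return n_pairs, sum_ab, sum_a, sum_b


def rand_index(labels_a, labels_b):
    """Fraction of pairs on which the two clusterings agree."""
    n_pairs, sum_ab, sum_a, sum_b = pair_counts(labels_a, labels_b)
    # Agreements = pairs together in both + pairs apart in both.
    return (n_pairs + 2 * sum_ab - sum_a - sum_b) / n_pairs


def adjusted_rand_index(labels_a, labels_b):
    """Rand index corrected for chance under the permutation model
    (number and sizes of clusters held fixed)."""
    n_pairs, sum_ab, sum_a, sum_b = pair_counts(labels_a, labels_b)
    expected = sum_a * sum_b / n_pairs  # expected co-clustered pairs
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:  # degenerate case, e.g. one cluster in both
        return 1.0
    return (sum_ab - expected) / (max_index - expected)


if __name__ == "__main__":
    a = [0, 0, 0, 1, 1, 1]  # toy clusterings, not data from the paper
    b = [0, 0, 1, 1, 2, 2]
    print(rand_index(a, b), adjusted_rand_index(a, b))
```

Because the expected value depends on the assumed random ensemble, replacing the permutation model with another model changes `expected` and therefore the corrected score, which is the sensitivity the abstract highlights.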



2012 ◽  
Vol 10 (05) ◽  
pp. 1250011
Author(s):  
NATALIA NOVOSELOVA ◽  
IGOR TOM

Many external and internal validity measures have been proposed to estimate the number of clusters in gene expression data, but as a rule they do not analyze the stability of the groupings produced by a clustering algorithm. Based on an approach that assesses the predictive power or stability of a partitioning, we propose a new measure of cluster validation and a selection procedure to determine a suitable number of clusters. The validity measure is based on an estimate of the "clearness" of the consensus matrix, which is the result of a resampling clustering scheme or consensus clustering. In the proposed selection procedure, the stable clustering result is determined with reference to the validity measure under the null hypothesis that encodes the absence of clusters. The final number of clusters is selected by analyzing the distance between the validity plots for the initial and permuted data sets. We applied the selection procedure to estimate the clustering results on several datasets. The proposed procedure produced an accurate and robust estimate of the number of clusters, in agreement with biological knowledge and gold standards of cluster quality.
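
The consensus matrix mentioned above can be sketched as follows, assuming the common definition in which entry (i, j) is the fraction of resampling runs that place elements i and j in the same cluster, counted only over runs in which both were sampled. The abstract does not define the "clearness" measure, so it is not reproduced here; the function name, the use of `None` for unsampled elements, and the toy runs are illustrative assumptions.

```python
def consensus_matrix(runs):
    """Build a consensus matrix from repeated clustering runs.

    `runs` is a list of label lists of equal length; a label of None means
    the element was not sampled in that run. Entry [i][j] is the fraction
    of runs, among those that sampled both i and j, that co-cluster them.
    """
    if not runs:
        raise ValueError("need at least one clustering run")
    n = len(runs[0])
    together = [[0] * n for _ in range(n)]
    sampled = [[0] * n for _ in range(n)]
    for labels in runs:
        if len(labels) != n:
            raise ValueError("every run must cover the same elements")
        for i in range(n):
            if labels[i] is None:
                continue
            for j in range(n):
                if labels[j] is None:
                    continue
                sampled[i][j] += 1
                if labels[i] == labels[j]:
                    together[i][j] += 1
    # Pairs never sampled together get 0.0 by convention in this sketch.
    return [
        [together[i][j] / sampled[i][j] if sampled[i][j] else 0.0
         for j in range(n)]
        for i in range(n)
    ]


if __name__ == "__main__":
    # Three toy runs over four elements; element 2 is missing from run 3.
    runs = [[0, 0, 1, 1], [0, 0, 0, 1], [1, 1, None, 0]]
    for row in consensus_matrix(runs):
        print(row)
```

Entries near 0 or 1 indicate pairs that are split or grouped consistently across runs, while intermediate values indicate unstable assignments.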



2019 ◽  
Vol 21 (5) ◽  
pp. 1818-1824 ◽  
Author(s):  
Qi Zhao ◽  
Yu Sun ◽  
Zekun Liu ◽  
Hongwan Zhang ◽  
Xingyang Li ◽  
...  

Abstract: Unsupervised clustering of high-throughput gene expression data is widely adopted for cancer subtyping. However, cancer subtypes derived from a single dataset are usually not applicable across multiple datasets from different platforms. Merging different datasets is necessary to determine accurate and applicable cancer subtypes but remains difficult because of the batch effect. CrossICC is an R package designed for the unsupervised clustering of gene expression data from multiple datasets/platforms without requiring batch effect adjustment. CrossICC uses an iterative strategy to derive the optimal gene signature and cluster numbers from a consensus similarity matrix generated by consensus clustering. The package also provides many functions to visualize the identified subtypes and evaluate subtyping performance. We expect that CrossICC can be used to discover robust cancer subtypes with significant translational implications for personalized care of cancer patients. Availability and implementation: The package is implemented in R and available at GitHub (https://github.com/bioinformatist/CrossICC) and Bioconductor (http://bioconductor.org/packages/release/bioc/html/CrossICC.html) under the GPL v3 License.



2019 ◽  
Vol 25 (5) ◽  
pp. 934-950 ◽  
Author(s):  
Nguyen Xuan Thao ◽  
Truong Thi Thuy Duong

The selection of the target market plays a vital role in promoting the marketing strategies of companies. We present a method for target market selection. We introduce some novel similarity measures between intuitionistic fuzzy sets and novel similarity measures between interval-valued intuitionistic fuzzy sets. They are constructed by combining exponential and other functions. Finally, we introduce a multi-criteria decision-making model to select a target market by using the novel similarity measure of interval-valued intuitionistic fuzzy sets.



2020 ◽  
Vol 2020 ◽  
pp. 1-25
Author(s):  
Tahir Mahmood ◽  
Ubaid Ur Rehman ◽  
Zeeshan Ali ◽  
Ronnason Chinram

The complex dual hesitant fuzzy set (CDHFS) is a combination of two modifications, the complex fuzzy set (CFS) and the dual hesitant fuzzy set (DHFS). A CDHFS assigns two degrees, membership and nonmembership, each in the form of a finite subset of the unit disc in the complex plane, and is a capable tool for handling uncertain and unpredictable information in real-life problems. The goal of this study is to describe the notion of CDHFS and its operational laws. The novel complex interval-valued dual hesitant fuzzy set (CIvDHFS) and its fundamental laws are also described and justified with the help of an example. Further, vector similarity measures (VSMs), weighted vector similarity measures (WVSMs), a hybrid vector similarity measure, and a weighted hybrid vector similarity measure are explored. These similarity measures (SMs) are applied to pattern recognition and medical diagnosis to assess their capability and feasibility. We also solve some numerical examples using the established measures. We examine the reliability and validity of the proposed measures by comparing them with some existing measures. The advantages, comparative analysis, and graphical portrayal of the proposed and existing measures are also described in detail.



2014 ◽  
Vol 4 (4) ◽  
pp. 26-33 ◽  
Author(s):  
Selvamani Muthukalathi ◽  
Ravanan Ramanujam ◽  
Anbupalam Thalamuthu

