CluSim: a Python package for the comparison of clusterings and dendrograms

Mapping Intimacies ◽

10.1101/410084 ◽

2018 ◽

Author(s):

Alexander J. Gates ◽

Yong-Yeol Ahn

Keyword(s):

Temporal Evolution ◽

Similarity Measures ◽

Consensus Clustering ◽

The Novel ◽

Expression Data ◽

Empirical Methods ◽

Method Evaluation ◽

Random Models ◽

Python Package ◽

Clustering Similarity

Summary: Quantifying the similarity of clusterings is a fundamental step in data analysis. Clustering similarity is the basis for method evaluation, consensus clustering, and tracking the temporal evolution of clusters, among many other tasks. Here we provide CluSim, a comprehensive Python package for the comparison of partitions, overlapping clusterings, and hierarchical clusterings (dendrograms) with more than 20 similarity measures. The CluSim package provides both analytic and empirical methods for assessing the similarity of clusterings in the context of a random model, and provides the novel element-centric clustering similarity measures that we introduced recently. We illustrate the use of the package through two examples: an evaluation of the clustering of gene expression data in the context of different random models, and a detailed analysis of model incongruence using element-centric comparisons between a set of phylogenetic trees (dendrograms). Availability and implementation: The CluSim Python package and accompanying Jupyter notebook are available at https://github.com/Hoosier-Clusters/clusim under the MIT open source license.

The Impact of Random Models on Clustering Similarity

10.1101/196840 ◽

2017 ◽

Cited By ~ 4

Author(s):

Alexander J. Gates ◽

Yong-Yeol Ahn

Keyword(s):

Handwriting Recognition ◽

Similarity Measures ◽

Fixed Number ◽

Consensus Clustering ◽

Clustering Methods ◽

Clustering Model ◽

Clustering Ensembles ◽

Random Models ◽

The Impact ◽

Clustering Similarity

Abstract: Clustering is a central approach for unsupervised learning. After clustering is applied, the most fundamental analysis is to quantitatively compare clusterings. Such comparisons are crucial for the evaluation of clustering methods as well as other tasks such as consensus clustering. It is often argued that, in order to establish a baseline, clustering similarity should be assessed in the context of a random ensemble of clusterings. The prevailing assumption for the random clustering ensemble is the permutation model, in which the number and sizes of clusters are fixed. However, this assumption does not necessarily hold in practice; for example, multiple runs of K-means clustering return clusterings with a fixed number of clusters, while the cluster size distribution varies greatly. Here, we derive corrected variants of two clustering similarity measures (the Rand index and Mutual Information) in the context of two random clustering ensembles in which the number and sizes of clusters vary. In addition, we study the impact of one-sided comparisons in the scenario with a reference clustering. The consequences of different random models are illustrated using synthetic examples, handwriting recognition, and gene expression data. We demonstrate that the choice of random model can have a drastic impact on the ranking of similar clustering pairs and on the evaluation of a clustering method with respect to a random baseline; thus, the choice of random clustering model should be carefully justified.
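
To make the baseline idea in this abstract concrete, the following is a minimal sketch of the Rand index and its chance-corrected form under the permutation model (the standard adjusted Rand index, where cluster sizes are held fixed). It does not implement the corrected variants for other random models that the paper derives, and the function names are illustrative rather than taken from any package. It assumes both label lists cover the same two or more elements.

```python
from collections import Counter
from math import comb


def pair_counts(labels_a, labels_b):
    """Count element pairs co-clustered in both, in a, and in b, plus all pairs."""
    n = len(labels_a)
    if n != len(labels_b):
        raise ValueError("both clusterings must label the same elements")
    both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    in_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    in_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    return both, in_a, in_b, comb(n, 2)


def rand_index(labels_a, labels_b):
    """Fraction of element pairs on which the two clusterings agree."""
    both, in_a, in_b, total = pair_counts(labels_a, labels_b)
    # Agreements = pairs together in both + pairs apart in both.
    return (total + 2 * both - in_a - in_b) / total


def adjusted_rand_index(labels_a, labels_b):
    """Rand index corrected for chance under the permutation model."""
    both, in_a, in_b, total = pair_counts(labels_a, labels_b)
    expected = in_a * in_b / total   # expected co-clustered pairs, sizes fixed
    max_value = (in_a + in_b) / 2    # upper bound used for normalization
    if max_value == expected:        # degenerate case, e.g. one cluster each
        return 1.0
    return (both - expected) / (max_value - expected)


a = [0, 0, 1, 1]
b = [0, 1, 0, 1]
print(rand_index(a, b), adjusted_rand_index(a, b))
```

Swapping the `expected` term for the expectation under a different random ensemble is where the choice of random model enters; the abstract's point is that this choice can change the resulting ranking.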

ENTROPY-BASED CLUSTER VALIDATION AND ESTIMATION OF THE NUMBER OF CLUSTERS IN GENE EXPRESSION DATA

Journal of Bioinformatics and Computational Biology ◽

10.1142/s0219720012500114 ◽

2012 ◽

Vol 10 (05) ◽

pp. 1250011

Author(s):

NATALIA NOVOSELOVA ◽

IGOR TOM

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Clustering Algorithm ◽

Selection Procedure ◽

Biological Knowledge ◽

Consensus Clustering ◽

Expression Data ◽

Cluster Validation ◽

Number Of Clusters ◽

Validity Measure

Many external and internal validity measures have been proposed to estimate the number of clusters in gene expression data, but as a rule they do not consider the stability of the groupings produced by a clustering algorithm. Based on an approach that assesses the predictive power or stability of a partitioning, we propose a new measure of cluster validation and a selection procedure to determine a suitable number of clusters. The validity measure is based on an estimate of the "clearness" of the consensus matrix, which is the result of a resampling clustering scheme or consensus clustering. According to the proposed selection procedure, the stable clustering result is determined with reference to the validity measure under the null hypothesis encoding the absence of clusters. The final number of clusters is selected by analyzing the distance between the validity plots for the initial and permuted data sets. We applied the selection procedure to estimate the clustering results on several datasets. The proposed procedure produced an accurate and robust estimate of the number of clusters, in agreement with biological knowledge and gold standards of cluster quality.
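
The consensus matrix mentioned in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions: each resampling run is given as a list of cluster labels, with `None` marking an item left out of that resample, and each entry is the fraction of runs containing both items in which they were placed in the same cluster. The paper's "clearness" measure is not defined in the abstract, so it is not reproduced here, and the function name is hypothetical.

```python
def consensus_matrix(runs):
    """Build a co-clustering frequency matrix from several clustering runs.

    runs: list of equal-length label lists; None means the item was not
    part of that resample.
    """
    n = len(runs[0])
    together = [[0] * n for _ in range(n)]  # runs where i and j share a cluster
    sampled = [[0] * n for _ in range(n)]   # runs where i and j both appear
    for labels in runs:
        if len(labels) != n:
            raise ValueError("every run must label the same items")
        for i in range(n):
            if labels[i] is None:
                continue
            for j in range(n):
                if labels[j] is None:
                    continue
                sampled[i][j] += 1
                if labels[i] == labels[j]:
                    together[i][j] += 1
    # Pairs never sampled together get 0.0 by convention in this sketch.
    return [
        [together[i][j] / sampled[i][j] if sampled[i][j] else 0.0 for j in range(n)]
        for i in range(n)
    ]


runs = [
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [1, None, 0, 0],
]
for row in consensus_matrix(runs):
    print(row)
```

Entries near 0 or 1 indicate pairs that are consistently separated or consistently grouped across resamples, while intermediate values indicate unstable assignments.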

CrossICC: iterative consensus clustering of cross-platform gene expression data without adjusting batch effect

Briefings in Bioinformatics ◽

10.1093/bib/bbz116 ◽

2019 ◽

Vol 21 (5) ◽

pp. 1818-1824 ◽

Cited By ~ 1

Author(s):

Qi Zhao ◽

Yu Sun ◽

Zekun Liu ◽

Hongwan Zhang ◽

Xingyang Li ◽

...

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Gene Signature ◽

Unsupervised Clustering ◽

Batch Effect ◽

Consensus Clustering ◽

Expression Data ◽

Personalized Care ◽

Cancer Subtypes ◽

Multiple Datasets

Abstract: Unsupervised clustering of high-throughput gene expression data is widely adopted for cancer subtyping. However, cancer subtypes derived from a single dataset are usually not applicable across multiple datasets from different platforms. Merging different datasets is necessary to determine accurate and applicable cancer subtypes but remains difficult because of the batch effect. CrossICC is an R package designed for the unsupervised clustering of gene expression data from multiple datasets/platforms without the requirement of batch effect adjustment. CrossICC uses an iterative strategy to derive the optimal gene signature and cluster numbers from a consensus similarity matrix generated by consensus clustering. The package also provides functions to visualize the identified subtypes and evaluate subtyping performance. We expect that CrossICC can be used to discover robust cancer subtypes with significant translational implications for personalized care of cancer patients. Availability and implementation: The package is implemented in R and available at GitHub (https://github.com/bioinformatist/CrossICC) and Bioconductor (http://bioconductor.org/packages/release/bioc/html/CrossICC.html) under the GPL v3 License.

Semi-supervised consensus clustering for gene expression data analysis

BioData Mining ◽

10.1186/1756-0381-7-7 ◽

2014 ◽

Vol 7 (1) ◽

Cited By ~ 17

Author(s):

Yunli Wang ◽

Youlian Pan

Keyword(s):

Gene Expression ◽

Data Analysis ◽

Gene Expression Data ◽

Consensus Clustering ◽

Expression Data ◽

Gene Expression Data Analysis

Consensus Clustering for Cancer Gene Expression Data - Large-Scale Analysis using Evidence Accumulation Approach

Proceedings of the 10th International Joint Conference on Biomedical Engineering Systems and Technologies ◽

10.5220/0006174501760183 ◽

2017 ◽

Author(s):

Isidora Šašić ◽

Sanja Brdar ◽

Tatjana Lončar-Turukalo ◽

Helena Aidos ◽

Ana Fred

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Large Scale ◽

Consensus Clustering ◽

Cancer Gene ◽

Expression Data ◽

Scale Analysis ◽

Evidence Accumulation ◽

Large Scale Analysis

SELECTING TARGET MARKET BY SIMILAR MEASURES IN INTERVAL INTUITIONISTIC FUZZY SET

Technological and Economic Development of Economy ◽

10.3846/tede.2019.10290 ◽

2019 ◽

Vol 25 (5) ◽

pp. 934-950 ◽

Cited By ~ 5

Author(s):

Nguyen Xuan Thao ◽

Truong Thi Thuy Duong

Keyword(s):

Fuzzy Sets ◽

Intuitionistic Fuzzy Set ◽

Similarity Measures ◽

Vital Role ◽

Intuitionistic Fuzzy Sets ◽

Target Market ◽

The Novel ◽

Market Selection ◽

Intuitionistic Fuzzy ◽

Decision Making Model

The selection of the target market plays a vital role in promoting the marketing strategies of companies. We present a method for target market selection. We introduce some novel similarity measures between intuitionistic fuzzy sets and between interval-valued intuitionistic fuzzy sets. They are constructed by combining exponential and other functions. Finally, we introduce a multi-criteria decision-making model that selects the target market using the novel similarity measure for interval-valued intuitionistic fuzzy sets.

Jaccard and Dice Similarity Measures Based on Novel Complex Dual Hesitant Fuzzy Sets and Their Applications

Mathematical Problems in Engineering ◽

10.1155/2020/5920432 ◽

2020 ◽

Vol 2020 ◽

pp. 1-25

Author(s):

Tahir Mahmood ◽

Ubaid Ur Rehman ◽

Zeeshan Ali ◽

Ronnason Chinram

Keyword(s):

Similarity Measure ◽

Fuzzy Set ◽

Real Life ◽

Similarity Measures ◽

Hesitant Fuzzy Set ◽

The Novel ◽

Dual Hesitant Fuzzy Set ◽

Life Problems ◽

Novel Approach ◽

Operational Laws

The complex dual hesitant fuzzy set (CDHFS) is a combination of two modifications of the fuzzy set, the complex fuzzy set (CFS) and the dual hesitant fuzzy set (DHFS). A CDHFS has two degrees, membership and nonmembership, each in the form of a finite subset of the unit disc in the complex plane, and is a capable tool for handling uncertain and unpredictable information in real-life problems. The goal of this study is to describe the notion of CDHFS and its operational laws. The novel complex interval-valued dual hesitant fuzzy set (CIvDHFS) and its fundamental laws are also described and justified with the help of an example. Further, the vector similarity measures (VSMs), weighted vector similarity measures (WVSMs), hybrid vector similarity measure, and weighted hybrid vector similarity measure are explored. These similarity measures (SMs) are applied to pattern recognition and medical diagnosis to assess the capability and feasibility of the proposed measures. We also solve some numerical examples using the established measures. We examine the reliability and validity of the proposed measures by comparing them with some existing measures. The advantages, comparative analysis, and graphical representation of the proposed and existing measures are also described in detail.
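
For orientation, the classical Jaccard and Dice vector similarity measures that this line of work builds on can be sketched for plain real-valued vectors. This is only the textbook form, J(x, y) = x·y / (|x|² + |y|² − x·y) and D(x, y) = 2 x·y / (|x|² + |y|²); it is not the CDHFS extension proposed in the paper, whose definitions the abstract does not give. The function names are illustrative.

```python
def _dot(x, y):
    """Dot product of two equal-length real vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(x, y))


def jaccard_similarity(x, y):
    """Classical Jaccard vector similarity: x.y / (|x|^2 + |y|^2 - x.y)."""
    xy = _dot(x, y)
    denom = _dot(x, x) + _dot(y, y) - xy
    if denom == 0:
        raise ValueError("similarity is undefined for two zero vectors")
    return xy / denom


def dice_similarity(x, y):
    """Classical Dice vector similarity: 2 x.y / (|x|^2 + |y|^2)."""
    denom = _dot(x, x) + _dot(y, y)
    if denom == 0:
        raise ValueError("similarity is undefined for two zero vectors")
    return 2 * _dot(x, y) / denom


x = [1.0, 0.0, 1.0]
y = [1.0, 1.0, 0.0]
print(jaccard_similarity(x, y), dice_similarity(x, y))
```

Both measures equal 1 for identical nonzero vectors and 0 for orthogonal ones; a weighted variant would multiply each coordinate's contribution by a weight before summing.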

Association Rule Based Similarity Measures for the Clustering of Gene Expression Data

The Open Medical Informatics Journal ◽

10.2174/1874325001004010063 ◽

2010 ◽

Vol 4 (1) ◽

pp. 63-73 ◽

Cited By ~ 2

Author(s):

Prerna Sethi

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Association Rule ◽

Similarity Measures ◽

Expression Data ◽

Rule Based

pySAPC, a python package for sparse affinity propagation clustering: Application to odontogenesis whole genome time series gene-expression data

Biochimica et Biophysica Acta (BBA) - General Subjects ◽

10.1016/j.bbagen.2016.06.008 ◽

2016 ◽

Vol 1860 (11) ◽

pp. 2613-2618 ◽

Cited By ~ 2

Author(s):

Huojun Cao ◽

Brad A. Amendt

Keyword(s):

Gene Expression ◽

Time Series ◽

Gene Expression Data ◽

Affinity Propagation ◽

Whole Genome ◽

Expression Data ◽

Affinity Propagation Clustering ◽

Time Series Gene Expression ◽

Python Package

Consensus Clustering for Microarray Gene Expression Data

Bonfring International Journal of Data Mining ◽

10.9756/bijdm.6140 ◽

2014 ◽

Vol 4 (4) ◽

pp. 26-33 ◽

Cited By ~ 2

Author(s):

Selvamani Muthukalathi ◽

Ravanan Ramanujam ◽

Anbupalam Thalamuthu

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Microarray Gene Expression Data ◽

Consensus Clustering ◽

Expression Data ◽

Microarray Gene Expression ◽

Microarray Gene