Conventional displays of structures in data compared with interactive projection-based clustering (IPBC)

Author(s):  
Michael C. Thrun ◽  
Felix Pape ◽  
Alfred Ultsch

Abstract
Clustering is an important task in knowledge discovery, with the goal of identifying structures of similar data points in a dataset. Here, the focus lies on methods that use a human-in-the-loop, i.e., that incorporate user decisions into the clustering process through 2D and 3D displays of the structures in the data. Some of these interactive approaches fall into the category of visual analytics and emphasize the power of such displays to identify structures interactively in various types of datasets or to verify the results of clustering algorithms. This work presents a new method called interactive projection-based clustering (IPBC). IPBC is an open-source, parameter-free method that uses a human-in-the-loop for an interactive 2.5D display and identification of structures in data, based on the user’s choice of a dimensionality reduction method. The IPBC approach is systematically compared with accessible visual analytics methods for the display and identification of cluster structures using twelve clustering benchmark datasets and one additional natural dataset. Qualitative comparison of 2D, 2.5D and 3D displays of structures and empirical evaluation of the identified cluster structures show that IPBC outperforms comparable methods. Additionally, IPBC assists in identifying structures previously unknown to domain experts in an application.

Author(s):  
Shapol M. Mohammed ◽  
Karwan Jacksi ◽  
Subhi R. M. Zeebaree

Semantic similarity is the process of identifying semantically relevant data. The traditional way of identifying document similarity relies on synonymous keywords and syntax, whereas semantic similarity finds similar data using the meaning of words. Clustering groups objects that share the same features and properties into a cluster, separate from objects with different features and properties. In semantic document clustering, documents are clustered using semantic similarity techniques with similarity measurements. One common technique for clustering documents is the family of density-based clustering algorithms, which use the density of data points as the main strategy for measuring the similarity between them. In this paper, a state-of-the-art survey is presented that analyzes density-based algorithms for clustering documents. Furthermore, the similarity and evaluation measures used with the selected algorithms are investigated to identify the most common ones. The review reveals that the most widely used density-based algorithms in document clustering are DBSCAN and DPC, and that the most effective similarity measure used with them is cosine similarity, combined with the F-measure for performance and accuracy evaluation.
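As a rough illustration of the surveyed combination, the following is a minimal, self-contained sketch of DBSCAN using cosine distance (1 − cosine similarity) on toy term-frequency vectors. The function names, thresholds, and data are illustrative assumptions, not taken from any surveyed implementation.

```python
import math

def cosine_sim(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def dbscan_cosine(vectors, eps=0.3, min_pts=2):
    """DBSCAN where distance = 1 - cosine similarity; returns labels (-1 = noise)."""
    n = len(vectors)
    labels = [None] * n
    cluster = -1

    def neighbors(i):
        return [j for j in range(n)
                if 1.0 - cosine_sim(vectors[i], vectors[j]) <= eps]

    for i in range(n):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = -1          # provisionally noise
            continue
        cluster += 1
        labels[i] = cluster
        seeds = list(nbrs)
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: claim, but do not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_pts:
                seeds.extend(j_nbrs)  # core point: expand the cluster
    return labels

# Toy term-frequency vectors: two topical groups
docs = [[1, 1, 0], [2, 2, 0], [1, 0.9, 0], [0, 0, 1], [0, 0, 2]]
labels = dbscan_cosine(docs, eps=0.1, min_pts=2)  # → [0, 0, 0, 1, 1]
```

Cosine distance ignores vector magnitude, which is why the scaled vector `[2, 2, 0]` lands in the same cluster as `[1, 1, 0]`; that property is what makes cosine similarity a natural fit for documents of differing lengths.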


2021 ◽  
Author(s):  
Mohammadreza Sadeghi ◽  
Narges Armanfard

Deep clustering incorporates embedding into clustering to find a lower-dimensional space appropriate for clustering. In this paper we propose a novel deep clustering framework with self-supervision using pairwise data similarities (DCSS). The proposed method consists of two successive phases. In the first phase, we form hypersphere-like groups of similar data points, i.e. one hypersphere per cluster, employing an autoencoder that is trained using cluster-specific losses. The hyperspheres are formed in the autoencoder’s latent space. In the second phase, we employ pairwise data similarities to create a K-dimensional space that is capable of accommodating more complex cluster distributions, hence providing more accurate clustering performance, where K is the number of clusters. The autoencoder’s latent space obtained in the first phase is used as the input of the second phase. The effectiveness of both phases is demonstrated on seven benchmark datasets through a rigorous set of experiments.


2021 ◽  
Author(s):  
Mohammadreza Sadeghi ◽  
Narges Armanfard

Deep clustering incorporates embedding into clustering to find a lower-dimensional space appropriate for clustering. Most of the existing methods try to group similar data points through simultaneously minimizing clustering and reconstruction losses, employing an autoencoder (AE). However, they all ignore the relevant useful information available within pairwise data relationships. In this paper we propose a novel deep clustering framework with self-supervision using pairwise data similarities (DCSS). The proposed method consists of two successive phases. First, we propose a novel AE-based approach that aims to aggregate similar data points near a common group center in the latent space of an AE. The AE's latent space is obtained by minimizing weighted reconstruction and centering losses of data points, where weights are defined based on similarity of data points and group centers. In the second phase, we map the AE's latent space, using a fully connected network MNet, onto a K-dimensional space used to derive the final data cluster assignments, where K is the number of clusters. MNet is trained to strengthen (weaken) similarity of similar (dissimilar) samples. Experimental results on multiple benchmark datasets demonstrate the effectiveness of DCSS for data clustering and as a general framework for boosting up state-of-the-art clustering methods.
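The first-phase idea of combining weighted reconstruction and centering losses can be illustrated with a small numeric sketch. The softmax soft-assignment weighting and the exact loss form below are simplifying assumptions for illustration only, not the paper's precise formulation.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def dcss_phase1_loss(x, x_hat, z, centers):
    """Illustrative weighted reconstruction + centering loss.
    x: inputs, x_hat: AE reconstructions, z: latent codes, centers: group centers.
    Weights favour the center each latent point is closest to (soft assignment)."""
    total = 0.0
    for xi, xhi, zi in zip(x, x_hat, z):
        # Soft similarity of z_i to each group center (higher when closer)
        w = softmax([-sq_dist(zi, mu) for mu in centers])
        recon = sq_dist(xi, xhi)
        for wk, mu in zip(w, centers):
            total += wk * (recon + sq_dist(zi, mu))
    return total / len(x)
```

When latent codes sit on their group centers and reconstruction is perfect, the loss is near zero; degrading either the reconstruction or the centering drives it up, which is the gradient signal an AE trained this way would follow.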


2021 ◽  
Author(s):  
Qianwen Wang ◽  
Tali Mazor ◽  
Theresa A Harbig ◽  
Ethan Cerami ◽  
Nils Gehlenborg

A growing number of longitudinal cohort studies are generating data with extensive patient observations across multiple timepoints. Such data offer promising opportunities to better understand the progression of diseases. However, most existing visual analysis tools for health records are aimed at general event sequences, and little attention has been paid to common types of clinical data that contain extensive observations. To fill this gap, we designed and implemented ThreadStates, an interactive visual analytics tool for the exploration of longitudinal patient cohort data. The focus of ThreadStates is to identify the states of disease progression by learning from observation data in a human-in-the-loop manner. We propose a novel matrix+glyph design and combine it with a scatter plot to enable seamless identification, observation, and refinement of states. The disease progression patterns are then revealed in terms of state transitions using Sankey-based visualizations. We employ sequence clustering techniques to find patient groups with distinctive progression patterns and to reveal the association between disease progression and patient-level features. The design and development were driven by a requirement analysis and iteratively refined based on feedback from domain experts over the course of a 10-month design study. Case studies and expert interviews demonstrate that ThreadStates can successfully summarize disease states, reveal disease progression, and compare patient groups.


2021 ◽  
Author(s):  
R.S.M. Lakshmi Patibandla ◽  
Veeranjaneyulu N

Data clustering is the process of grouping similar data items. Partition-based algorithms split a dataset into groups based on the resemblance among items within each group; their key idea is to divide the data points into partitions, each of which represents one cluster. The quality of a partition depends on the chosen objective function. Evolutionary algorithms, inspired by the evolution of social behaviour, provide optimum solutions for large optimization problems. In this paper, a survey of various partitioning and evolutionary algorithms is presented; the algorithms can be implemented on a benchmark dataset, and validation criteria such as Root-Mean-Square Standard Deviation (RMSSTD), R-square, and SSD are proposed for evaluating algorithms such as Leader, ISODATA, SGO, and PSO.
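Of the validation criteria mentioned, RMSSTD is straightforward to compute directly. The sketch below uses one common pooled-degrees-of-freedom definition (within-cluster sum of squares over pooled degrees of freedom across all attributes); it is an illustrative assumption, since the survey does not fix a specific formula.

```python
import math

def rmsstd(clusters):
    """Root-Mean-Square Standard Deviation of a partition.
    clusters: list of clusters, each a list of equal-length numeric points.
    Lower values indicate more compact (homogeneous) clusters."""
    dims = len(clusters[0][0])
    ssw = 0.0          # pooled within-cluster sum of squared deviations
    dof = 0            # pooled degrees of freedom
    for points in clusters:
        n = len(points)
        for d in range(dims):
            mean = sum(p[d] for p in points) / n
            ssw += sum((p[d] - mean) ** 2 for p in points)
        dof += (n - 1) * dims
    return math.sqrt(ssw / dof) if dof else 0.0

# Two tight, well-separated clusters yield a small RMSSTD
partition = [[[0.0, 0.0], [0.0, 1.0]], [[10.0, 10.0], [10.0, 11.0]]]
score = rmsstd(partition)  # → 0.5
```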



Author(s):  
Givanna H Putri ◽  
Irena Koprinska ◽  
Thomas M Ashhurst ◽  
Nicholas J C King ◽  
Mark N Read

Abstract
Motivation: Many ‘automated gating’ algorithms now exist to cluster cytometry and single-cell sequencing data into discrete populations. Comparative algorithm evaluations on benchmark datasets rely either on a single performance metric, or a few metrics considered independently of one another. However, single metrics emphasize different aspects of clustering performance and do not rank clustering solutions in the same order. This underlies the lack of consensus between comparative studies regarding optimal clustering algorithms and undermines the translatability of results onto other non-benchmark datasets.
Results: We propose the Pareto fronts framework as an integrative evaluation protocol, wherein individual metrics are instead leveraged as complementary perspectives. Judged superior are algorithms that provide the best trade-off between the multiple metrics considered simultaneously. This yields a more comprehensive and complete view of clustering performance. Moreover, by broadly and systematically sampling algorithm parameter values using the Latin Hypercube sampling method, our evaluation protocol minimizes (un)fortunate parameter value selections as confounding factors. Furthermore, it reveals how meticulously each algorithm must be tuned in order to obtain good results, vital knowledge for users with novel data. We exemplify the protocol by conducting a comparative study between three clustering algorithms (ChronoClust, FlowSOM and Phenograph) using four common performance metrics applied across four cytometry benchmark datasets. To our knowledge, this is the first time Pareto fronts have been used to evaluate the performance of clustering algorithms in any application domain.
Availability and implementation: Implementation of our Pareto front methodology and all scripts and datasets to reproduce this article are available at https://github.com/ghar1821/ParetoBench.
Supplementary information: Supplementary data are available at Bioinformatics online.
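The core of the Pareto fronts framework, identifying the non-dominated set of solutions, is compact enough to sketch. The brute-force implementation below assumes higher metric values are better; the example solutions and names are hypothetical, not results from the study.

```python
def dominates(a, b):
    """a dominates b when a is at least as good on every metric
    and strictly better on at least one (higher = better here)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(solutions):
    """Return names of non-dominated (name, metric-tuple) solutions."""
    front = []
    for name, metrics in solutions:
        if not any(dominates(other, metrics)
                   for _, other in solutions if other != metrics):
            front.append(name)
    return front

# Hypothetical (accuracy-like, stability-like) scores for three runs:
runs = [("A", (0.9, 0.5)), ("B", (0.6, 0.8)), ("C", (0.5, 0.4))]
best = pareto_front(runs)  # → ["A", "B"]  (C is dominated by A)
```

A and B trade off the two metrics against each other, so both survive; a single-metric ranking would have discarded one of them, which is exactly the failure mode the framework addresses.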


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Baicheng Lyu ◽  
Wenhua Wu ◽  
Zhiqiang Hu

Abstract
With the wide application of cluster analysis, the number of clusters is gradually increasing, as is the difficulty in selecting judgment indicators for the number of clusters. Also, small clusters are crucial to discovering the extreme characteristics of data samples, but current clustering algorithms focus mainly on analyzing large clusters. In this paper, a bidirectional clustering algorithm based on local density (BCALoD) is proposed. BCALoD establishes the connection between data points based on local density, can automatically determine the number of clusters, is more sensitive to small clusters, and reduces the number of adjustable parameters to a minimum. On the basis of the robustness of the cluster number to noise, a denoising method suitable for BCALoD is proposed. A different cutoff distance and cutoff density are assigned to each data cluster, which results in improved clustering performance. The clustering ability of BCALoD is verified on randomly generated datasets and city light satellite images.
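The local density on which BCALoD builds can be illustrated with the common cutoff-kernel definition from density-peaks-style clustering: the density of a point is the number of other points within a cutoff distance. This is a generic sketch, not necessarily BCALoD's exact per-cluster formulation.

```python
def local_density(points, d_c):
    """Cutoff-kernel local density: rho_i = count of other points within d_c."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return [sum(1 for q in points if q is not p and dist(p, q) < d_c)
            for p in points]

# Three nearby points form a dense region; the fourth is isolated
pts = [[0.0, 0.0], [0.5, 0.0], [0.6, 0.0], [10.0, 10.0]]
rho = local_density(pts, d_c=1.0)  # → [2, 2, 2, 0]
```

Making the cutoff distance `d_c` vary per cluster, as the abstract describes, lets small dense clusters register high densities without being swamped by the scale of larger ones.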


2019 ◽  
Vol 19 (1) ◽  
pp. 3-23
Author(s):  
Aurea Soriano-Vargas ◽  
Bernd Hamann ◽  
Maria Cristina F de Oliveira

We present an integrated interactive framework for the visual analysis of time-varying multivariate data sets. As part of our research, we performed in-depth studies concerning the applicability of visualization techniques to obtain valuable insights. We consolidated the considered analysis and visualization methods in one framework, called TV-MV Analytics. TV-MV Analytics effectively combines visualization and data mining algorithms providing the following capabilities: (1) visual exploration of multivariate data at different temporal scales, and (2) a hierarchical small multiples visualization combined with interactive clustering and multidimensional projection to detect temporal relationships in the data. We demonstrate the value of our framework for specific scenarios, by studying three use cases that were validated and discussed with domain experts.


Author(s):  
Darko Pevec ◽  
Zoran Bosnic ◽  
Igor Kononenko

Current machine learning algorithms perform well in many problem domains, but in risk-sensitive decision making – for example, in medicine and finance – experts do not rely on common evaluation methods that provide overall assessments of models because such techniques do not provide any information about single predictions. This chapter summarizes the research areas that have motivated the development of various approaches to individual prediction reliability. Based on these motivations, the authors describe six approaches to reliability estimation: inverse transduction, local sensitivity analysis, bagging variance, local cross-validation, local error modelling, and density-based estimation. Empirical evaluation of the benchmark datasets provides promising results, especially for use with decision and regression trees. The testing results also reveal that the reliability estimators exhibit different performance levels when used with different models and in different domains. The authors show the usefulness of individual prediction reliability estimates in attempts to predict breast cancer recurrence. In this context, estimating prediction reliability for individual predictions is of crucial importance for physicians seeking to validate predictions derived using classification and regression models.
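Of the six approaches, bagging variance lends itself to a short model-agnostic sketch: retrain on bootstrap resamples and use the spread of the resulting predictions for a single query as its reliability estimate. The helper names and the toy mean predictor below are illustrative assumptions, not the chapter's implementation.

```python
import random
import statistics

def bagging_variance(train, x_query, fit_predict, n_bags=30, seed=0):
    """Reliability of one prediction via bootstrap-ensemble variance.
    train: list of (x, y) pairs; fit_predict(bag, x_query) trains on a
    resample and predicts for x_query. High variance across resamples
    signals an unreliable individual prediction."""
    rng = random.Random(seed)
    preds = []
    for _ in range(n_bags):
        # Bootstrap resample: draw len(train) examples with replacement
        bag = [train[rng.randrange(len(train))] for _ in train]
        preds.append(fit_predict(bag, x_query))
    return statistics.pvariance(preds)

# Toy predictor: mean of the bag's targets, ignoring x_query
mean_pred = lambda bag, x_query: sum(y for _, y in bag) / len(bag)
```

Because the estimate only needs a fit-and-predict routine, it applies equally to the decision and regression trees highlighted in the chapter: a query whose prediction flips as the training sample wobbles is one a physician should treat with caution.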

