On the Persistence of Clustering Solutions and True Number of Clusters in a Dataset

Author(s):  
Amber Srivastava ◽  
Mayank Baranwal ◽  
Srinivasa Salapaka

Typically, clustering algorithms provide clustering solutions with a prespecified number of clusters. The lack of a priori knowledge of the true number of underlying clusters in the dataset makes it important to have a metric for comparing clustering solutions with different numbers of clusters. This article quantifies a notion of persistence of clustering solutions that enables comparing solutions with different numbers of clusters. The persistence relates to the range of data-resolution scales over which a clustering solution persists; it is quantified in terms of the maximum over the two-norms of all the associated cluster-covariance matrices. Thus, we associate a persistence value with each element in a set of clustering solutions with different numbers of clusters. We show that for datasets where natural clusters are a priori known, the clustering solutions that identify the natural clusters are the most persistent; in this way, the notion can be used to identify solutions with the true number of clusters. Detailed experiments on a variety of standard and synthetic datasets demonstrate that the proposed persistence-based indicator outperforms existing approaches, such as the gap-statistic method, X-means, G-means, PG-means, dip-means, and the information-theoretic method, in accurately identifying clustering solutions with the true number of clusters. Interestingly, our method can be explained in terms of the phase-transition phenomenon in the deterministic annealing algorithm, where the number of distinct cluster centers changes (bifurcates) with respect to an annealing parameter.
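The persistence indicator described above can be sketched numerically. A minimal illustration, assuming Euclidean data, hard cluster labels, and the two-norm taken as the spectral norm of each sample cluster-covariance matrix (the paper's exact formulation may differ):

```python
import numpy as np

def persistence_indicator(X, labels):
    """Maximum two-norm (largest singular value) over the sample covariance
    matrices of all clusters in a given labeling. Smaller values indicate
    tighter clusters. A sketch of the quantity the abstract describes."""
    norms = []
    for c in np.unique(labels):
        members = X[labels == c]
        cov = np.atleast_2d(np.cov(members, rowvar=False))
        norms.append(np.linalg.norm(cov, 2))  # spectral norm of covariance
    return max(norms)
```

Comparing this value across clustering solutions with different numbers of clusters gives a per-solution quantity of the kind the abstract refers to; a solution that recovers tight natural clusters yields a much smaller maximum covariance norm than one that lumps them together.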

2011 ◽  
Vol 2011 ◽  
pp. 1-10
Author(s):  
Julio A. Di Rienzo ◽  
Silvia G. Valdano ◽  
Paula Fernández

The most commonly applied strategies for identifying genes with a common response profile are based on clustering algorithms. These methods have no explicit rules to define the appropriate number of groups of genes. Usually, the number of clusters is decided on the basis of heuristic criteria or through the application of different methods proposed to assess the number of clusters in a data set. The purpose of this paper is to compare the performance of seven of these techniques, including both traditional and recently proposed ones. All of them produce underestimations of the true number of clusters. However, within this limitation, the gDGC algorithm appears to be the best. It is the only one that explicitly states a rule for cutting a dendrogram on the basis of a hypothesis-testing framework, allowing the user to calibrate the sensitivity by adjusting the significance level.


2016 ◽  
Vol 25 (06) ◽  
pp. 1650031 ◽  
Author(s):  
Georgios Drakopoulos ◽  
Panagiotis Gourgaris ◽  
Andreas Kanavos ◽  
Christos Makris

k-Means is among the most significant clustering algorithms for vectors chosen from an underlying space S. Its applications span a broad range of fields including machine learning, image and signal processing, and Web mining. Since the introduction of k-Means, two of its major design parameters have remained open to research. The first is the number of clusters to be formed and the second is the initial vectors. The latter is also inherently related to selecting a density measure for S. This article presents a two-step framework for estimating both parameters. First, the underlying vector space is represented as a fuzzy graph. Afterwards, two algorithms for partitioning a fuzzy graph into non-overlapping communities, namely Fuzzy Walktrap and Fuzzy Newman-Girvan, are executed. The former is a low-complexity evolving heuristic, whereas the latter is deterministic and combines a graph communication metric with an exhaustive search principle. Once communities are discovered, their number is taken as an estimate of the true number of clusters. The initial centroids or seeds are subsequently selected based on the density of S. The proposed framework is modular, thus allowing more initialization schemes to be derived. The secondary contributions of this article are HI, a similarity metric for vectors with numerical and categorical entries and the assessment of its stochastic behavior, and TD, a metric for assessing cluster confusion. The aforementioned framework has been implemented mainly in C# and partially in C++, and its performance in terms of efficiency, accuracy, and cluster confusion was experimentally assessed. Post-processing results conducted with MATLAB indicate that the evolving community discovery algorithm approaches the performance of its deterministic counterpart with considerably less complexity.
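A toy version of the two-step idea (estimate the number of clusters from graph communities, then seed by density) can be sketched as follows. Here connected components of an ε-neighbourhood graph stand in for the Fuzzy Walktrap / Fuzzy Newman-Girvan community discovery, and the choice of ε, the density measure, and the seeding rule are all illustrative assumptions:

```python
import numpy as np

def estimate_k_and_seeds(X, eps):
    """Estimate k as the number of communities of a proximity graph and
    seed each community with its densest member. Connected components are
    a simple stand-in for the fuzzy community-discovery algorithms."""
    n = len(X)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    A = D < eps                      # epsilon-neighbourhood adjacency
    comp = -np.ones(n, dtype=int)    # component label per point
    k = 0
    for s in range(n):
        if comp[s] >= 0:
            continue
        stack, comp[s] = [s], k      # depth-first flood fill
        while stack:
            u = stack.pop()
            for v in np.nonzero(A[u])[0]:
                if comp[v] < 0:
                    comp[v] = k
                    stack.append(v)
        k += 1
    seeds = []
    for c in range(k):               # densest member = most eps-neighbours
        members = np.nonzero(comp == c)[0]
        density = A[members].sum(axis=1)
        seeds.append(X[members[np.argmax(density)]])
    return k, np.array(seeds)
```

The returned k and seed vectors would then initialize an ordinary k-means run.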


2017 ◽  
Vol 7 (1) ◽  
pp. 15 ◽  
Author(s):  
Suneel Kumar Kingrani ◽  
Mark Levene ◽  
Dell Zhang

It is an important and challenging problem in unsupervised learning to estimate the number of clusters in a dataset. Knowing the number of clusters is a prerequisite for many commonly used clustering algorithms such as k-means. In this paper, we propose a novel diversity-based approach to this problem. Specifically, we show that the difference between the global diversity of clusters and the sum of each cluster’s local diversity over its members can be used as an effective indicator of the optimality of the number of clusters, where the diversity is measured by Rao’s quadratic entropy. A notable advantage of our proposed method is that it encourages balanced clustering by taking into account both the sizes of clusters and the distances between clusters. In other words, it is less prone to very small “outlier” clusters than existing methods. Our extensive experiments on both synthetic and real-world datasets (with known ground-truth clustering) have demonstrated that our proposed method is robust for clusters of different sizes, variances, and shapes, and it is more accurate than existing methods (including elbow, Caliński-Harabasz, silhouette, and gap-statistic) in terms of finding the optimal number of clusters.
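One plausible reading of the diversity indicator can be sketched with Rao's quadratic entropy Q(p, D) = Σᵢⱼ pᵢpⱼdᵢⱼ. Here the global diversity is computed over cluster centroids weighted by cluster proportions, and the local diversities over each cluster's members with uniform weights; both choices are illustrative assumptions, not necessarily the authors' exact construction:

```python
import numpy as np

def rao_q(points, weights):
    """Rao's quadratic entropy: sum_ij w_i w_j d(i, j) with Euclidean d."""
    D = np.linalg.norm(points[:, None] - points[None, :], axis=2)
    return weights @ D @ weights

def diversity_gap(X, labels):
    """Global diversity of clusters minus the sum of local diversities,
    a sketch of the indicator described in the abstract."""
    ids = np.unique(labels)
    n = len(X)
    centroids = np.array([X[labels == c].mean(axis=0) for c in ids])
    sizes = np.array([np.sum(labels == c) for c in ids])
    global_div = rao_q(centroids, sizes / n)   # diversity between clusters
    local = 0.0
    for c in ids:                              # diversity within each cluster
        members = X[labels == c]
        local += rao_q(members, np.full(len(members), 1.0 / len(members)))
    return global_div - local
```

Under this reading, a partition that matches the natural clusters keeps the local terms small while the centroid term stays large, so the gap peaks at the true number of clusters.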


2011 ◽  
Vol 2 (4) ◽  
pp. 1-13 ◽  
Author(s):  
Derrick S. Boone

The accuracy of “stopping rules” for determining the number of clusters in a data set is examined as a function of the underlying clustering algorithm being used. Using a Monte Carlo study, various stopping rules, used in conjunction with six clustering algorithms, are compared to determine which rule/algorithm combinations best recover the true number of clusters. The rules and algorithms are tested using disparately sized, artificially generated data sets that contained multiple numbers and levels of clusters, variables, noise, outliers, and elongated and unequally sized clusters. The results indicate that stopping rule accuracy depends on the underlying clustering algorithm being used. The cubic clustering criterion (CCC), when used in conjunction with mixture models or Ward’s method, recovers the true number of clusters more accurately than other rules and algorithms. However, the CCC was more likely than other stopping rules to report more clusters than are actually present. Implications are discussed.


2018 ◽  
Author(s):  
Sara Mahallati ◽  
James C. Bezdek ◽  
Milos R. Popovic ◽  
Taufik A. Valiante

Sorting spikes from extracellular recordings into clusters associated with distinct single units (putative neurons) is a fundamental step in analyzing neuronal populations. Such spike sorting is intrinsically unsupervised, as the number of neurons is not known a priori. Therefore, any spike sorting is an unsupervised learning problem that requires one of two approaches: specification of a fixed value c for the number of clusters to seek, or generation of candidate partitions for several possible values of c, followed by selection of the best candidate based on various post-clustering validation criteria. In this paper, we investigate the first approach and evaluate the utility of several methods for providing lower-dimensional visualization of the cluster structure and for subsequent spike clustering. We also introduce a visualization technique called improved visual assessment of cluster tendency (iVAT) to estimate possible cluster structures in data without the need for dimensionality reduction. Experiments are conducted on two datasets with ground-truth labels. In data with a relatively small number of clusters, iVAT is beneficial in estimating the number of clusters to inform the initialization of clustering algorithms. With larger numbers of clusters, iVAT gives a useful estimate of the coarse cluster structure but sometimes fails to indicate the presumptive number of clusters. We show that noise associated with recording extracellular neuronal potentials can disrupt computational clustering schemes, highlighting the benefit of probabilistic clustering models. Our results show that t-Distributed Stochastic Neighbor Embedding (t-SNE) provides representations of the data that yield more accurate visualization of potential cluster structure to inform the clustering stage. Moreover, the clusters obtained using t-SNE features were more reliable than those obtained using the other methods, which indicates that t-SNE can potentially be used both for visualization and to extract features to be used by any clustering algorithm.
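The iVAT idea, reordering a dissimilarity matrix so that cluster structure shows up as dark diagonal blocks after each distance is replaced by a minimax path distance, can be sketched as follows. This is a simplified O(n³) version; the published algorithm uses a more efficient MST-based recursion:

```python
import numpy as np

def vat_order(D):
    """Prim-like VAT ordering: start from a point on the largest distance,
    then repeatedly append the unvisited point nearest to the visited set."""
    n = len(D)
    order = [int(np.argmax(D.max(axis=1)))]
    rest = set(range(n)) - set(order)
    while rest:
        cols = sorted(rest)
        sub = D[np.ix_(order, cols)]
        j = cols[int(np.argmin(sub.min(axis=0)))]
        order.append(j)
        rest.remove(j)
    return order

def ivat(D):
    """iVAT: replace each distance by the minimax path distance (smallest
    possible largest edge on any path), then apply the VAT reordering."""
    n = len(D)
    G = D.copy()
    for k in range(n):  # Floyd-Warshall variant on the (max, min) semiring
        G = np.minimum(G, np.maximum(G[:, k][:, None], G[k, :][None, :]))
    order = vat_order(G)
    return G[np.ix_(order, order)]
```

On well-separated data, the reordered matrix is nearly block-diagonal: within-cluster entries are small and every cross-cluster entry collapses to the single bridging gap, which is what makes the block count a visual estimate of the number of clusters.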


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Baicheng Lyu ◽  
Wenhua Wu ◽  
Zhiqiang Hu

With the wide application of cluster analysis, the number of clusters is gradually increasing, as is the difficulty in selecting judgment indicators for the number of clusters. Also, small clusters are crucial to discovering the extreme characteristics of data samples, but current clustering algorithms focus mainly on analyzing large clusters. In this paper, a bidirectional clustering algorithm based on local density (BCALoD) is proposed. BCALoD establishes the connection between data points based on local density, can automatically determine the number of clusters, is more sensitive to small clusters, and reduces the adjustable parameters to a minimum. On the basis of the robustness of the cluster number to noise, a denoising method suitable for BCALoD is proposed. Different cutoff distances and cutoff densities are assigned to each data cluster, which results in improved clustering performance. The clustering ability of BCALoD is verified on randomly generated datasets and city-light satellite images.
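The local-density ingredient of such algorithms can be illustrated with a density-peaks-style sketch: each point links to its nearest denser neighbour within the cutoff distance, and the density peaks that remain unlinked become cluster roots, so the number of clusters emerges automatically. This is a simplified illustration of the idea, not the BCALoD algorithm itself:

```python
import numpy as np

def density_link_clusters(X, d_c):
    """Cutoff density rho_i = number of neighbours within d_c; each point
    links to its nearest neighbour of higher density (ties broken by index)
    within d_c. Unlinked density peaks become cluster roots."""
    n = len(X)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    rho = (D < d_c).sum(axis=1) - 1          # cutoff density, excluding self
    idx = np.arange(n)
    parent = np.arange(n)
    for i in range(n):
        # candidates: strictly denser, or equally dense with a smaller index
        higher = (rho > rho[i]) | ((rho == rho[i]) & (idx < i))
        cand = np.nonzero(higher & (D[i] < d_c))[0]
        if len(cand):
            parent[i] = cand[np.argmin(D[i, cand])]
    def root(i):                              # follow links up to the peak
        while parent[i] != i:
            i = parent[i]
        return int(i)
    return np.array([root(i) for i in range(n)])
```

Each distinct root label corresponds to one discovered cluster, which is why no cluster count needs to be specified in advance.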


2021 ◽  
Author(s):  
Shikha Suman ◽  
Ashutosh Karna ◽  
Karina Gibert

Hierarchical clustering is one of the most preferred choices for understanding the underlying structure of a dataset and defining typologies, with multiple applications in real life. Among existing clustering algorithms, the hierarchical family is one of the most popular, as it makes it possible to understand the inner structure of the dataset and yields the number of clusters as an output, unlike popular methods such as k-means. One can adjust the granularity of the final clustering to the goals of the analysis. The number of clusters in a hierarchical method relies on the analysis of the resulting dendrogram itself. Experts have criteria to visually inspect the dendrogram and determine the number of clusters, and finding automatic criteria to imitate experts in this task is still an open problem. Dependence on the expert to cut the tree represents a limitation in real applications such as Industry 4.0 and additive manufacturing. This paper analyses several cluster validity indexes in the context of determining the suitable number of clusters in hierarchical clustering. A new Cluster Validity Index (CVI) is proposed that properly captures the implicit criteria used by experts when analyzing dendrograms. The proposal has been applied to a range of datasets and validated against experts' ground truth, outperforming the state of the art while also significantly reducing the computational cost.
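The general recipe, cutting the hierarchy at each candidate number of clusters and scoring the resulting partitions with a cluster validity index, can be sketched with a naive average-linkage agglomeration and a Calinski-Harabasz-style index standing in for the paper's proposed CVI (both stand-ins are assumptions):

```python
import numpy as np

def agglomerative_labels(X, k):
    """Naive average-linkage agglomeration down to k clusters (O(n^3)-ish,
    for illustration only)."""
    clusters = [[i] for i in range(len(X))]
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    while len(clusters) > k:
        best = (0, 1, np.inf)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = D[np.ix_(clusters[a], clusters[b])].mean()
                if d < best[2]:
                    best = (a, b, d)
        a, b, _ = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    labels = np.empty(len(X), dtype=int)
    for c, members in enumerate(clusters):
        labels[members] = c
    return labels

def ch_index(X, labels):
    """Calinski-Harabasz-style ratio of between- to within-cluster scatter;
    requires k >= 2."""
    k, n = labels.max() + 1, len(X)
    mean = X.mean(axis=0)
    B = sum(np.sum(labels == c) * np.sum((X[labels == c].mean(axis=0) - mean) ** 2)
            for c in range(k))
    W = sum(np.sum((X[labels == c] - X[labels == c].mean(axis=0)) ** 2)
            for c in range(k))
    return (B / (k - 1)) / (W / (n - k))
```

Scanning k over a candidate range and keeping the partition that maximizes the index is the automatic counterpart of an expert eyeballing where to cut the dendrogram.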


2021 ◽  
Author(s):  
Nalini Arasavali ◽  
Sasibhushanarao Gottapu

The Kalman filter (KF) is a widely used navigation algorithm, especially for precise positioning applications. However, the exact filter parameters must be defined a priori for a standard Kalman filter to achieve low error values. For the dynamic system model, the covariance of the process noise is entirely undefined a priori, which creates difficulties and challenges in the implementation of the conventional Kalman filter. A modified Kalman filter, MKF-RCE (Kalman filter with recursive covariance estimation), has been applied to solve these issues and can also be used in many other applications involving Kalman filtering. While this is a better approach, a KF with SAR-tuned covariance has been proposed to resolve the estimation problem for the dynamic model. The data collected at (x: 706970.9093 m, y: 6035941.0226 m, z: 1930009.5821 m) are used to illustrate the performance of the KF with recursive covariance estimation and the KF with computational-intelligence correction by means of SAR (Search and Rescue) tuned covariance, when the covariance matrices of the process and measurement noises are completely unknown in advance.
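The flavour of recursive covariance estimation can be illustrated with a scalar random-walk Kalman filter that re-estimates its process-noise variance from the innovation sequence. This is a generic innovation-based sketch under simplifying assumptions, not the MKF-RCE recursion or the SAR-tuned covariance from the paper:

```python
import numpy as np

def kf_recursive_q(z, r, q0=1.0, alpha=0.05):
    """Scalar Kalman filter for a random-walk state x_k = x_{k-1} + w_k,
    measurement z_k = x_k + v_k with Var(v) = r. The unknown process-noise
    variance q is re-estimated from squared innovations with exponential
    forgetting (rate alpha). Returns the filtered state sequence."""
    x, p, q = z[0], 1.0, q0
    xs = []
    for zk in z[1:]:
        p = p + q                  # predict: covariance grows by process noise
        s = p + r                  # innovation variance
        g = p / s                  # Kalman gain
        nu = zk - x                # innovation (measurement residual)
        x = x + g * nu             # update state
        p = (1 - g) * p            # update covariance
        # recursive covariance estimate: E[nu^2] ~ p_pred + r, so nu^2 - r
        # is a (noisy, floored) proxy for the process-noise contribution
        q = (1 - alpha) * q + alpha * max(nu * nu - r, 1e-9)
        xs.append(x)
    return np.array(xs)
```

As the innovations settle, q shrinks toward the true process-noise level, the gain drops, and the filter smooths harder, which is the behaviour an a-priori-unknown covariance otherwise prevents.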

