UNSUPERVISED CLUSTERING USING FRACTAL DIMENSION

Clustering can be defined as the process of "grouping" a collection of objects into subsets or clusters. The clustering problem has been addressed in numerous contexts and by researchers in different disciplines. This reflects its broad appeal and usefulness as an exploratory data analysis approach. Unsupervised clustering algorithms have been developed to address real world problems in which the number of clusters present in the dataset is unknown. These algorithms approximate the number of clusters while performing the clustering procedure. This paper is a first step towards the development of unsupervised clustering algorithms capable of identifying clusters within clusters. To this end, an unsupervised clustering algorithm is modified so as to take into consideration the fractal dimension of the data. The experimental results indicate that this approach can provide further qualitative information compared to the unsupervised clustering algorithm.
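The abstract does not say how the fractal dimension is estimated. As a minimal sketch, assuming a box-counting estimator (one common choice, not necessarily the one used in the paper), the dimension of a point set can be approximated as the slope of log N(eps) against log(1/eps), where N(eps) is the number of grid boxes of side eps that contain at least one point. The function name and the choice of box sizes are illustrative assumptions.

```python
import math


def box_counting_dimension(points, box_sizes):
    """Estimate the box-counting dimension of a set of points.

    points    : iterable of coordinate tuples, e.g. [(x, y), ...]
    box_sizes : iterable of box side lengths eps (hypothetical choice of scales)

    Returns the least-squares slope of log N(eps) versus log(1/eps).
    """
    log_inv_eps, log_counts = [], []
    for eps in box_sizes:
        # Each point falls into the grid box indexed by floor(coordinate / eps).
        occupied = {tuple(math.floor(c / eps) for c in p) for p in points}
        log_inv_eps.append(math.log(1.0 / eps))
        log_counts.append(math.log(len(occupied)))

    # Ordinary least-squares slope through the (log 1/eps, log N) pairs.
    n = len(log_inv_eps)
    mean_x = sum(log_inv_eps) / n
    mean_y = sum(log_counts) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(log_inv_eps, log_counts))
    den = sum((x - mean_x) ** 2 for x in log_inv_eps)
    return num / den


# Usage: points on a straight line versus points filling a square.
sizes = [1 / 2, 1 / 4, 1 / 8, 1 / 16]
line = [(i / 1024, i / 1024) for i in range(1024)]
grid = [(i / 32, j / 32) for i in range(32) for j in range(32)]
line_dim = box_counting_dimension(line, sizes)
grid_dim = box_counting_dimension(grid, sizes)
```

On these synthetic inputs the estimate is close to 1 for the line and close to 2 for the filled square. On real data the result depends strongly on the range of box sizes chosen.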


Stability-Based Validation of Clustering Solutions

Neural Computation ◽

10.1162/089976604773717621 ◽

2004 ◽

Vol 16 (6) ◽

pp. 1299-1323 ◽

Cited By ~ 248

Author(s):

Tilman Lange ◽

Volker Roth ◽

Mikio L. Braun ◽

Joachim M. Buhmann

Keyword(s):

Clustering Algorithm ◽

Group Structure ◽

Data Sets ◽

Expression Data ◽

Number Of Clusters ◽

Natural Group ◽

Exploratory Data ◽

Class Labels ◽

Validation Tool ◽

Real World Problems

Data clustering describes a set of frequently employed techniques in exploratory data analysis to extract “natural” group structure in data. Such groupings need to be validated to separate the signal in the data from spurious structure. In this context, finding an appropriate number of clusters is a particularly important model selection question. We introduce a measure of cluster stability to assess the validity of a cluster model. This stability measure quantifies the reproducibility of clustering solutions on a second sample, and it can be interpreted as a classification risk with regard to class labels produced by a clustering algorithm. The preferred number of clusters is determined by minimizing this classification risk as a function of the number of clusters. Convincing results are achieved on simulated as well as gene expression data sets. Comparisons to other methods demonstrate the competitive performance of our method and its suitability as a general validation tool for clustering solutions in real-world problems.
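The stability idea can be illustrated with a heavily simplified sketch: cluster one sample, use that clustering as a classifier to label a second sample, cluster the second sample independently, and count the disagreements after the best relabeling. Everything below is an assumption made for illustration: a tiny deterministic 1-D k-means, a nearest-centroid classifier, and a brute-force search over label permutations. The normalization by the stability of random labelings is omitted.

```python
from itertools import permutations


def kmeans_1d(xs, k, iters=50):
    """Tiny deterministic 1-D k-means (illustrative stand-in for any clusterer)."""
    s = sorted(xs)
    # Start the centers at evenly spaced order statistics.
    centers = [s[(2 * j + 1) * len(s) // (2 * k)] for j in range(k)]
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: abs(x - centers[j])) for x in xs]
        for j in range(k):
            members = [x for x, lab in zip(xs, labels) if lab == j]
            if members:  # leave an empty cluster's center where it is
                centers[j] = sum(members) / len(members)
    labels = [min(range(k), key=lambda j: abs(x - centers[j])) for x in xs]
    return centers, labels


def instability(sample_a, sample_b, k):
    """Fraction of sample_b on which two labelings disagree, after the best relabeling."""
    centers_a, _ = kmeans_1d(sample_a, k)
    # Classifier derived from the clustering of sample_a: nearest centroid.
    predicted = [min(range(k), key=lambda j: abs(x - centers_a[j])) for x in sample_b]
    # Independent clustering of sample_b.
    _, clustered = kmeans_1d(sample_b, k)
    # Cluster labels are arbitrary, so minimize the disagreement over permutations.
    best = min(
        sum(perm[c] != p for c, p in zip(clustered, predicted))
        for perm in permutations(range(k))
    )
    return best / len(sample_b)


# Usage: two well-separated groups are perfectly reproducible with k = 2.
a = [0.0, 0.1, 0.2, 10.0, 10.1, 10.2]
b = [0.05, 0.15, 9.9, 10.05]
risk_k2 = instability(a, b, 2)
risk_k3 = instability(a, b, 3)
```

In the spirit of the abstract, one would evaluate this quantity for several candidate values of k, averaged over many sample pairs, and prefer the k with the lowest value. Brute-force permutation matching is only practical for small k.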


Multi-Attribute Utility Theory Based K-Means Clustering Applications

International Journal of Data Warehousing and Mining ◽

10.4018/ijdwm.2017040101 ◽

2017 ◽

Vol 13 (2) ◽

pp. 1-12 ◽

Cited By ~ 2

Author(s):

Jungmok Ma

Keyword(s):

Cluster Analysis ◽

Utility Theory ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

User Preferences ◽

Number Of Clusters ◽

Clustering Problem ◽

Multi Attribute Utility Theory ◽

Systematic Framework ◽

Selection Of

One of the major obstacles in applying the k-means clustering algorithm is the selection of the number of clusters k. The multi-attribute utility theory (MAUT)-based k-means clustering algorithm is proposed to tackle this problem by incorporating user preferences. Using MAUT, the decision maker's value structure for the number of clusters and other attributes can be modeled quantitatively and used as the objective function of k-means. A target clustering problem from the military targeting process is used to demonstrate the MAUT-based k-means and to provide a comparative study. The results show that existing clustering algorithms do not necessarily reflect user preferences, while the MAUT-based k-means provides a systematic framework for preference modeling in cluster analysis.


A novel bidirectional clustering algorithm based on local density

Scientific Reports ◽

10.1038/s41598-021-93244-2 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Baicheng Lyu ◽

Wenhua Wu ◽

Zhiqiang Hu

Keyword(s):

Clustering Algorithm ◽

Local Density ◽

Clustering Algorithms ◽

Cluster Number ◽

Denoising Method ◽

Number Of Clusters ◽

Data Points ◽

Cutoff Distance ◽

Large Clusters ◽

Small Clusters

With the wide application of cluster analysis, the number of clusters is gradually increasing, as is the difficulty of selecting indicators for judging the number of clusters. Also, small clusters are crucial to discovering the extreme characteristics of data samples, but current clustering algorithms focus mainly on analyzing large clusters. In this paper, a bidirectional clustering algorithm based on local density (BCALoD) is proposed. BCALoD establishes the connection between data points based on local density, can automatically determine the number of clusters, is more sensitive to small clusters, and reduces the number of parameters to be tuned to a minimum. On the basis of the robustness of the cluster number to noise, a denoising method suitable for BCALoD is proposed. A different cutoff distance and cutoff density are assigned to each data cluster, which results in improved clustering performance. The clustering ability of BCALoD is verified on randomly generated datasets and city light satellite images.


An Adaptive Multiobjective Genetic Algorithm with Fuzzy c-Means for Automatic Data Clustering

Mathematical Problems in Engineering ◽

10.1155/2018/6123874 ◽

2018 ◽

Vol 2018 ◽

pp. 1-13 ◽

Cited By ~ 2

Author(s):

Ze Dong ◽

Hao Jia ◽

Miao Liu

Keyword(s):

Genetic Algorithm ◽

Fuzzy Clustering ◽

Clustering Algorithm ◽

Majority Vote ◽

Clustering Algorithms ◽

NSGA-II ◽

Number Of Clusters ◽

Automatic Data ◽

Multiobjective Genetic Algorithm ◽

Fuzzy Clustering Method

This paper presents a fuzzy clustering method based on a multiobjective genetic algorithm. The ADNSGA2-FCM algorithm was developed to solve the clustering problem by combining the fuzzy clustering algorithm (FCM) with the multiobjective genetic algorithm (NSGA-II) and introducing an adaptive mechanism. The algorithm does not require the number of clusters to be given in advance. After the initial number of clusters and the center coordinates are chosen randomly, the optimal solution set is found by the multiobjective evolutionary algorithm. After the optimal number of clusters is determined by majority vote, the Jm value is continuously optimized through the combination of the Canonical Genetic Algorithm and FCM, and finally the best clustering result is obtained. Verification on standard UCI datasets and comparison with existing single-objective and multiobjective clustering algorithms demonstrate the effectiveness of this method.
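For orientation, Jm in standard fuzzy c-means is the membership-weighted sum of squared distances between points and cluster centers, with fuzzifier m > 1. The sketch below shows only the textbook FCM membership update and objective for 1-D data. It does not reproduce the paper's genetic or adaptive components, and the function names are illustrative assumptions.

```python
def fcm_memberships(points, centers, m=2.0):
    """Textbook FCM membership update: u_ij = 1 / sum_k (d_ij / d_ik)^(2 / (m - 1))."""
    exponent = 2.0 / (m - 1.0)
    memberships = []
    for x in points:
        dists = [abs(x - c) for c in centers]
        if 0.0 in dists:
            # The point coincides with one or more centers: split full membership among them.
            hits = [1.0 if d == 0.0 else 0.0 for d in dists]
            total = sum(hits)
            memberships.append([h / total for h in hits])
            continue
        row = [1.0 / sum((d_j / d_k) ** exponent for d_k in dists) for d_j in dists]
        memberships.append(row)
    return memberships


def fcm_objective(points, centers, memberships, m=2.0):
    """Jm = sum_i sum_j u_ij^m * (x_i - c_j)^2 for 1-D data."""
    return sum(
        (u ** m) * (x - c) ** 2
        for x, row in zip(points, memberships)
        for u, c in zip(row, centers)
    )


# Usage: two centers at 0 and 2; the point at 1 is equidistant from both.
pts = [1.0, 0.0, 0.5]
ctrs = [0.0, 2.0]
u = fcm_memberships(pts, ctrs)
jm = fcm_objective([1.0], ctrs, [u[0]])
```

Each row of memberships sums to 1, and a point equidistant from both centers receives a membership of 0.5 in each. A lower Jm indicates a tighter fuzzy partition for a fixed number of clusters.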


A Quantitative Discriminant Method of Elbow Point for the Optimal Number of Clusters in Clustering Algorithm

10.21203/rs.3.rs-58011/v3 ◽

2021 ◽

Author(s):

Congming Shi ◽

Bingtao Wei ◽

Shoulin Wei ◽

Wen Wang ◽

Hai Liu ◽

...

Keyword(s):

Clustering Algorithm ◽

Clustering Algorithms ◽

Optimal Number ◽

Machine Learning Method ◽

Cluster Number ◽

Number Of Clusters ◽

Public Dataset ◽

Optimal Cluster ◽

Better Than ◽

Optimal Number Of Clusters

Clustering, a traditional machine learning method, plays a significant role in data analysis. Most clustering algorithms depend on a predetermined exact number of clusters, whereas, in practice, the number of clusters is usually unpredictable. Although the Elbow method is one of the most commonly used methods for determining the optimal cluster number, it depends on manual identification of the elbow point on a visualization curve. Thus, even experienced analysts cannot clearly identify the elbow point when the plotted curve is fairly smooth. To solve this problem, a new elbow point discriminant method is proposed that yields a statistical metric for estimating the optimal cluster number of a dataset. First, the average degree of distortion obtained by the Elbow method is normalized to the range of 0 to 10. Second, the normalized results are used to calculate the cosine of the intersection angles between elbow points. Third, the calculated cosines and the arccosine are used to compute the intersection angles between elbow points. Finally, the index of the minimal intersection angle is used as the estimated optimal cluster number. Experimental results on simulated datasets and a well-known public dataset (the Iris dataset) show that the optimal cluster number estimated by the proposed method is better than that of the widely used Silhouette method.
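The four steps in this abstract can be sketched as follows. This is one reading of the abstract, not the authors' implementation. It assumes that the horizontal coordinate of each point is the cluster count itself, that the angle at each interior point is taken between the segments to its two neighbors, and that the distortion values are supplied by the caller. The function name is hypothetical.

```python
import math


def elbow_by_min_angle(distortions, first_k=1):
    """Pick the cluster count whose point on the distortion curve has the sharpest angle.

    distortions : average distortion for k = first_k, first_k + 1, ...
                  (at least three values are needed)
    """
    lo, hi = min(distortions), max(distortions)
    if hi == lo:
        raise ValueError("distortions are constant; no elbow can be located")
    # Step 1: rescale the distortions to the range 0 to 10.
    scaled = [10.0 * (d - lo) / (hi - lo) for d in distortions]

    best_k, best_angle = None, math.inf
    for i in range(1, len(scaled) - 1):
        # Vectors from point i to its left and right neighbors (x step is 1).
        ax, ay = -1.0, scaled[i - 1] - scaled[i]
        bx, by = 1.0, scaled[i + 1] - scaled[i]
        # Step 2: cosine of the angle between the two segments.
        cos_angle = (ax * bx + ay * by) / (math.hypot(ax, ay) * math.hypot(bx, by))
        # Step 3: arccosine, clamped to guard against rounding outside [-1, 1].
        angle = math.acos(max(-1.0, min(1.0, cos_angle)))
        # Step 4: keep the index with the minimal angle.
        if angle < best_angle:
            best_k, best_angle = first_k + i, angle
    return best_k


# Usage with made-up distortion values for k = 1..6.
estimated_k = elbow_by_min_angle([100, 40, 10, 8, 7, 6])
```

With these made-up values the curve bends most sharply at k = 3, so that is the value returned. The first and last points have only one neighbor and are never candidates in this sketch.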


Clustering with Scikit-Learn in Python

The Programming Historian ◽

10.46430/phen0094 ◽

2021 ◽

Author(s):

Thomas Jurczyk

Keyword(s):

Data Analysis ◽

Exploratory Data Analysis ◽

Clustering Algorithms ◽

Use Cases ◽

Use Case ◽

Greco Roman ◽

Textual Data ◽

Exploratory Data ◽

Second Use

This tutorial demonstrates how to apply clustering algorithms with Python to a dataset with two concrete use cases. The first example uses clustering to identify meaningful groups of Greco-Roman authors based on their publications and their reception. The second use case applies clustering algorithms to textual data in order to discover thematic groups. After finishing this tutorial, you will be able to use clustering in Python with Scikit-learn applied to your own data, adding an invaluable method to your toolbox for exploratory data analysis.


A novel bidirectional clustering algorithm based on local density

10.21203/rs.3.rs-141525/v1 ◽

2021 ◽

Author(s):

BAICHENG LV ◽

WENHUA WU ◽

ZHIQIANG HU

Keyword(s):

Clustering Algorithm ◽

Local Density ◽

Clustering Algorithms ◽

Cluster Number ◽

Denoising Method ◽

Number Of Clusters ◽

Data Points ◽

Cutoff Distance ◽

Large Clusters ◽

Small Clusters

With the wide application of cluster analysis, the number of clusters is gradually increasing, as is the difficulty of selecting indicators for judging the number of clusters. Also, small clusters are crucial to discovering the extreme characteristics of data samples, but current clustering algorithms focus mainly on analyzing large clusters. In this paper, a bidirectional clustering algorithm based on local density (BCALoD) is proposed. BCALoD establishes the connection between data points based on local density, can automatically determine the number of clusters, is more sensitive to small clusters, and reduces the number of parameters to be tuned to a minimum. On the basis of the robustness of the cluster number to noise, a denoising method suitable for BCALoD is proposed. A different cutoff distance and cutoff density are assigned to each data cluster, which results in improved clustering performance. The clustering ability of BCALoD is verified on randomly generated datasets and city light satellite images.


Rough ISODATA Algorithm

International Journal of Fuzzy System Applications ◽

10.4018/ijfsa.2013100101 ◽

2013 ◽

Vol 3 (4) ◽

pp. 1-14 ◽

Cited By ~ 2

Author(s):

S. Sampath ◽

B. Ramya

Keyword(s):

Clustering Algorithm ◽

Clustering Algorithms ◽

Real Life ◽

Vital Role ◽

Data Sets ◽

Clustering Method ◽

Data Set ◽

Number Of Clusters ◽

Real Life Data ◽

Nonparametric Statistical

Cluster analysis is a branch of data mining that plays a vital role in bringing out hidden information in databases. Clustering algorithms help medical researchers identify natural subgroups in a data set. Different types of clustering algorithms are available in the literature, the most popular being k-means clustering. Although k-means is widely used, its application requires knowledge of the number of clusters present in the given data set. Several solutions are available in the literature to overcome this limitation. The k-means clustering method creates a disjoint and exhaustive partition of the data set. However, in some situations one can come across objects that belong to more than one cluster. In this paper, a clustering algorithm is proposed that produces rough clusters automatically, without requiring the user to supply the number of clusters. The efficiency of the algorithm in detecting the number of clusters present in the data set has been studied with the help of some real-life data sets. Further, a nonparametric statistical analysis of the experimental results has been carried out, using a rough version of the Davies-Bouldin index, to assess the efficiency of the proposed algorithm in automatically detecting the number of clusters in the data set.


Ant Clustering Algorithms

International Journal of Applied Evolutionary Computation ◽

10.4018/jaec.2010010101 ◽

2010 ◽

Vol 1 (1) ◽

pp. 1-15 ◽

Cited By ~ 2

Author(s):

Yu-Chiun Chiou ◽

Shih-Ta Chou

Keyword(s):

Clustering Algorithm ◽

Clustering Algorithms ◽

Small Scale ◽

Solution Stability ◽

Clustering Methods ◽

Scale Problem ◽

Clustering Problem ◽

Genetic Clustering ◽

Fully Connected ◽

Pheromone Trail

This paper proposes three ant clustering algorithms (ACAs): ACA-1, ACA-2 and ACA-3. The core logic of the proposed ACAs is to modify the ant colony metaheuristic by reformulating the clustering problem as a network problem. For a clustering problem of N objects and K clusters, a fully connected network of N nodes is formed, with each link cost representing the dissimilarity of the two nodes it connects. K ants then collect their own nodes according to the link costs and the pheromone trail laid by previous ants. The three proposed ACAs have been validated on a small-scale problem solved by total enumeration. The solution effectiveness at different problem scales consistently shows that ACA-2 performs best among the three ACAs. A further comparison of ACA-2 with other commonly used clustering methods, including the agglomerative hierarchical clustering algorithm (AHCA), the K-means algorithm (KMA) and the genetic clustering algorithm (GCA), shows that ACA-2 significantly outperforms them in solution effectiveness in most cases and also performs considerably better in solution stability as the problem scale or the number of clusters grows.


Bi-cross validation of spectral clustering hyperparameters

Powder Diffraction ◽

10.1017/s0885715620000214 ◽

2020 ◽

Vol 35 (2) ◽

pp. 112-116

Author(s):

Sioan Zohar ◽

Chun Hong Yoon

Keyword(s):

Spectral Clustering ◽

Cross Validation ◽

Clustering Algorithms ◽

Scattering Data ◽

Number Of Clusters ◽

X Ray ◽

Clustering Problem ◽

X Ray Scattering ◽

Linac Coherent Light Source ◽

Ray Scattering

One challenge impeding the analysis of terabyte scale X-ray scattering data from the Linac Coherent Light Source (LCLS) is determining the number of clusters required for the execution of traditional clustering algorithms. Here, we demonstrate that the previous work using bi-cross validation to determine the number of singular vectors directly maps to the spectral clustering problem of estimating both the number of clusters and hyperparameter values. Applying this method to LCLS X-ray scattering data enables the identification of dropped shots without manually setting boundaries on detector fluence and provides a path toward identifying rare and anomalous events.
