SMART: a subspace clustering algorithm that automatically identifies the appropriate number of clusters

Abstract: With the wide application of cluster analysis, the number of clusters is gradually increasing, as is the difficulty of selecting indicators for judging the number of clusters. Small clusters are also crucial to discovering the extreme characteristics of data samples, but current clustering algorithms focus mainly on analyzing large clusters. In this paper, a bidirectional clustering algorithm based on local density (BCALoD) is proposed. BCALoD establishes connections between data points based on local density, automatically determines the number of clusters, is more sensitive to small clusters, and reduces the number of parameters to be tuned to a minimum. Building on the robustness of the cluster number to noise, a denoising method suited to BCALoD is proposed. A different cutoff distance and cutoff density are assigned to each data cluster, which improves clustering performance. The clustering ability of BCALoD is verified on randomly generated datasets and city-light satellite images.

A meta-heuristic density-based subspace clustering algorithm for high-dimensional data

Soft Computing ◽

10.1007/s00500-021-05973-1 ◽

2021 ◽

Author(s):

Parul Agarwal ◽

Shikha Mehta ◽

Ajith Abraham

Keyword(s):

Clustering Algorithm ◽

High Dimensional Data ◽

Subspace Clustering ◽

High Dimensional

Automatic image annotation and retrieval using subspace clustering algorithm

Proceedings of the 2nd ACM international workshop on Multimedia databases - MMDB '04 ◽

10.1145/1032604.1032621 ◽

2004 ◽

Cited By ~ 25

Author(s):

Lei Wang ◽

Li Liu ◽

Latifur Khan

Keyword(s):

Clustering Algorithm ◽

Image Annotation ◽

Subspace Clustering ◽

Automatic Image Annotation

A dynamic genetic clustering algorithm for automatic choice of the number of clusters

2011 9th IEEE International Conference on Control and Automation (ICCA) ◽

10.1109/icca.2011.6137921 ◽

2011 ◽

Cited By ~ 2

Author(s):

Hong He ◽

Yonghong Tan

Keyword(s):

Clustering Algorithm ◽

Number Of Clusters ◽

Genetic Clustering ◽

Automatic Choice

Self-Adaptive K-Means Based on a Covering Algorithm

Complexity ◽

10.1155/2018/7698274 ◽

2018 ◽

Vol 2018 ◽

pp. 1-16 ◽

Cited By ~ 1

Author(s):

Yiwen Zhang ◽

Yuanyuan Zhou ◽

Xing Guo ◽

Jintao Wu ◽

Qiang He ◽

...

Keyword(s):

Large Scale ◽

Clustering Algorithm ◽

Real Data ◽

Second Phase ◽

Data Sets ◽

Number Of Clusters ◽

Large Scale Data ◽

Long Time ◽

Two Phases ◽

Selection Of

The K-means algorithm is one of the ten classic algorithms in data mining and has long been studied by researchers in numerous fields. However, the number of clusters k in the K-means algorithm is not always easy to determine, and the selection of the initial centers is vulnerable to outliers. This paper proposes an improved K-means clustering algorithm called the covering K-means algorithm (C-K-means). The C-K-means algorithm not only produces efficient and accurate clustering results but also self-adaptively provides a reasonable number of clusters based on the data features. It consists of two phases: initialization by the covering algorithm (CA) and the Lloyd iteration of K-means. The first phase executes the CA. The CA self-organizes and recognizes the number of clusters k from the similarities in the data, and it requires neither the number of clusters to be prespecified nor the initial centers to be manually selected. It therefore has a “blind” feature, that is, k is not preselected. The second phase performs the Lloyd iteration starting from the results of the first phase. The C-K-means algorithm thus combines the advantages of CA and K-means. Experiments carried out on the Spark platform verify the good scalability of the C-K-means algorithm, which can effectively handle large-scale data clustering. Extensive experiments on real data sets show that the C-K-means algorithm outperforms existing algorithms in accuracy and efficiency under both sequential and parallel conditions.
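
The two-phase structure described above can be sketched roughly as follows. The abstract does not specify how the covering algorithm builds its covers, so the greedy fixed-radius covering below is only a stand-in assumption; the names `greedy_cover`, `lloyd`, and the `radius` parameter are hypothetical and are not taken from the paper.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def greedy_cover(points, radius):
    """Phase 1 (assumed stand-in): repeatedly take the first uncovered point,
    cover everything within `radius` of it, and record the cover's mean.
    The number of covers found plays the role of k."""
    uncovered = list(points)
    centers = []
    while uncovered:
        seed = uncovered[0]
        members = [p for p in uncovered if sq_dist(p, seed) <= radius ** 2]
        centers.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
        uncovered = [p for p in uncovered if sq_dist(p, seed) > radius ** 2]
    return centers


def lloyd(points, centers, iters=100):
    """Phase 2: standard Lloyd iteration started from the phase-1 centers."""
    labels = []
    for _ in range(iters):
        labels = [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
                  for p in points]
        new_centers = []
        for j, old in enumerate(centers):
            group = [p for p, lab in zip(points, labels) if lab == j]
            # Keep the old center if a cluster happens to be empty.
            new_centers.append(
                tuple(sum(axis) / len(group) for axis in zip(*group)) if group else old
            )
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels


# Synthetic toy data: three well-separated groups of four points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (20, 0), (20, 1), (21, 0), (21, 1)]
initial = greedy_cover(points, radius=3)   # k is discovered, not supplied
centers, labels = lloyd(points, initial)
```

Note that this stand-in still needs a radius, whereas the abstract describes the CA as requiring no prespecified k or initial centers; how the paper avoids such a parameter is not stated in the text above.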

ENTROPY-BASED CLUSTER VALIDATION AND ESTIMATION OF THE NUMBER OF CLUSTERS IN GENE EXPRESSION DATA

Journal of Bioinformatics and Computational Biology ◽

10.1142/s0219720012500114 ◽

2012 ◽

Vol 10 (05) ◽

pp. 1250011

Author(s):

NATALIA NOVOSELOVA ◽

IGOR TOM

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Clustering Algorithm ◽

Selection Procedure ◽

Biological Knowledge ◽

Consensus Clustering ◽

Expression Data ◽

Cluster Validation ◽

Number Of Clusters ◽

Validity Measure

Many external and internal validity measures have been proposed to estimate the number of clusters in gene expression data, but as a rule they do not analyze the stability of the groupings produced by a clustering algorithm. Based on an approach that assesses the predictive power or stability of a partitioning, we propose a new measure of cluster validation and a selection procedure to determine a suitable number of clusters. The validity measure is based on estimating the "clearness" of the consensus matrix, which is the result of a resampling clustering scheme, or consensus clustering. In the proposed selection procedure, the stable clustering result is determined with reference to the validity measure under the null hypothesis, which encodes the absence of clusters. The final number of clusters is selected by analyzing the distance between the validity plots for the initial and permuted data sets. We applied the selection procedure to evaluate the clustering results on several datasets. The proposed procedure produced an accurate and robust estimate of the number of clusters, in agreement with biological knowledge and gold standards of cluster quality.
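
As a rough illustration of the consensus matrix mentioned above, the sketch below uses the usual consensus-clustering definition: entry (i, j) is the fraction of resampling runs containing both items in which they were placed in the same cluster. The function name `consensus_matrix` and the input format are assumptions for illustration; the paper's "clearness" measure is not defined in the abstract and is therefore not implemented here.

```python
def consensus_matrix(runs, n):
    """Build an n x n consensus matrix from resampled clusterings.

    `runs` is a list of dicts mapping item index -> cluster label, one dict
    per resampling run, containing only the items sampled in that run.
    """
    together = [[0] * n for _ in range(n)]   # times i and j shared a cluster
    sampled = [[0] * n for _ in range(n)]    # times i and j were both sampled
    for labels in runs:
        items = sorted(labels)
        for i in items:
            for j in items:
                sampled[i][j] += 1
                if labels[i] == labels[j]:
                    together[i][j] += 1
    return [
        [together[i][j] / sampled[i][j] if sampled[i][j] else 0.0 for j in range(n)]
        for i in range(n)
    ]


# Made-up labels from three resampling runs over four items (illustration only).
runs = [
    {0: 0, 1: 0, 2: 1},
    {0: 1, 1: 1, 3: 0},
    {1: 0, 2: 1, 3: 1},
]
M = consensus_matrix(runs, n=4)
```

Entries close to 0 or 1 indicate item pairs that are consistently separated or grouped across runs, which is the kind of structure a validity measure on the consensus matrix would reward.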

TCLUST: Trimming Approach of Robust Clustering Method

Malaysian Journal of Fundamental and Applied Sciences ◽

10.11113/mjfas.v8n4.154 ◽

2014 ◽

Vol 8 (4) ◽

Author(s):

Muhamad Alias Md. Jedi ◽

Robiah Adnan

Keyword(s):

Clustering Algorithm ◽

Likelihood Function ◽

R Package ◽

Clustering Method ◽

Number Of Clusters ◽

Robust Clustering ◽

Scatter Matrix ◽

Group Assignment ◽

Log Likelihood ◽

Clustering Approach

TCLUST is a statistical clustering method based on a modification of the trimmed k-means clustering algorithm. It is called a “crisp” clustering approach because each observation is either eliminated (trimmed) or assigned to a group. TCLUST strengthens the group assignment by placing constraints on the cluster scatter matrix. The emphasis of this paper is on restricting the eigenvalues, λ, of the scatter matrix. The idea of imposing constraints is to maximize the log-likelihood function of the spurious-outlier model. A review of different robust clustering approaches is presented for comparison with the TCLUST method. This paper discusses the nature of the TCLUST algorithm, how to properly determine the number of clusters or groups, and how to measure the strength of the group assignment. Finally, the R package for TCLUST implements these types of scatter restriction, making the algorithm more flexible in choosing the number of clusters and the trimming proportion.
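
The trimmed k-means idea that TCLUST builds on can be sketched as below: at each iteration the fraction `alpha` of observations farthest from their nearest center is left unassigned, and the centers are updated from the remaining points only. This is a minimal sketch under that assumption; it does not include TCLUST's eigenvalue constraints on the scatter matrices, and the function name `trimmed_kmeans` is hypothetical rather than the R package's API.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def trimmed_kmeans(points, centers, alpha, iters=100):
    """Plain trimmed k-means: label None marks a trimmed observation."""
    n_trim = int(round(alpha * len(points)))
    labels = [None] * len(points)
    for _ in range(iters):
        # Distance to, and index of, the nearest center for every point.
        scored = []
        for idx, p in enumerate(points):
            d, j = min((sq_dist(p, c), j) for j, c in enumerate(centers))
            scored.append((d, idx, j))
        scored.sort()                                   # closest points first
        kept = scored[: len(points) - n_trim]           # drop the farthest ones
        labels = [None] * len(points)
        for _, idx, j in kept:
            labels[idx] = j
        # Update each center from its untrimmed members only.
        new_centers = []
        for j, old in enumerate(centers):
            group = [points[i] for i, lab in enumerate(labels) if lab == j]
            new_centers.append(
                tuple(sum(axis) / len(group) for axis in zip(*group)) if group else old
            )
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels


# Made-up data: two tight pairs plus one obvious outlier.
points = [(0, 0), (0, 1), (10, 10), (10, 11), (50, 50)]
centers, labels = trimmed_kmeans(points, [(0, 0), (10, 10)], alpha=0.2)
```

With `alpha=0.2` and five points, exactly one observation is trimmed per iteration, so the outlier does not pull either center toward it.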

SISTEM APLIKASI BERBASIS OPTIMASI METODE ELBOW UNTUK PENENTUAN CLUSTERING PELANGGAN [An Application System Based on Elbow Method Optimization for Determining Customer Clustering]

JOUTICA ◽

10.30736/jti.v3i1.196 ◽

2018 ◽

Vol 3 (1) ◽

pp. 117 ◽

Cited By ~ 1

Author(s):

Elly Muningsih ◽

Sri Kiswati

Keyword(s):

Data Center ◽

Clustering Algorithm ◽

Visual Basic ◽

Customer Management ◽

Number Of Clusters ◽

Transaction Data ◽

Development Method ◽

Popular Method ◽

Or Groups ◽

Cluster 2

Customers are a very important asset for a company, and having loyal customers is essential to the company's progress. This study aims to help companies, especially online shops, manage customers better by identifying and grouping them into several clusters to learn the characteristics of their loyalty to the company. The method used in this research is the K-Means method, one of the best and most popular clustering algorithms. To overcome the weakness of K-Means in determining the number of clusters, we use the Elbow method, which compares candidate numbers of clusters by calculating the SSE (sum of squared errors) for each number of clusters. The research starts by collecting the data to be processed. Of the 478 transaction records collected, 73 remained after data cleaning. The data were then processed with RapidMiner for cluster counts from 2 to 10 to find the center of each cluster. The calculated SSE values show that the best number of clusters is 3. The end result of the research is a Visual Basic application that is expected to make grouping or clustering customers easier. The software was developed using the Waterfall method.
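
The Elbow procedure described above can be illustrated with a small sketch: run K-Means for a range of k, record the SSE for each, and pick the k where the SSE curve bends most sharply. The study itself used RapidMiner and does not state how the bend was identified, so the plain-Python K-Means, the farthest-point initialization, and the second-difference rule in `elbow` are all assumptions for illustration, and the data are synthetic rather than the study's transaction records.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def farthest_point_init(points, k):
    """Deterministic initialization: start at the first point, then keep
    adding the point farthest from all centers chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    return centers


def kmeans_sse(points, k, iters=100):
    """Run Lloyd's K-Means and return the sum of squared errors (SSE)."""
    centers = farthest_point_init(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centers[j]))
            groups[nearest].append(p)
        new_centers = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else centers[j]
            for j, g in enumerate(groups)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


def elbow(sse):
    """Pick the interior k with the largest drop in marginal SSE improvement.
    Assumes the keys of `sse` are consecutive integers."""
    ks = sorted(sse)
    return max(ks[1:-1], key=lambda k: (sse[k - 1] - sse[k]) - (sse[k] - sse[k + 1]))


# Synthetic toy data: three well-separated groups of four points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (20, 0), (20, 1), (21, 0), (21, 1)]
sse = {k: kmeans_sse(points, k) for k in range(1, 7)}
best_k = elbow(sse)
```

On this toy data the SSE falls steeply up to k = 3 and barely changes afterwards, which is the pattern the Elbow method looks for.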

A soft subspace clustering algorithm with log-transformed distances

Big Data and Information Analytics ◽

10.3934/bdia.2016.1.93 ◽

2015 ◽

Vol 1 (1) ◽

pp. 93-109 ◽

Cited By ~ 3

Author(s):

Guojun Gan ◽

Kun Chen

Keyword(s):

Clustering Algorithm ◽

Subspace Clustering
