Research on K-Value Selection Method of K-Means Clustering Algorithm

Chunhui Yuan; Haitao Yang

doi:10.3390/j2020016

Research on K-Value Selection Method of K-Means Clustering Algorithm

J ◽

10.3390/j2020016 ◽

2019 ◽

Vol 2 (2) ◽

pp. 226-235 ◽

Cited By ~ 18

Author(s):

Chunhui Yuan ◽

Haitao Yang

Keyword(s):

Clustering Algorithm ◽

Clustering Algorithms ◽

Simple Algorithm ◽

Convergence Result ◽

Data Set ◽

K Value ◽

Standard Data ◽

Advantages And Disadvantages ◽

Gap Statistic ◽

Selection Algorithms

Among many clustering algorithms, the K-means clustering algorithm is widely used because of its simplicity and fast convergence. However, the K-value must be given in advance, and the choice of K-value directly affects the convergence result. To address this problem, we analyze four K-value selection algorithms, namely the Elbow Method, Gap Statistic, Silhouette Coefficient, and Canopy; give pseudocode for each algorithm; and use the standard Iris data set for experimental verification. Finally, the verification results are evaluated, the advantages and disadvantages of the four algorithms for K-value selection are given, and the clustering range of the data set is pointed out.
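As an illustration of the Elbow Method named in this abstract, the following is a minimal sketch and not the paper's pseudocode. It assumes a small synthetic data set in place of Iris, a deterministic centroid initialisation, and a "largest second difference of the SSE curve" rule for locating the elbow; the function name `kmeans_sse` is hypothetical.

```python
import math

def kmeans_sse(points, k, iters=50):
    """Run plain Lloyd's k-means and return the sum of squared errors (SSE)."""
    # Assumption: deterministic start, taking k points spread evenly through the list.
    step = len(points) // k
    centroids = [points[i * step] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its members (keep it if the cluster is empty).
        new_centroids = [
            tuple(sum(dim) / len(members) for dim in zip(*members)) if members else centroids[c]
            for c, members in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return sum(min(math.dist(p, c) ** 2 for c in centroids) for p in points)

# Three well-separated synthetic blobs (illustrative data, not Iris).
points = [
    (0, 0), (0, 1), (1, 0), (1, 1),
    (10, 10), (10, 11), (11, 10), (11, 11),
    (20, 0), (20, 1), (21, 0), (21, 1),
]

sse = {k: kmeans_sse(points, k) for k in range(1, 6)}

# Elbow heuristic (one of several possible): the k where the SSE curve bends most sharply.
elbow_k = max(range(2, 5), key=lambda k: sse[k - 1] - 2 * sse[k] + sse[k + 1])
print(elbow_k)
```

On real data the SSE curve is rarely this clean, which is one reason the abstract compares the Elbow Method against the Gap Statistic, Silhouette Coefficient, and Canopy.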

CLUSTERING ALGORITHM RESEARCH BASED ON SELF-ORGANIZING FEATURE MAPS NETWORKS

International Journal of Pattern Recognition and Artificial Intelligence ◽

10.1142/s0218001406005149 ◽

2006 ◽

Vol 20 (07) ◽

pp. 985-1000 ◽

Cited By ~ 1

Author(s):

JUNHAO WEN ◽

HONGYAN WU ◽

ZHONGFU WU ◽

YUANYAN TANG ◽

GUANGHUI HE

Keyword(s):

Clustering Algorithm ◽

Clustering Algorithms ◽

Feature Maps ◽

Data Set ◽

Topological Similarity ◽

Standard Data ◽

Data Space ◽

New Concepts ◽

Different Characteristics ◽

Self Organizing

Self-organizing feature maps (SOFM) can learn both the distribution and the topology of the input vectors they are trained on. Based on this characteristic, we construct neural networks with a family of self-organizing feature maps to cluster the input data space. The proposed algorithm defines a novel similarity measure, topological similarity, and employs some new concepts, such as the SOFM family and UsageFactor. The clustering algorithm handles clusters of arbitrary shape and avoids the limitations of conventional clustering algorithms. We conclude the paper with several experiments on synthetic and standard data sets of different characteristics, which show good performance of the proposed algorithm.
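To make the basic SOFM learning rule concrete, here is a minimal sketch of a generic one-dimensional self-organizing map. It is not the SOFM-family algorithm proposed in the paper and does not implement topological similarity or UsageFactor. The initialisation, the exponential decay schedule, and all parameter values are assumptions chosen for illustration.

```python
import math

def best_matching_unit(weights, x):
    """Index of the unit whose weight vector is closest to x."""
    return min(range(len(weights)), key=lambda i: math.dist(weights[i], x))

def train_som(data, n_units=4, epochs=100, lr0=0.5, sigma0=1.5):
    """Train a chain of n_units with the classic Kohonen update."""
    dim = len(data[0])
    # Assumption: deterministic start with units lined up close to 0.5 in every dimension.
    weights = [[0.5 + 0.01 * i] * dim for i in range(n_units)]
    for epoch in range(epochs):
        # Learning rate and neighbourhood width both shrink over time.
        decay = math.exp(-3.0 * epoch / epochs)
        lr, sigma = lr0 * decay, sigma0 * decay
        for x in data:
            bmu = best_matching_unit(weights, x)
            for i, w in enumerate(weights):
                # Gaussian neighbourhood: units near the winner on the chain move more.
                h = math.exp(-((i - bmu) ** 2) / (2 * sigma ** 2))
                for d in range(dim):
                    w[d] += lr * h * (x[d] - w[d])
    return weights

# Two small groups of inputs (illustrative data).
data = [(0.0, 0.0), (0.1, 0.1), (0.9, 0.9), (1.0, 1.0)]
weights = train_som(data)
print(best_matching_unit(weights, (0.0, 0.0)), best_matching_unit(weights, (1.0, 1.0)))
```

After training, inputs from the two groups map to different units, and neighbouring units on the chain hold neighbouring weight values, which is the topology-preserving property the abstract relies on.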

A SELF-ORGANIZING MAP FOR MIXED CONTINUOUS AND CATEGORICAL DATA

International Journal of Computing ◽

10.47839/ijc.10.1.733 ◽

2011 ◽

pp. 24-32 ◽

Cited By ~ 1

Author(s):

Nicoleta Rogovschi ◽

Mustapha Lebbah ◽

Younès Bennani

Keyword(s):

Clustering Algorithm ◽

Clustering Algorithms ◽

Mixed Data ◽

Categorical Variables ◽

Data Sets ◽

Self Organizing Map ◽

Data Set ◽

Public Data ◽

Self Organizing

Most traditional clustering algorithms are limited to handling data sets that contain either continuous or categorical variables. However, data sets with mixed types of variables are common in the data mining field. In this paper we introduce a weighted self-organizing map for the clustering, analysis, and visualization of mixed data (continuous/binary). The weights and prototypes are learned simultaneously, ensuring an optimized data clustering. The higher a variable's weight, the more the clustering algorithm takes into account the information carried by that variable. The learning of these topological maps is combined with a weighting process for the different variables, computing weights that influence the quality of the clustering. We illustrate the power of this method with data sets taken from a public data set repository: a handwritten digit data set, the Zoo data set, and three other mixed data sets. The results show a good quality of topological ordering and homogeneous clustering.

Clustering Bathymetric Data for Electronic Navigational Charts

Journal of Navigation ◽

10.1017/s0373463316000035 ◽

2016 ◽

Vol 69 (5) ◽

pp. 1143-1153 ◽

Cited By ~ 24

Author(s):

Marta Wlodarczyk–Sielicka ◽

Andrzej Stateczny

Keyword(s):

Clustering Algorithm ◽

Search Algorithm ◽

Clustering Algorithms ◽

Data Set ◽

Bathymetric Data ◽

Large Sets ◽

Analysis Of Results ◽

Comparison And Analysis ◽

Self Organising Map ◽

Source Of Information

An electronic navigational chart is a major source of information for the navigator. The component that contributes most significantly to the safety of navigation on water is the information on the depth of an area. For the purposes of this article, the authors use data obtained by the interferometric sonar GeoSwath Plus. The data were collected in the area of the Port of Szczecin. The samples constitute large sets of data. Data reduction is a procedure to reduce the size of a data set to make it easier and more effective to analyse. The main objective of the authors is the compilation of a new reduction algorithm for bathymetric data. The clustering of data is the first part of the search algorithm. The next step consists of generalisation of bathymetric data. This article presents a comparison and analysis of results of clustering bathymetric data using the following selected methods: K-means clustering algorithm, traditional hierarchical clustering algorithms and self-organising map (using artificial neural networks).

Ontology-Based K-Means Clustering Algorithm Analysis

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.380-384.1290 ◽

2013 ◽

Vol 380-384 ◽

pp. 1290-1293

Author(s):

Qing Ju Guo ◽

Wen Tian Ji ◽

Sheng Zhong

Keyword(s):

Semantic Web ◽

Clustering Algorithm ◽

Algorithm Analysis ◽

Clustering Method ◽

Data Set ◽

Advantages And Disadvantages ◽

Research Findings ◽

Partition Clustering ◽

Improved Algorithm

Many research findings on clustering algorithms have been reported at home and abroad in recent years. After analyzing the advantages and disadvantages of the traditional partition clustering method, the K-means algorithm, this paper combines it with an ontology-based data set to establish a semantic web model. It improves the existing clustering algorithm under various constraint conditions, with the aim of demonstrating that the improved algorithm has better efficiency and accuracy under the semantic web.

Data Fusion Using a Multi-Sensor Sparse-Based Clustering Algorithm

Remote Sensing ◽

10.3390/rs12234007 ◽

2020 ◽

Vol 12 (23) ◽

pp. 4007

Author(s):

Kasra Rafiezadeh Shahi ◽

Pedram Ghamisi ◽

Behnood Rasti ◽

Robert Jackisch ◽

Paul Scheunders ◽

...

Keyword(s):

Clustering Algorithm ◽

Spatial Information ◽

Clustering Algorithms ◽

Hyperspectral Data ◽

Sensor Data ◽

Data Sets ◽

Data Types ◽

Data Set ◽

Multiple Data Sets ◽

Imaging Sensors

The increasing amount of information acquired by imaging sensors in Earth Sciences results in the availability of a multitude of complementary data (e.g., spectral, spatial, elevation) for monitoring of the Earth’s surface. Many studies were devoted to investigating the usage of multi-sensor data sets in the performance of supervised learning-based approaches at various tasks (i.e., classification and regression) while unsupervised learning-based approaches have received less attention. In this paper, we propose a new approach to fuse multiple data sets from imaging sensors using a multi-sensor sparse-based clustering algorithm (Multi-SSC). A technique for the extraction of spatial features (i.e., morphological profiles (MPs) and invariant attribute profiles (IAPs)) is applied to high spatial-resolution data to derive the spatial and contextual information. This information is then fused with spectrally rich data such as multi- or hyperspectral data. In order to fuse multi-sensor data sets a hierarchical sparse subspace clustering approach is employed. More specifically, a lasso-based binary algorithm is used to fuse the spectral and spatial information prior to automatic clustering. The proposed framework ensures that the generated clustering map is smooth and preserves the spatial structures of the scene. In order to evaluate the generalization capability of the proposed approach, we investigate its performance not only on diverse scenes but also on different sensors and data types. The first two data sets are geological data sets, which consist of hyperspectral and RGB data. The third data set is the well-known benchmark Trento data set, including hyperspectral and LiDAR data. Experimental results indicate that this novel multi-sensor clustering algorithm can provide an accurate clustering map compared to the state-of-the-art sparse subspace-based clustering algorithms.

A Dynamic Genetic Algorithm for Clustering Problems

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.411-414.1884 ◽

2013 ◽

Vol 411-414 ◽

pp. 1884-1893

Author(s):

Yong Chun Cao ◽

Ya Bin Shao ◽

Shuang Liang Tian ◽

Zheng Qi Cai

Keyword(s):

Genetic Algorithm ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Real Life ◽

Search Space ◽

Adaptive Mutation ◽

Data Sets ◽

Data Set ◽

Local Optima ◽

Clustering Problems

Because many clustering algorithms based on GAs suffer from degeneracy and easily fall into local optima, a novel dynamic genetic algorithm for clustering problems (DGA) is proposed. The algorithm adopts variable-length coding to represent individuals and performs the parallel crossover operation within subpopulations of individuals of the same length, which allows the DGA to explore the search space more effectively and to automatically obtain the proper number of clusters and the proper partition from a given data set. The algorithm also uses a dynamic crossover probability and an adaptive mutation probability, which prevent it from getting stuck at a local optimal solution. The clustering results of experiments on three artificial data sets and two real-life data sets show that the DGA achieves better performance and higher accuracy on clustering problems.

SPSM: A NEW HYBRID DATA CLUSTERING ALGORITHM FOR NONLINEAR DATA ANALYSIS

International Journal of Pattern Recognition and Artificial Intelligence ◽

10.1142/s0218001409007685 ◽

2009 ◽

Vol 23 (08) ◽

pp. 1701-1737 ◽

Cited By ~ 3

Author(s):

UREERAT WATTANACHON ◽

CHIDCHANOK LURSINSAP

Keyword(s):

Clustering Algorithm ◽

Color Image ◽

Clustering Algorithms ◽

Noisy Data ◽

Second Phase ◽

Data Sets ◽

Data Set ◽

Cluster Distance ◽

Data Points ◽

Hybrid Data

Existing clustering algorithms, such as single-link clustering, k-means, CURE, and CSM, are designed to find clusters based on predefined parameters specified by users. These algorithms may be unsuccessful if the choice of parameters is inappropriate with respect to the data set being clustered. Most of these algorithms work very well for compact and hyper-spherical clusters. In this paper, a new hybrid clustering algorithm called Self-Partition and Self-Merging (SPSM) is proposed. The SPSM algorithm partitions the input data set into several subclusters in the first phase and then removes the noisy data in the second phase. In the third phase, the normal subclusters are continuously merged to form larger clusters based on inter-cluster and intra-cluster distance criteria. The experimental results show that the SPSM algorithm efficiently handles noisy data sets and clusters data sets of arbitrary shape and differing density. Several color image examples show the versatility of the proposed method and are compared with results described in the literature for the same images. The computational complexity of the SPSM algorithm is O(N^2), where N is the number of data points.

STUDY ON ADAPTIVE PARAMETER DETERMINATION OF CLUSTER ANALYSIS IN URBAN MANAGEMENT CASES

ISPRS - International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences ◽

10.5194/isprs-archives-xlii-2-w7-1143-2017 ◽

2017 ◽

Vol XLII-2/W7 ◽

pp. 1143-1150 ◽

Cited By ~ 2

Author(s):

J. Y. Fu ◽

C. F. Jing ◽

M. Y. Du ◽

Y. L. Fu ◽

P. P. Dai

Keyword(s):

Clustering Algorithm ◽

Spatial Clustering ◽

Parameter Determination ◽

Urban Management ◽

Full Account ◽

Data Set ◽

K Value ◽

Global Parameter ◽

Parameter Adaptive

Fine-grained city management is an important way to realize the smart city. Data mining that applies spatial clustering analysis to urban management cases can be used to evaluate the deployment of urban public facilities, support policy decisions, and provide technical support for fine-grained city management. To address the problem that the density-based DBSCAN algorithm cannot determine its parameters adaptively, this paper proposes an optimized method for adaptive parameter determination based on spatial analysis. First, Ripley's K function is analyzed for the data set to adaptively determine the global parameter Eps, which means setting the maximum aggregation scale as the range of data clustering. Next, a K-D tree is used to calculate the most frequent K value among all point objects within the range of Eps, and this value is set as the clustering density to adaptively determine the global parameter MinPts. Then, the R language is used to optimize the above process and accomplish precise clustering of typical urban management cases. Experimental results based on typical urban management cases in the XiCheng District of Beijing show that the new DBSCAN clustering algorithm takes full account of the spatial and statistical characteristics of data with obvious clustering features, and has better applicability and high quality. The results of the study are helpful not only for formulating urban management policies and allocating urban management supervisors in the XiCheng District of Beijing, but also for other cities and related fields.
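The MinPts step described in this abstract can be read as "count each point's neighbours within Eps and take the most frequent count". The sketch below follows that reading, which is an interpretation and not the authors' implementation: it uses a brute-force neighbour search in place of the K-D tree, takes Eps as a given input and does not derive it from Ripley's K function, and the function name `estimate_min_pts` is hypothetical.

```python
import math
from collections import Counter

def estimate_min_pts(points, eps):
    """Most frequent number of neighbours found within eps of a point."""
    counts = []
    for i, p in enumerate(points):
        # Brute-force O(N^2) search; a K-D tree would replace this loop on large data.
        neighbours = sum(
            1 for j, q in enumerate(points) if i != j and math.dist(p, q) <= eps
        )
        counts.append(neighbours)
    # The mode of the neighbour counts is used as the density threshold.
    return Counter(counts).most_common(1)[0][0]

# Four points on a unit square plus one isolated point (illustrative data).
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
min_pts = estimate_min_pts(points, eps=1.5)
print(min_pts)
```

Whether a point counts itself among its neighbours is a convention that varies between DBSCAN implementations; this sketch excludes it.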

GRAPH BASED CLUSTERING WITH CONSTRAINTS AND ACTIVE LEARNING

Journal of Computer Science and Cybernetics ◽

10.15625/1813-9663/37/1/15773 ◽

2021 ◽

Vol 37 (1) ◽

pp. 71-89

Author(s):

Vu-Tuan Dang ◽

Viet-Vu Vu ◽

Hong-Quan Do ◽

Thi Kieu Oanh Le

Keyword(s):

Active Learning ◽

Clustering Algorithm ◽

Side Information ◽

Clustering Algorithms ◽

Real Data ◽

Data Sets ◽

Data Set ◽

Supervised Clustering ◽

Class Labels ◽

Graph Based Clustering

During the past few years, semi-supervised clustering has emerged as an interesting new direction in machine learning research. In a semi-supervised clustering algorithm, the clustering results can be significantly improved by using side information, which is available or collected from users. There are two main kinds of side information that can be learned in semi-supervised clustering algorithms: class labels, called seeds, and pairwise constraints. The first semi-supervised clustering algorithm was introduced in 2000, and since then many algorithms have been presented in the literature. However, it is not easy to use both types of side information in the same algorithm. To address this problem, this paper proposes a semi-supervised graph-based clustering algorithm, called MCSSGC, that tries to use both seeds and constraints in the clustering process. Moreover, we introduce a simple but efficient active learning method, named KMMFFQS, to collect constraints that can boost the performance of MCSSGC. To verify the effectiveness of the proposed algorithm, we conducted a series of experiments not only on real data sets from UCI, but also on a document data set used in an information extraction task for Vietnamese documents. The results show that the proposed algorithm can significantly improve the clustering process compared to some recent algorithms.
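The pairwise constraints mentioned in this abstract are commonly expressed as must-link and cannot-link pairs. As a minimal sketch of how such side information can be checked against a clustering, the hypothetical helper below counts violated constraints. It illustrates the constraint concept only and is not part of MCSSGC or KMMFFQS.

```python
def count_violations(labels, must_link, cannot_link):
    """Count constraints that a cluster assignment fails to satisfy."""
    # A must-link pair is violated when its two points land in different clusters.
    violated_ml = sum(1 for a, b in must_link if labels[a] != labels[b])
    # A cannot-link pair is violated when its two points share a cluster.
    violated_cl = sum(1 for a, b in cannot_link if labels[a] == labels[b])
    return violated_ml + violated_cl

# Cluster label of each of four points, indexed by position (illustrative data).
labels = [0, 0, 1, 1]
must_link = [(0, 1), (1, 2)]
cannot_link = [(0, 3), (2, 3)]

violations = count_violations(labels, must_link, cannot_link)
print(violations)
```

A constrained clustering method would typically try to drive this count toward zero, while an active learning step chooses which pairs are worth asking the user about.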

Data Analysis Using Representation Theory and Clustering Algorithms

WSEAS TRANSACTIONS ON COMPUTERS ◽

10.37394/23205.2020.19.38 ◽

2021 ◽

Vol 19 ◽

pp. 310-320

Author(s):

Suboh Alkhushayni ◽

Taeyoung Choi ◽

Du’a Alzaleq

Keyword(s):

Data Analysis ◽

Random Forest ◽

Hierarchical Clustering ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Optimal Number ◽

Categorical Variables ◽

Common Disease ◽

Agglomerative Hierarchical Clustering ◽

Data Set

This work aims to expand knowledge in the area of data analysis through both persistent homology and representations of directed graphs. Specifically, we examined how homology cluster groups can be analyzed using agglomerative hierarchical clustering algorithms and methods. Additionally, the Wine data set, which is available in RStudio, was analyzed using various clustering algorithms such as hierarchical clustering, K-means clustering, and PAM clustering. The goal of the analysis was to find out which clustering method is appropriate for a given numerical data set. By testing the data, we tried to find the agglomerative hierarchical clustering method that would be the optimal clustering algorithm among these three: K-means, PAM, and random forest methods. By comparing each model's accuracy value with cultivar coefficients, we came to the conclusion that K-means methods are the most helpful when working with numerical variables. On the other hand, PAM clustering and Gower with random forest are the most beneficial approaches when working with categorical variables. With proper analysis, all these tests can determine the optimal number of clustering groups for a given data set. Using the results of this project, our method can be applied to several areas, such as clinical and business settings. For example, in the clinical field, patients can be grouped by common disease, required therapy, and other factors. Additionally, in the business area, clustered groups can be formed based on marginal profit, marginal cost, or other economic indicators.