K-Means-Based Nature-Inspired Metaheuristic Algorithms for Automatic Data Clustering Problems: Recent Advances and Future Directions

K-means clustering algorithm is a partitional clustering algorithm that has been used widely in many applications for traditional clustering due to its simplicity and low computational complexity. This clustering technique depends on the user specification of the number of clusters generated from the dataset, which affects the clustering results. Moreover, random initialization of cluster centers results in its local minimal convergence. Automatic clustering is a recent approach to clustering where the specification of cluster number is not required. In automatic clustering, natural clusters existing in datasets are identified without any background information of the data objects. Nature-inspired metaheuristic optimization algorithms have been deployed in recent times to overcome the challenges of the traditional clustering algorithm in handling automatic data clustering. Some nature-inspired metaheuristics algorithms have been hybridized with the traditional K-means algorithm to boost its performance and capability to handle automatic data clustering problems. This study aims to identify, retrieve, summarize, and analyze recently proposed studies related to the improvements of the K-means clustering algorithm with nature-inspired optimization techniques. A quest approach for article selection was adopted, which led to the identification and selection of 147 related studies from different reputable academic avenues and databases. More so, the analysis revealed that although the K-means algorithm has been well researched in the literature, its superiority over several well-established state-of-the-art clustering algorithms in terms of speed, accessibility, simplicity of use, and applicability to solve clustering problems with unlabeled and nonlinearly separable datasets has been clearly observed in the study. The current study also evaluated and discussed some of the well-known weaknesses of the K-means clustering algorithm, for which the existing improvement methods were conceptualized. It is noteworthy to mention that the current systematic review and analysis of existing literature on K-means enhancement approaches presents possible perspectives in the clustering analysis research domain and serves as a comprehensive source of information regarding the K-means algorithm and its variants for the research community.

Download Full-text

A Robust Fuzzy Approach For Gene Expression Data Clustering

10.21203/rs.3.rs-547452/v1 ◽

2021 ◽

Author(s):

Meskat Jahan ◽

Mahmudul Hasan

Keyword(s):

Gene Expression ◽

Data Clustering ◽

Missing Values ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Expression Data ◽

Mining Method ◽

Cluster Number ◽

Gene Expression Data Clustering ◽

Parameter Dependent

Abstract In the big data era, clustering is one of the most popular data mining method. The majority of clustering algorithms have complications like automatic cluster number determination, poor clustering precision, inconsistent clustering of various datasets and parameter-dependent etc. A new fuzzy autonomous solution for clustering named Meskat-Mahmudul (MM) clustering algorithm proposed to overcome the complexity of parameter–free automatic cluster number determination and clustering accuracy. MM clustering algorithm finds out the exact number of clusters based on Average Silhouette method in multivariate mixed attribute dataset, including real-time gene expression dataset and dealt missing values, noise and outliers. MM Extended K-Means (MMK) clustering algorithm is an enhancement of the K-Means algorithm, which serves the purpose for automatic cluster discovery and runtime cluster placement. Several validation methods used to evaluate cluster and certify optimum cluster partitioning and perfection. Some datasets used to assess the performance of the proposed algorithms to other algorithms in terms of time complexity and clustering efficiency. Finally, MM clustering and MMK clustering algorithms found superior over conventional algorithms.

Download Full-text

A novel bidirectional clustering algorithm based on local density

Scientific Reports ◽

10.1038/s41598-021-93244-2 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Baicheng Lyu ◽

Wenhua Wu ◽

Zhiqiang Hu

Keyword(s):

Clustering Algorithm ◽

Local Density ◽

Clustering Algorithms ◽

Cluster Number ◽

Denoising Method ◽

Number Of Clusters ◽

Data Points ◽

Cutoff Distance ◽

Large Clusters ◽

Small Clusters

AbstractWith the widely application of cluster analysis, the number of clusters is gradually increasing, as is the difficulty in selecting the judgment indicators of cluster numbers. Also, small clusters are crucial to discovering the extreme characteristics of data samples, but current clustering algorithms focus mainly on analyzing large clusters. In this paper, a bidirectional clustering algorithm based on local density (BCALoD) is proposed. BCALoD establishes the connection between data points based on local density, can automatically determine the number of clusters, is more sensitive to small clusters, and can reduce the adjusted parameters to a minimum. On the basis of the robustness of cluster number to noise, a denoising method suitable for BCALoD is proposed. Different cutoff distance and cutoff density are assigned to each data cluster, which results in improved clustering performance. Clustering ability of BCALoD is verified by randomly generated datasets and city light satellite images.

Download Full-text

A Research Roadmap of Big Data Clustering Algorithms for Future Internet of Things

International Journal of Organizational and Collective Intelligence ◽

10.4018/ijoci.2019040102 ◽

2019 ◽

Vol 9 (2) ◽

pp. 16-30 ◽

Cited By ~ 1

Author(s):

Hind Bangui ◽

Mouzhi Ge ◽

Barbora Buhnova

Keyword(s):

Big Data ◽

Internet Of Things ◽

Mobile Networks ◽

Data Clustering ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Future Internet ◽

Research Challenges ◽

Initial Stage ◽

Big Data Technologies

Due to the massive data increase in different Internet of Things (IoT) domains such as healthcare IoT and Smart City IoT, Big Data technologies have been emerged as critical analytics tools for analyzing the IoT data. Among the Big Data technologies, data clustering is one of the essential approaches to process the IoT data. However, how to select a suitable clustering algorithm for IoT data is still unclear. Furthermore, since Big Data technology are still in its initial stage for different IoT domains, it is thus valuable to propose and structure the research challenges between Big Data and IoT. Therefore, this article starts by reviewing and comparing the data clustering algorithms that can be applied in IoT datasets, and then extends the discussions to a broader IoT context such as IoT dynamics and IoT mobile networks. Finally, this article identifies a set of research challenges that harvest a research roadmap for the Big Data research in IoT domains. The proposed research roadmap aims at bridging the research gaps between Big Data and various IoT contexts.

Download Full-text

An Adaptive Multiobjective Genetic Algorithm with Fuzzy c-Means for Automatic Data Clustering

Mathematical Problems in Engineering ◽

10.1155/2018/6123874 ◽

2018 ◽

Vol 2018 ◽

pp. 1-13 ◽

Cited By ~ 2

Author(s):

Ze Dong ◽

Hao Jia ◽

Miao Liu

Keyword(s):

Genetic Algorithm ◽

Fuzzy Clustering ◽

Clustering Algorithm ◽

Majority Vote ◽

Clustering Algorithms ◽

Nsga Ii ◽

Number Of Clusters ◽

Automatic Data ◽

Multiobjective Genetic Algorithm ◽

Fuzzy Clustering Method

This paper presents a fuzzy clustering method based on multiobjective genetic algorithm. The ADNSGA2-FCM algorithm was developed to solve the clustering problem by combining the fuzzy clustering algorithm (FCM) with the multiobjective genetic algorithm (NSGA-II) and introducing an adaptive mechanism. The algorithm does not need to give the number of clusters in advance. After the number of initial clusters and the center coordinates are given randomly, the optimal solution set is found by the multiobjective evolutionary algorithm. After determining the optimal number of clusters by majority vote method, the Jm value is continuously optimized through the combination of Canonical Genetic Algorithm and FCM, and finally the best clustering result is obtained. By using standard UCI dataset verification and comparing with existing single-objective and multiobjective clustering algorithms, the effectiveness of this method is proved.

Download Full-text

Effective fuzzy c-means clustering algorithms for data clustering problems

Expert Systems with Applications ◽

10.1016/j.eswa.2011.11.063 ◽

2012 ◽

Vol 39 (7) ◽

pp. 6292-6300 ◽

Cited By ~ 46

Author(s):

S.R. Kannan ◽

S. Ramathilagam ◽

P.C. Chung

Keyword(s):

Data Clustering ◽

Clustering Algorithms ◽

Fuzzy C Means ◽

Fuzzy C Means Clustering ◽

Clustering Problems

Download Full-text

Research on Parallel DBSCAN Algorithm Design Based on MapReduce

Advanced Materials Research ◽

10.4028/www.scientific.net/amr.301-303.1133 ◽

2011 ◽

Vol 301-303 ◽

pp. 1133-1138 ◽

Cited By ~ 17

Author(s):

Yan Xiang Fu ◽

Wei Zhong Zhao ◽

Hui Fang Ma

Keyword(s):

Data Clustering ◽

Large Scale ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Algorithm Design ◽

Document Retrieval ◽

Commodity Hardware ◽

Dbscan Clustering ◽

Dbscan Algorithm ◽

Parallel Clustering

Data clustering has been received considerable attention in many applications, such as data mining, document retrieval, image segmentation and pattern classification. The enlarging volumes of information emerging by the progress of technology, makes clustering of very large scale of data a challenging task. In order to deal with the problem, more researchers try to design efficient parallel clustering algorithms. In this paper, we propose a parallel DBSCAN clustering algorithm based on Hadoop, which is a simple yet powerful parallel programming platform. The experimental results demonstrate that the proposed algorithm can scale well and efficiently process large datasets on commodity hardware.

Download Full-text

A Dynamic Genetic Algorithm for Clustering Problems

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.411-414.1884 ◽

2013 ◽

Vol 411-414 ◽

pp. 1884-1893

Author(s):

Yong Chun Cao ◽

Ya Bin Shao ◽

Shuang Liang Tian ◽

Zheng Qi Cai

Keyword(s):

Genetic Algorithm ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Real Life ◽

Search Space ◽

Adaptive Mutation ◽

Data Sets ◽

Data Set ◽

Local Optima ◽

Clustering Problems

Due to many of the clustering algorithms based on GAs suffer from degeneracy and are easy to fall in local optima, a novel dynamic genetic algorithm for clustering problems (DGA) is proposed. The algorithm adopted the variable length coding to represent individuals and processed the parallel crossover operation in the subpopulation with individuals of the same length, which allows the DGA algorithm clustering to explore the search space more effectively and can automatically obtain the proper number of clusters and the proper partition from a given data set; the algorithm used the dynamic crossover probability and adaptive mutation probability, which prevented the dynamic clustering algorithm from getting stuck at a local optimal solution. The clustering results in the experiments on three artificial data sets and two real-life data sets show that the DGA algorithm derives better performance and higher accuracy on clustering problems.

Download Full-text

An Automatic Data Clustering Algorithm Based on Differential Evolution

2013 IEEE International Conference on Systems, Man, and Cybernetics ◽

10.1109/smc.2013.140 ◽

2013 ◽

Cited By ~ 2

Author(s):

Chun-Wei Tsai ◽

Chiech-An Tai ◽

Ming-Chao Chiang

Keyword(s):

Differential Evolution ◽

Data Clustering ◽

Clustering Algorithm ◽

Automatic Data

Download Full-text

A Modified Overlapping Partitioning Clustering Algorithm for Categorical Data Clustering

Bulletin of Electrical Engineering and Informatics ◽

10.11591/eei.v7i1.896 ◽

2018 ◽

Vol 7 (1) ◽

pp. 55-62

Author(s):

Mohammad Alaqtash ◽

Moayad A.Fadhil ◽

Ali F. Al-Azzawi

Keyword(s):

Categorical Data ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Numerical Data ◽

Data Representation ◽

The Past ◽

Textual Data ◽

Traditional Algorithm ◽

Clustering Problems ◽

Categorical Data Clustering

Clustering is one of the important approaches for Clustering enables the grouping of unlabeled data by partitioning data into clusters with similar patterns. Over the past decades, many clustering algorithms have been developed for various clustering problems. An overlapping partitioning clustering (OPC) algorithm can only handle numerical data. Hence, novel clustering algorithms have been studied extensively to overcome this issue. By increasing the number of objects belonging to one cluster and distance between cluster centers, the study aimed to cluster the textual data type without losing the main functions. The proposed study herein included over twenty newsgroup dataset, which consisted of approximately 20000 textual documents. By introducing some modifications to the traditional algorithm, an acceptable level of homogeneity and completeness of clusters were generated. Modifications were performed on the pre-processing phase and data representation, along with the number methods which influence the primary function of the algorithm. Subsequently, the results were evaluated and compared with the k-means algorithm of the training and test datasets. The results indicated that the modified algorithm could successfully handle the categorical data and produce satisfactory clusters.

Download Full-text

A Quantitative Discriminant Method of Elbow Point for the Optimal Number of Clusters in Clustering Algorithm

10.21203/rs.3.rs-58011/v3 ◽

2021 ◽

Author(s):

Congming Shi ◽

Bingtao Wei ◽

Shoulin Wei ◽

Wen Wang ◽

Hai Liu ◽

...

Keyword(s):

Clustering Algorithm ◽

Clustering Algorithms ◽

Optimal Number ◽

Machine Learning Method ◽

Cluster Number ◽

Number Of Clusters ◽

Public Dataset ◽

Optimal Cluster ◽

Better Than ◽

Optimal Number Of Clusters

Abstract Clustering, a traditional machine learning method, plays a significant role in data analysis. Most clustering algorithms depend on a predetermined exact number of clusters, whereas, in practice, clusters are usually unpredictable. Although the Elbow method is one of the most commonly used methods to discriminate the optimal cluster number, the discriminant of the number of clusters depends on the manual identification of the elbow points on the visualization curve. Thus, experienced analysts cannot clearly identify the elbow point from the plotted curve when the plotted curve is fairly smooth. To solve this problem, a new elbow point discriminant method is proposed to yield a statistical metric that estimates an optimal cluster number when clustering on a dataset. First, the average degree of distortion obtained by the Elbow method is normalized to the range of 0 to 10. Second, the normalized results are used to calculate the cosine of intersection angles between elbow points. Third, this calculated cosine of intersection angles and the arccosine theorem are used to compute the intersection angles between elbow points. Finally, the index of the above computed minimal intersection angles between elbow points is used as the estimated potential optimal cluster number. The experimental results based on simulated datasets and a well-known public dataset (Iris Dataset) demonstrated that the estimated optimal cluster number obtained by our newly proposed method is better than the widely used Silhouette method.

Download Full-text