scholarly journals An Empirical Study of Cluster-Based MOEA/D Bare Bones PSO for Data Clustering

Algorithms ◽  
2021 ◽  
Vol 14 (11) ◽  
pp. 338
Author(s):  
Daphne Teck Ching Lai ◽  
Yuji Sato

Previously, cluster-based multi or many objective function techniques were proposed to reduce the Pareto set. Recently, researchers proposed such techniques to find better solutions in the objective space to solve engineering problems. In this work, we applied a cluster-based approach for solution selection in a multiobjective evolutionary algorithm based on decomposition with bare bones particle swarm optimization for data clustering and investigated its clustering performance. In our previous work, we found that MOEA/D with BBPSO performed the best on 10 datasets. Here, we extend this work applying a cluster-based approach tested on 13 UCI datasets. We compared with six multiobjective evolutionary clustering algorithms from the existing literature and ten from our previous work. The proposed technique was found to perform well on datasets highly overlapping clusters, such as CMC and Sonar. So far, we found only one work that used cluster-based MOEA for clustering data, the hierarchical topology multiobjective clustering algorithm. All other cluster-based MOEA found were used to solve other problems that are not data clustering problems. By clustering Pareto solutions and evaluating new candidates against the found cluster representatives, local search is introduced in the solution selection process within the objective space, which can be effective on datasets with highly overlapping clusters. This is an added layer of search control in the objective space. The results are found to be promising, prompting different areas of future research which are discussed, including the study of its effects with an increasing number of clusters as well as with other objective functions.

2011 ◽  
Vol 291-294 ◽  
pp. 344-348
Author(s):  
Lin Lin ◽  
Shu Yan ◽  
Yi Nian

The hierarchical topology of wireless sensor networks can effectively reduce the consumption in communication. Clustering algorithm is the foundation to realize herarchical structure, so it has been extensive researched. On the basis of Leach algorithm, a distance density based clustering algorithm (DDBC) is proposed, considering synthetically the distribution density of around nodes and the remaining energy factors of the node to dynamically banlance energy usage of nodes when selecting cluster heads. We analyzed the performance of DDBC through compared with the existing other clustering algorithms in simulation experiment. Results show that the proposed method can generare stable quantity cluster heads and banlance the energy load effectively.


Author(s):  
Hind Bangui ◽  
Mouzhi Ge ◽  
Barbora Buhnova

Due to the massive data increase in different Internet of Things (IoT) domains such as healthcare IoT and Smart City IoT, Big Data technologies have been emerged as critical analytics tools for analyzing the IoT data. Among the Big Data technologies, data clustering is one of the essential approaches to process the IoT data. However, how to select a suitable clustering algorithm for IoT data is still unclear. Furthermore, since Big Data technology are still in its initial stage for different IoT domains, it is thus valuable to propose and structure the research challenges between Big Data and IoT. Therefore, this article starts by reviewing and comparing the data clustering algorithms that can be applied in IoT datasets, and then extends the discussions to a broader IoT context such as IoT dynamics and IoT mobile networks. Finally, this article identifies a set of research challenges that harvest a research roadmap for the Big Data research in IoT domains. The proposed research roadmap aims at bridging the research gaps between Big Data and various IoT contexts.


2011 ◽  
Vol 301-303 ◽  
pp. 1133-1138 ◽  
Author(s):  
Yan Xiang Fu ◽  
Wei Zhong Zhao ◽  
Hui Fang Ma

Data clustering has been received considerable attention in many applications, such as data mining, document retrieval, image segmentation and pattern classification. The enlarging volumes of information emerging by the progress of technology, makes clustering of very large scale of data a challenging task. In order to deal with the problem, more researchers try to design efficient parallel clustering algorithms. In this paper, we propose a parallel DBSCAN clustering algorithm based on Hadoop, which is a simple yet powerful parallel programming platform. The experimental results demonstrate that the proposed algorithm can scale well and efficiently process large datasets on commodity hardware.


2018 ◽  
Vol 27 (3) ◽  
pp. 317-329 ◽  
Author(s):  
Satish Chander ◽  
P. Vijaya ◽  
Praveen Dhyani

Abstract The progress of databases in fields such as medical, business, education, marketing, etc., is colossal because of the developments in information technology. Knowledge discovery from such concealed bulk databases is a tedious task. For this, data mining is one of the promising solutions and clustering is one of its applications. The clustering process groups the data objects related to each other in a similar cluster and diverse objects in another cluster. The literature presents many clustering algorithms for data clustering. Optimisation-based clustering algorithm is one of the recently developed algorithms for the clustering process to discover the optimal cluster based on the objective function. In our previous method, direct operative fractional lion optimisation algorithm was proposed for data clustering. In this paper, we designed a new clustering algorithm called adaptive decisive operative fractional lion (ADOFL) optimisation algorithm based on multi-kernel function. Moreover, a new fitness function called multi-kernel WL index is proposed for the selection of the best centroid point for clustering. The experimentation of the proposed ADOFL algorithm is carried out over two benchmarked datasets, Iris and Wine. The performance of the proposed ADOFL algorithm is validated over existing clustering algorithms such as particle swarm clustering (PSC) algorithm, modified PSC algorithm, lion algorithm, fractional lion algorithm, and DOFL. The result shows that the maximum clustering accuracy of 79.51 is obtained by the proposed method in data clustering.


2020 ◽  
Vol 309 ◽  
pp. 03008 ◽  
Author(s):  
Zhiguo Liu ◽  
Changqing Ren ◽  
Wenzhu Cai

In the process of identifying unknown protocol of bit stream, the clustering of data sets of bit stream in the protocol is the basis of further identifying unknown protocol. Therefore, on the one hand, this paper analyses the classical clustering algorithms used in unknown protocol recognition from three perspectives: the whole process of clustering analysis, similarity measurement and clustering result evaluation. On the other hand, the development trend of clustering algorithm in unknown protocol recognition is summarized, and other problems in unknown protocol recognition can be solved by clustering algorithm according to the characteristics of bit stream data set, which can provide reference for future research work. Finally, the challenges faced by the Research Institute and the prospects for future work are given.


Data clustering is inevitable for crucial data analytic based applications. Though data clustering algorithms are capacious in the literature, there is always a room for efficient data clustering algorithms. This is due to the uncontrollable growth of data and its utilization. The data clustering may consider any of the data formats such as text, images, audio, video and so on. Due to the increasing utilization trend of digital images, this work intends to present a data clustering algorithm for digital images, which is based colour distance and Improvised DBSCAN (IDBSCAN) algorithm. The proposed IDBSCAN completely weeds out the annoying process of setting the initial parameters such as 𝜺 and 𝒎𝒊𝒏𝒑𝒕𝒔 by setting them automatically. The performance of the proposed work is analysed in terms of clustering accuracy, precision, recall, Fmeasure and time consumption rates. The proposed work outperforms the existing approaches with reasonable time consumption.


2021 ◽  
Vol 11 (4) ◽  
pp. 319-330
Author(s):  
Artur Starczewski ◽  
Magdalena M. Scherer ◽  
Wojciech Książek ◽  
Maciej Dębski ◽  
Lipo Wang

Abstract Data clustering is an important method used to discover naturally occurring structures in datasets. One of the most popular approaches is the grid-based concept of clustering algorithms. This kind of method is characterized by a fast processing time and it can also discover clusters of arbitrary shapes in datasets. These properties allow these methods to be used in many different applications. Researchers have created many versions of the clustering method using the grid-based approach. However, the key issue is the right choice of the number of grid cells. This paper proposes a novel grid-based algorithm which uses a method for an automatic determining of the number of grid cells. This method is based on the kdist function which computes the distance between each element of a dataset and its kth nearest neighbor. Experimental results have been obtained for several different datasets and they confirm a very good performance of the newly proposed method.


Author(s):  
Xianjin Shi ◽  
Wanwan Wang ◽  
Chongsheng Zhang

Over the past few decades, a great many data clustering algorithms have been developed, including K-Means, DBSCAN, Bi-Clustering and Spectral clustering, etc. In recent years, two new data  clustering algorithms have been proposed, which are affinity propagation (AP, 2007) and density peak based clustering (DP, 2014). In this work, we empirically compare the performance of these two latest data clustering algorithms with state-of-the-art, using 6 external and 2 internal clustering validation metrics. Our experimental results on 16 public datasets show that, the two latest clustering algorithms, AP and DP, do not always outperform DBSCAN. Therefore, to find the best clustering algorithm for a specific dataset, all of AP, DP and DBSCAN should be considered.  Moreover, we find that the comparison of different clustering algorithms is closely related to the clustering evaluation metrics adopted. For instance, when using the Silhouette clustering validation metric, the overall performance of K-Means is as good as AP and DP. This work has important reference values for researchers and engineers who need to select appropriate clustering algorithms for their specific applications.


Author(s):  
Ahmed M. Serdah ◽  
Wesam M. Ashour

Abstract Traditional clustering algorithms are no longer suitable for use in data mining applications that make use of large-scale data. There have been many large-scale data clustering algorithms proposed in recent years, but most of them do not achieve clustering with high quality. Despite that Affinity Propagation (AP) is effective and accurate in normal data clustering, but it is not effective for large-scale data. This paper proposes two methods for large-scale data clustering that depend on a modified version of AP algorithm. The proposed methods are set to ensure both low time complexity and good accuracy of the clustering method. Firstly, a data set is divided into several subsets using one of two methods random fragmentation or K-means. Secondly, subsets are clustered into K clusters using K-Affinity Propagation (KAP) algorithm to select local cluster exemplars in each subset. Thirdly, the inverse weighted clustering algorithm is performed on all local cluster exemplars to select well-suited global exemplars of the whole data set. Finally, all the data points are clustered by the similarity between all global exemplars and each data point. Results show that the proposed clustering method can significantly reduce the clustering time and produce better clustering result in a way that is more effective and accurate than AP, KAP, and HAP algorithms.


2018 ◽  
Vol 20 (6) ◽  
pp. 2316-2326 ◽  
Author(s):  
Taiyun Kim ◽  
Irene Rui Chen ◽  
Yingxin Lin ◽  
Andy Yi-Yang Wang ◽  
Jean Yee Hwa Yang ◽  
...  

Abstract Advances in high-throughput sequencing on single-cell gene expressions [single-cell RNA sequencing (scRNA-seq)] have enabled transcriptome profiling on individual cells from complex samples. A common goal in scRNA-seq data analysis is to discover and characterise cell types, typically through clustering methods. The quality of the clustering therefore plays a critical role in biological discovery. While numerous clustering algorithms have been proposed for scRNA-seq data, fundamentally they all rely on a similarity metric for categorising individual cells. Although several studies have compared the performance of various clustering algorithms for scRNA-seq data, currently there is no benchmark of different similarity metrics and their influence on scRNA-seq data clustering. Here, we compared a panel of similarity metrics on clustering a collection of annotated scRNA-seq datasets. Within each dataset, a stratified subsampling procedure was applied and an array of evaluation measures was employed to assess the similarity metrics. This produced a highly reliable and reproducible consensus on their performance assessment. Overall, we found that correlation-based metrics (e.g. Pearson’s correlation) outperformed distance-based metrics (e.g. Euclidean distance). To test if the use of correlation-based metrics can benefit the recently published clustering techniques for scRNA-seq data, we modified a state-of-the-art kernel-based clustering algorithm (SIMLR) using Pearson’s correlation as a similarity measure and found significant performance improvement over Euclidean distance on scRNA-seq data clustering. These findings demonstrate the importance of similarity metrics in clustering scRNA-seq data and highlight Pearson’s correlation as a favourable choice. Further comparison on different scRNA-seq library preparation protocols suggests that they may also affect clustering performance. Finally, the benchmarking framework is available at http://www.maths.usyd.edu.au/u/SMS/bioinformatics/software.html.


Sign in / Sign up

Export Citation Format

Share Document