Parallel Implementation of Improved K-Means Based on a Cloud Platform

Abstract Traditional clustering algorithms are no longer suitable for use in data mining applications that make use of large-scale data. There have been many large-scale data clustering algorithms proposed in recent years, but most of them do not achieve clustering with high quality. Despite that Affinity Propagation (AP) is effective and accurate in normal data clustering, but it is not effective for large-scale data. This paper proposes two methods for large-scale data clustering that depend on a modified version of AP algorithm. The proposed methods are set to ensure both low time complexity and good accuracy of the clustering method. Firstly, a data set is divided into several subsets using one of two methods random fragmentation or K-means. Secondly, subsets are clustered into K clusters using K-Affinity Propagation (KAP) algorithm to select local cluster exemplars in each subset. Thirdly, the inverse weighted clustering algorithm is performed on all local cluster exemplars to select well-suited global exemplars of the whole data set. Finally, all the data points are clustered by the similarity between all global exemplars and each data point. Results show that the proposed clustering method can significantly reduce the clustering time and produce better clustering result in a way that is more effective and accurate than AP, KAP, and HAP algorithms.

Download Full-text

The Research on Large Scale Data Set Clustering Algorithm Based on Tag Set

Communications in Computer and Information Science - Computational Intelligence and Intelligent Systems ◽

10.1007/978-981-10-0356-1_38 ◽

2016 ◽

pp. 365-372

Author(s):

Qiang Chen

Keyword(s):

Large Scale ◽

Clustering Algorithm ◽

Data Set ◽

Large Scale Data ◽

Scale Data

Download Full-text

Part Priority Clustering Algorithm for Large-Scale Data Set

2013 5th International Conference on Intelligent Human-Machine Systems and Cybernetics ◽

10.1109/ihmsc.2013.100 ◽

2013 ◽

Author(s):

Zhihao Yin ◽

Bencheng Yu ◽

Zhifeng Wang ◽

Wang Ran

Keyword(s):

Large Scale ◽

Clustering Algorithm ◽

Data Set ◽

Large Scale Data ◽

Scale Data

Download Full-text

MapReduce-Based Crow Search-Adopted Partitional Clustering Algorithms for Handling Large-Scale Data

International Journal of Cognitive Informatics and Natural Intelligence ◽

10.4018/ijcini.20211001.oa32 ◽

2021 ◽

Vol 15 (4) ◽

pp. 1-23

Author(s):

Karthikeyani Visalakshi N. ◽

Shanthi S. ◽

Lakshmi K.

Keyword(s):

Large Scale ◽

Parallel Implementation ◽

Search Algorithm ◽

Clustering Algorithms ◽

Computation Time ◽

Global Optimum ◽

Data Mining Technique ◽

Local Optima ◽

Large Scale Data ◽

Scale Data

Cluster analysis is the prominent data mining technique in knowledge discovery and it discovers the hidden patterns from the data. The K-Means, K-Modes and K-Prototypes are partition based clustering algorithms and these algorithms select the initial centroids randomly. Because of its random selection of initial centroids, these algorithms provide the local optima in solutions. To solve these issues, the strategy of Crow Search algorithm is employed with these algorithms to obtain the global optimum solution. With the advances in information technology, the size of data increased in a drastic manner from terabytes to petabytes. To make proposed algorithms suitable to handle these voluminous data, the phenomena of parallel implementation of these clustering algorithms with Hadoop Mapreduce framework. The proposed algorithms are experimented with large scale data and the results are compared in terms of cluster evaluation measures and computation time with the number of nodes.

Download Full-text

MapReduce Based Crow Search Adopted Partitional Clustering Algorithms For Handling Large Scale Data

International Journal of Cognitive Informatics and Natural Intelligence ◽

10.4018/ijcini.20211001oa19 ◽

2021 ◽

Vol 15 (4) ◽

pp. 0-0

Keyword(s):

Large Scale ◽

Parallel Implementation ◽

Search Algorithm ◽

Clustering Algorithms ◽

Computation Time ◽

Global Optimum ◽

Data Mining Technique ◽

Local Optima ◽

Large Scale Data ◽

Scale Data

Cluster analysis is the prominent data mining technique in knowledge discovery and it discovers the hidden patterns from the data. The K-Means, K-Modes and K-Prototypes are partition based clustering algorithms and these algorithms select the initial centroids randomly. Because of its random selection of initial centroids, these algorithms provide the local optima in solutions. To solve these issues, the strategy of Crow Search algorithm is employed with these algorithms to obtain the global optimum solution. With the advances in information technology, the size of data increased in a drastic manner from terabytes to petabytes. To make proposed algorithms suitable to handle these voluminous data, the phenomena of parallel implementation of these clustering algorithms with Hadoop Mapreduce framework. The proposed algorithms are experimented with large scale data and the results are compared in terms of cluster evaluation measures and computation time with the number of nodes.

Download Full-text

Study of Map-Reduce over Hadoop Based Cloud Computing Environment

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.509.175 ◽

2014 ◽

Vol 509 ◽

pp. 175-181

Author(s):

Wu Min Pan ◽

Li Bai Ha

Keyword(s):

Cloud Computing ◽

Data Processing ◽

Large Scale ◽

Programming Model ◽

Sql Server ◽

Map Reduce ◽

Data Set ◽

Large Scale Data ◽

Large Scale Data Processing ◽

Scale Data

Popularity for the term Cloud-Computing has been increasing in recent years. In addition to the SQL technique, Map-Reduce, a programming model that realizes implementing large-scale data processing, has been a hot topic that is widely discussed through many studies. Many real-world tasks such as data processing for search engines can be parallel-implemented through a simple interface with two functions called Map and Reduce. We focus on comparing the performance of the Hadoop implementation of Map-Reduce with SQL Server through simulations. Hadoop can complete the same query faster than SQL Server. On the other hand, some concerned factors are also tested to see whether they would affect the performance for Hadoop or not. In fact more machines included for data processing can make Hadoop achieve a better performance, especially for a large-scale data set.

Download Full-text

Affinity propagation clustering algorithm based on large-scale data-set

International Journal of Computers and Applications ◽

10.1080/1206212x.2018.1425184 ◽

2018 ◽

Vol 40 (3) ◽

pp. 1-6 ◽

Cited By ~ 4

Author(s):

Limin Wang ◽

Kaiyue Zheng ◽

Xing Tao ◽

Xuming Han

Keyword(s):

Large Scale ◽

Clustering Algorithm ◽

Affinity Propagation ◽

Data Set ◽

Large Scale Data ◽

Affinity Propagation Clustering ◽

Scale Data

Download Full-text

Probability of large-scale data set EM clustering algorithms based on partial information constraints

Proceedings of the 2016 2nd Workshop on Advanced Research and Technology in Industry Applications ◽

10.2991/wartia-16.2016.346 ◽

2016 ◽

Author(s):

Xiao yan Liu

Keyword(s):

Large Scale ◽

Partial Information ◽

Clustering Algorithms ◽

Data Set ◽

Large Scale Data ◽

Information Constraints ◽

Em Clustering ◽

Scale Data

Download Full-text

Landmark FN-DBSCAN: An Efficient Density-Based Clustering Algorithm with Fuzzy Neighborhood

Journal of Advanced Computational Intelligence and Intelligent Informatics ◽

10.20965/jaciii.2013.p0060 ◽

2013 ◽

Vol 17 (1) ◽

pp. 60-73

Author(s):

Hao Liu ◽

◽

Satoshi Oyama ◽

Masahito Kurihara ◽

Haruhiko Sato

Keyword(s):

Time Complexity ◽

Large Scale ◽

Clustering Algorithm ◽

Data Sets ◽

Clustering Methods ◽

Data Set ◽

Large Scale Data ◽

Density Based Clustering ◽

Scale Data ◽

Large Scale Data Sets

Clustering is an important tool for data analysis and many clustering techniques have been proposed over the past years. Among them are density-based clustering methods, which have several benefits such as the number of clusters is not required before carrying out clustering; the detected clusters can be represented in an arbitrary shape and outliers can be detected and removed. Recently, the density-based algorithms were extended with the fuzzy set theory, which has made these algorithm more robust. However, the density-based clustering algorithms usually require a time complexity ofO(n2) wherenis the number of data in the data set, implying that they are not suitable to work with large scale data sets. In this paper, a novel clustering algorithm called landmark fuzzy neighborhood DBSCAN (landmark FN-DBSCAN) is proposed. The concept, landmark, is used to represent a subset of the input data set which makes the algorithm efficient on large scale data sets. We give a theoretical analysis on time complexity and space complexity, which shows both of them are linear to the size of the data set. The experiments show that the landmark FN-DBSCAN is much faster than FN-DBSCAN and provides a very good quality of clustering.

Download Full-text

ProGen:Provenance database generator for large-scale data set

Journal of Computer Applications ◽

10.3724/sp.j.1087.2008.02737 ◽

2009 ◽

Vol 28 (11) ◽

pp. 2737-2740

Author(s):

Xiao ZHANG ◽

Shan WANG ◽

Na LIAN

Keyword(s):

Large Scale ◽

Data Set ◽

Large Scale Data ◽

Scale Data

Download Full-text