A survey on parallel clustering algorithms for Big Data

PARSUC: A Parallel Subsampling-Based Method for Clustering Remote Sensing Big Data

Sensors ◽

10.3390/s19153438 ◽

2019 ◽

Vol 19 (15) ◽

pp. 3438 ◽

Cited By ~ 3

Author(s):

Xia ◽

Huang ◽

Li ◽

Zhou ◽

Zhang

Keyword(s):

Remote Sensing ◽

Big Data ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Image Data ◽

Data Partitioning ◽

Data Mining Technique ◽

Mining Technique ◽

Hadoop Platform ◽

Parallel Clustering

Remote sensing big data (RSBD) is generally characterized by huge volume, diversity, and high dimensionality. Mining hidden information from RSBD for different applications imposes significant computational challenges. Clustering is an important data mining technique widely used in processing and analyzing remote sensing imagery. However, conventional clustering algorithms are designed for relatively small datasets; when applied to RSBD, they are generally too slow or inefficient for practical use. In this paper, we propose a parallel subsampling-based clustering (PARSUC) method that improves both the efficiency and the accuracy of RSBD clustering. PARSUC leverages a novel subsampling-based data partitioning (SubDP) method to realize three-step parallel clustering, effectively solving a notable performance bottleneck of existing parallel clustering algorithms: they need numerous repeated calculations to reach a reasonable result. Furthermore, we propose a centroid filtering algorithm (CFA) to eliminate subsampling errors and to guarantee the accuracy of the clustering results. PARSUC was implemented on a Hadoop platform using the MapReduce parallel model. Experiments on massive remote sensing images of different sizes showed that PARSUC (1) provided much better accuracy than conventional remote sensing clustering algorithms when handling larger image data; (2) scaled notably well as computing nodes were added; and (3) took much less time than the existing parallel clustering algorithm when handling RSBD.
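The general idea of clustering subsamples in parallel and then merging the local results can be illustrated with a minimal sketch. This is not the authors' SubDP or CFA, whose details the abstract does not give: the function names (`kmeans`, `subsample_cluster`), the strided random partitioning, and the merge step (running k-means again on the pooled local centroids) are all assumptions made for illustration, and a thread pool stands in for the Hadoop cluster.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def dist2(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns a list of k centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: dist2(p, centroids[i]))
            groups[j].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [mean(g) if g else centroids[i] for i, g in enumerate(groups)]
    return centroids


def subsample_cluster(points, k, n_parts=4, seed=0):
    """Hypothetical three-step scheme: partition, cluster locally, merge."""
    # Step 1: shuffle and split into n_parts random subsamples (assumed scheme).
    rng = random.Random(seed)
    shuffled = points[:]
    rng.shuffle(shuffled)
    parts = [shuffled[i::n_parts] for i in range(n_parts)]
    # Step 2: cluster each subsample independently; threads stand in for nodes.
    with ThreadPoolExecutor() as pool:
        local = list(pool.map(lambda part: kmeans(part, k, seed=seed), parts))
    # Step 3: merge by clustering the pooled local centroids (assumed merge rule).
    candidates = [c for cs in local for c in cs]
    centroids = kmeans(candidates, k, seed=seed)
    labels = [min(range(k), key=lambda i: dist2(p, centroids[i])) for p in points]
    return centroids, labels
```

Each subsample must contain at least `k` points for the local step to run. Because every worker touches only its own subsample, the expensive iterations happen on small inputs, and only the short list of local centroids is combined at the end.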

A Performance Comparison of Big Data Processing Platform Based on Parallel Clustering Algorithms

Procedia Computer Science ◽

10.1016/j.procs.2018.10.228 ◽

2018 ◽

Vol 139 ◽

pp. 127-135 ◽

Cited By ~ 2

Author(s):

Mo Hai ◽

Yuejing Zhang ◽

Haifeng Li

Keyword(s):

Big Data ◽

Data Processing ◽

Clustering Algorithms ◽

Performance Comparison ◽

Big Data Processing ◽

Processing Platform ◽

Parallel Clustering ◽

A Performance

A Survey of Parallel Clustering Algorithms Based on Spark

Scientific Programming ◽

10.1155/2020/8884926 ◽

2020 ◽

Vol 2020 ◽

pp. 1-12

Author(s):

Wen Xiao ◽

Juan Hu

Keyword(s):

Machine Learning ◽

Image Processing ◽

Big Data ◽

Information Retrieval ◽

Social Network ◽

Clustering Algorithms ◽

Future Research ◽

Parallel Design ◽

Learning Tasks ◽

Parallel Clustering

Clustering is one of the most important unsupervised machine learning tasks and is widely used in information retrieval, social network analysis, image processing, and other fields. With the explosive growth of data, classical clustering algorithms cannot meet the requirements of clustering big data. Spark is one of the most popular parallel processing platforms for big data, and researchers have proposed many parallel clustering algorithms based on it. In this paper, the existing Spark-based parallel clustering algorithms are classified and summarized, the parallel design framework of each kind of algorithm is discussed, and, after a comparison of the different kinds of algorithms, directions for future research are discussed.
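As a rough illustration of what a parallel design framework for a partition-based algorithm can look like, the sketch below expresses one k-means iteration as a map phase over data partitions followed by a reduce phase. It is plain Python rather than Spark code, it is not taken from the survey, and the function names (`map_partition`, `reduce_partials`, `kmeans_step`) are hypothetical.

```python
def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def map_partition(partition, centroids):
    """Map phase: per-partition coordinate sums and counts, keyed by cluster index."""
    partial = {}
    for point in partition:
        j = nearest(point, centroids)
        sums, count = partial.get(j, ((0.0,) * len(point), 0))
        partial[j] = (tuple(s + p for s, p in zip(sums, point)), count + 1)
    return partial


def reduce_partials(partials, centroids):
    """Reduce phase: merge the partial sums and divide to get new centroids."""
    total = {}
    for partial in partials:
        for j, (sums, count) in partial.items():
            acc, n = total.get(j, ((0.0,) * len(sums), 0))
            total[j] = (tuple(a + s for a, s in zip(acc, sums)), n + count)
    # A cluster that received no points keeps its previous centroid.
    return [
        tuple(s / total[j][1] for s in total[j][0]) if j in total else c
        for j, c in enumerate(centroids)
    ]


def kmeans_step(partitions, centroids):
    """One k-means iteration; each map_partition call could run on a separate worker."""
    return reduce_partials([map_partition(p, centroids) for p in partitions], centroids)
```

Only the small per-cluster sums and counts cross partition boundaries, which is why this pattern maps naturally onto platforms that move computation to the data.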

An Introduction to Clustering Algorithms in Big Data

Encyclopedia of Information Science and Technology, Fifth Edition - Advances in Information Quality and Management ◽

10.4018/978-1-7998-3479-3.ch040 ◽

2021 ◽

pp. 559-576

Author(s):

Rajit Nair ◽

Amit Bhagat

Keyword(s):

Big Data ◽

Single Machine ◽

Data Clustering ◽

Clustering Algorithms ◽

Time Limit ◽

Computation Cost ◽

Different Types ◽

Clustering Approach ◽

Future Path ◽

Parallel Clustering

In big data, clustering is the process through which analysis is performed. Because the data is so large, clustering it is very difficult. Big data is typically measured in petabytes and zettabytes, and forming clusters at that scale carries a high computation cost. In this chapter, the authors show how clustering can be performed on big data and describe the different types of clustering approach. The challenge in clustering is to find observations within the time limit. The chapter covers single-machine clustering and multiple-machine clustering, the latter including parallel clustering, and also outlines a possible future path toward more advanced clustering algorithms.

Survey on Partition based Clustering Algorithms in Big Data

International Journal of Computer Sciences and Engineering ◽

10.26438/ijcse/v5i12.323325 ◽

2017 ◽

Vol 5 (12) ◽

pp. 323-325

Author(s):

E. Mahima Jane ◽

E. George Dharma Prakash Raj

Keyword(s):

Big Data ◽

Clustering Algorithms

A review on density-based clustering algorithms for big data analysis

2017 International Conference on I-SMAC (IoT in Social, Mobile, Analytics and Cloud) (I-SMAC) ◽

10.1109/i-smac.2017.8058322 ◽

2017 ◽

Cited By ~ 4

Author(s):

K. Shyam Sunder Reddy ◽

C. Shoba Bindu

Keyword(s):

Big Data ◽

Data Analysis ◽

Clustering Algorithms ◽

Big Data Analysis ◽

Density Based Clustering

Parallel Clustering Algorithms for Image Processing on Multi-core CPUs

2008 International Conference on Computer Science and Software Engineering ◽

10.1109/csse.2008.1018 ◽

2008 ◽

Cited By ~ 11

Author(s):

Honggang Wang ◽

Jide Zhao ◽

Hongguang Li ◽

Jianguo Wang

Keyword(s):

Image Processing ◽

Clustering Algorithms ◽

Parallel Clustering

A Parallel Clustering Algorithm for Power Big Data Analysis

Communications in Computer and Information Science - Parallel Architecture, Algorithm and Programming ◽

10.1007/978-981-10-6442-5_51 ◽

2017 ◽

pp. 533-540

Author(s):

Xiangjun Meng ◽

Liang Chen ◽

Yidong Li

Keyword(s):

Big Data ◽

Data Analysis ◽

Clustering Algorithm ◽

Big Data Analysis ◽

Parallel Clustering

Big Data Mining Based on Computational Intelligence and Fuzzy Clustering

Handbook of Research on Trends and Future Directions in Big Data and Web Intelligence - Advances in Data Mining and Database Management ◽

10.4018/978-1-4666-8505-5.ch007 ◽

2015 ◽

pp. 130-148 ◽

Cited By ~ 2

Author(s):

Usman Akhtar ◽

Mehdi Hassan

Keyword(s):

Big Data ◽

Computational Intelligence ◽

Clustering Algorithms ◽

Heterogeneous Data ◽

Computational Time ◽

Statistical Features ◽

Feature Sets ◽

Clustering Quality ◽

The Given ◽

Different Sources

The availability of a huge amount of heterogeneous data from different sources on the Internet has been termed the problem of Big Data. Clustering is widely used as a knowledge discovery tool that separates data into manageable parts. There is a need for clustering algorithms that scale to big databases. In this chapter, we explore various schemes that have been used to tackle big databases. Statistical features are extracted from the given dataset, redundant and irrelevant features are eliminated, and the most important and relevant features are selected by genetic algorithms (GA). Clustering with reduced feature sets requires less computational time and fewer resources. Experiments performed on standard datasets indicate that clustering based on the proposed scheme offers high accuracy. Various quality measures were computed to check the clustering quality, and the results of the proposed methodology were observed to improve significantly, indicating that the proposed technique offers high-quality clustering.

Big Data Mining Based on Computational Intelligence and Fuzzy Clustering

Web Services ◽

10.4018/978-1-5225-7501-6.ch024 ◽

2019 ◽

pp. 413-430

Author(s):

Usman Akhtar ◽

Mehdi Hassan

Keyword(s):

Big Data ◽

Computational Intelligence ◽

Clustering Algorithms ◽

Heterogeneous Data ◽

Computational Time ◽

Statistical Features ◽

Feature Sets ◽

Clustering Quality ◽

The Given ◽

Different Sources

The availability of a huge amount of heterogeneous data from different sources on the Internet has been termed the problem of Big Data. Clustering is widely used as a knowledge discovery tool that separates data into manageable parts. There is a need for clustering algorithms that scale to big databases. In this chapter, we explore various schemes that have been used to tackle big databases. Statistical features are extracted from the given dataset, redundant and irrelevant features are eliminated, and the most important and relevant features are selected by genetic algorithms (GA). Clustering with reduced feature sets requires less computational time and fewer resources. Experiments performed on standard datasets indicate that clustering based on the proposed scheme offers high accuracy. Various quality measures were computed to check the clustering quality, and the results of the proposed methodology were observed to improve significantly, indicating that the proposed technique offers high-quality clustering.
