A Robust Distributed Big Data Clustering-based on Adaptive Density Partitioning using Apache Spark

Symmetry ◽

10.3390/sym10080342 ◽

2018 ◽

Vol 10 (8) ◽

pp. 342 ◽

Cited By ~ 3

Author(s):

Behrooz Hosseini ◽

Kourosh Kiani

Keyword(s):

Big Data ◽

Data Clustering ◽

Local Density ◽

Apache Spark ◽

Locality Sensitive Hashing ◽

Weighted Averaging ◽

Noise Robustness ◽

Locality Preservation ◽

Density Peaks ◽

Clustering Approach

Unsupervised machine learning and knowledge discovery from large-scale datasets have recently attracted considerable research interest. This paper proposes a distributed big data clustering approach based on adaptive density estimation. The method is built on the Apache Spark framework and tested on several widely used datasets. In the first step, the input data is divided into partitions using a Bayesian type of Locality Sensitive Hashing (LSH). Partitioning makes the processing fully parallel and much simpler by avoiding unneeded calculations. Each step of the algorithm is completely independent of the others, so no serial bottleneck exists anywhere in the clustering procedure. Locality preservation also filters out outliers and enhances the robustness of the approach. Density is defined on the basis of the Ordered Weighted Averaging (OWA) distance, which makes clusters more homogeneous. Local density peaks are then detected adaptively according to the density of each node. Merging the local peaks yields the final cluster centers, and every other data point is assigned to the cluster with the nearest center. The method has been implemented and compared with similar recently published work. The cluster validity indexes it achieves show better precision and noise robustness than recent work, and the comparison also shows advantages in scalability, performance, and computation cost. The method is a general clustering approach; gene expression clustering is presented as a sample application.
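The pipeline described in this abstract (hash into partitions, estimate density per partition, pick local peaks, assign points to the nearest center) can be sketched on a single machine as below. This is only an illustration under stated assumptions, not the authors' implementation: plain sign-random-projection LSH stands in for the paper's Bayesian LSH, the Gaussian-kernel density and the `cutoff` parameter are assumptions, the peak-merging step is omitted, and all function names are hypothetical.

```python
import math
import random
from collections import defaultdict


def random_hyperplanes(count, dim, seed=0):
    """Draw `count` random hyperplane normals (assumed LSH family)."""
    rng = random.Random(seed)
    return [tuple(rng.gauss(0, 1) for _ in range(dim)) for _ in range(count)]


def lsh_key(point, hyperplanes):
    """Sign-random-projection hash: one bit per hyperplane."""
    return tuple(
        1 if sum(p * h for p, h in zip(point, plane)) >= 0 else 0
        for plane in hyperplanes
    )


def owa_distance(a, b, weights):
    """Ordered weighted average of per-dimension absolute differences.

    Differences are sorted in descending order before weighting, so each
    weight attaches to a rank (largest gap first), not to a fixed dimension.
    """
    diffs = sorted((abs(x - y) for x, y in zip(a, b)), reverse=True)
    return sum(w * d for w, d in zip(weights, diffs))


def local_peaks(points, hyperplanes, weights, cutoff):
    """Partition points by LSH key and return the densest point per bucket."""
    buckets = defaultdict(list)
    for p in points:
        buckets[lsh_key(p, hyperplanes)].append(p)

    peaks = {}
    for key, members in buckets.items():
        # Assumed density: Gaussian kernel over OWA distances, computed
        # inside the bucket only, so buckets can be processed independently.
        def density(p):
            return sum(
                math.exp(-(owa_distance(p, q, weights) / cutoff) ** 2)
                for q in members
            )

        peaks[key] = max(members, key=density)
    return peaks


def assign(points, centers, weights):
    """Label each point with the index of its nearest center (OWA distance)."""
    return [
        min(range(len(centers)),
            key=lambda i: owa_distance(p, centers[i], weights))
        for p in points
    ]
```

Because each bucket is processed without reference to any other, the per-bucket loop is the part that would map onto independent Spark partitions. Calling `local_peaks(points, random_hyperplanes(4, dim), weights, cutoff)` gives one candidate center per non-empty bucket; a full version would still need to merge nearby peaks before calling `assign`.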

FSS-DKM: A Hybrid Big Data Clustering approach using feature subset selection and distance based k-means algorithm

Scientific Transactions in Environment and Technovation ◽

10.20894/stet.116.009.003.009 ◽

2016 ◽

Vol 9 (3) ◽

pp. 161-165

Author(s):

D. Anitha ◽

M.V. Srinath

Keyword(s):

Big Data ◽

Data Clustering ◽

Subset Selection ◽

Feature Subset Selection ◽

Feature Subset ◽

Clustering Approach

An Enhanced Hybrid Clustering Approach for Privacy Preservation (ECPS) in Big Data using Apache Spark Framework

Journal of Testing and Evaluation ◽

10.1520/jte20180414 ◽

2019 ◽

Vol 47 (6) ◽

pp. 20180414

Author(s):

Revathy Swaminathan ◽

Arunkumar Thangavelu

Keyword(s):

Big Data ◽

Privacy Preservation ◽

Apache Spark ◽

Hybrid Clustering ◽

Clustering Approach ◽

Spark Framework

An Introduction to Clustering Algorithms in Big Data

Encyclopedia of Information Science and Technology, Fifth Edition - Advances in Information Quality and Management ◽

10.4018/978-1-7998-3479-3.ch040 ◽

2021 ◽

pp. 559-576

Author(s):

Rajit Nair ◽

Amit Bhagat

Keyword(s):

Big Data ◽

Single Machine ◽

Data Clustering ◽

Clustering Algorithms ◽

Time Limit ◽

Computation Cost ◽

Different Types ◽

Clustering Approach ◽

Future Path ◽

Parallel Clustering

Clustering is one of the main processes through which big data is analyzed. Because the data is so large, clustering it is very difficult. Big data is typically measured in petabytes and zettabytes, and forming clusters at that scale carries a high computation cost. In this chapter, the authors show how clustering can be performed on big data and describe the different types of clustering approach. The main challenge during clustering is to find observations within the time limit. The chapter covers single-machine clustering and multiple-machine clustering, including parallel clustering, as well as a possible future path toward more advanced clustering algorithms.

Big data clustering techniques based on Spark: a literature review

PeerJ Computer Science ◽

10.7717/peerj-cs.321 ◽

2020 ◽

Vol 6 ◽

pp. e321

Author(s):

Mozamel M. Saeed ◽

Zaher Al Aghbari ◽

Mohammed Alsharidah

Keyword(s):

Big Data ◽

Data Clustering ◽

Apache Spark ◽

Mining Machine ◽

Massive Data ◽

Clustering Methods ◽

Research Directions ◽

Massive Growth ◽

New Research ◽

Massive Data Processing

Clustering, a popular unsupervised learning method, is used extensively in data mining, machine learning and pattern recognition. The procedure groups distinct points so that the points in a cluster are similar to each other and dissimilar to the points of other clusters. Traditional clustering methods are greatly challenged by the recent massive growth of data. Therefore, several research works have proposed novel designs for clustering methods that leverage the benefits of Big Data platforms such as Apache Spark, which is designed for fast, distributed processing of massive data. However, Spark-based clustering research is still in its early days. In this systematic survey, we investigate the existing Spark-based clustering methods in terms of their support for the characteristics of Big Data. Moreover, we propose a new taxonomy for the Spark-based clustering methods. To the best of our knowledge, no survey has been conducted on Spark-based clustering of Big Data. Therefore, this survey aims to present a comprehensive summary of the previous studies in the field of Big Data clustering using Apache Spark during the span of 2010–2020. This survey also highlights new research directions in the field of clustering massive data.

Fractional Fuzzy Clustering and Particle Whale Optimization-Based MapReduce Framework for Big Data Clustering

Journal of Intelligent Systems ◽

10.1515/jisys-2018-0117 ◽

2019 ◽

Vol 29 (1) ◽

pp. 1496-1513 ◽

Cited By ~ 1

Author(s):

Omkaresh Kulkarni ◽

Sudarson Jena ◽

C. H. Sanjay

Keyword(s):

Big Data ◽

Data Clustering ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Mapreduce Framework ◽

Swarm Optimization ◽

Skin Segmentation ◽

Kernel Clustering ◽

Whale Optimization ◽

Clustering Approach

Recent advancements in information technology and the web have increased the volume of data used in day-to-day life. The result is a big data era, which has become a key research issue because big data is complex to analyze. This paper presents a technique called FPWhale-MRF for big data clustering using the MapReduce framework (MRF), and proposes two clustering algorithms. In FPWhale-MRF, the mapper function estimates the cluster centroids using the Fractional Tangential-Spherical Kernel clustering algorithm, which is developed by integrating fractional theory into a Tangential-Spherical Kernel clustering approach. The reducer combines the mapper outputs to find the optimal centroids using the proposed Particle-Whale (P-Whale) algorithm. The P-Whale algorithm combines the Whale Optimization Algorithm with Particle Swarm Optimization to improve clustering performance. Two datasets, the localization and skin segmentation datasets, are used for the experiments, and performance is evaluated with two metrics: clustering accuracy and DB-index. The maximum accuracy attained by the proposed FPWhale-MRF technique is 87.91% and 90% for the localization and skin segmentation datasets, respectively, which supports its effectiveness in big data clustering.
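The DB-index named as an evaluation metric above is presumably the Davies-Bouldin index, which scores a clustering by comparing within-cluster scatter to between-centroid separation (lower is better). The sketch below shows the standard definition with Euclidean distance; the abstract does not say which distance or variant the authors used, so treat those choices and the function names as assumptions.

```python
import math


def centroid(points):
    """Coordinate-wise mean of a non-empty list of equal-length tuples."""
    return tuple(sum(col) / len(points) for col in zip(*points))


def davies_bouldin(clusters):
    """Davies-Bouldin index for a list of clusters (each a list of points).

    For every cluster i, take the worst-case ratio
    (scatter_i + scatter_j) / distance(centroid_i, centroid_j) over j != i,
    then average those worst cases. Requires at least two clusters with
    distinct centroids.
    """
    cents = [centroid(c) for c in clusters]
    # Scatter: mean Euclidean distance from each point to its own centroid.
    scatter = [
        sum(math.dist(p, c) for p in pts) / len(pts)
        for pts, c in zip(clusters, cents)
    ]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        total += max(
            (scatter[i] + scatter[j]) / math.dist(cents[i], cents[j])
            for j in range(k)
            if j != i
        )
    return total / k
```

For two clusters whose points each sit 1 unit from their centroid, with centroids 10 units apart, the index is (1 + 1) / 10 = 0.2; spreading the points out while keeping the centroids fixed raises it.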

Exploring Apache Spark Data APIs for Water Big Data Management

Advances in Intelligent Systems and Computing - Advanced Intelligent Systems for Sustainable Development (AI2SD’2018) ◽

10.1007/978-3-030-11881-5_10 ◽

2019 ◽

pp. 105-117

Author(s):

Nassif El Hassane ◽

Hicham Hajji

Keyword(s):

Big Data ◽

Data Management ◽

Apache Spark

Big data Predictive Analytics for Apache Spark using Machine Learning

2020 Global Conference on Wireless and Optical Technologies (GCWOT) ◽

10.1109/gcwot49901.2020.9391620 ◽

2020 ◽

Author(s):

Muhammad Junaid ◽

Shiraz Ali Wagan ◽

Nawab Muhammad Faseeh Qureshi ◽

Choon Sung Nam ◽

Dong Ryeol Shin

Keyword(s):

Machine Learning ◽

Big Data ◽

Predictive Analytics ◽

Apache Spark

Ensembled Adaptive Fuzzy K-Means With Stochastic Extreme Gradient Boost Big Data Clustering on Geo-Social Networks

2021 International Conference on Advance Computing and Innovative Technologies in Engineering (ICACITE) ◽

10.1109/icacite51222.2021.9404574 ◽

2021 ◽

Author(s):

M. Anoop ◽

P. Sripriya

Keyword(s):

Social Networks ◽

Big Data ◽

Data Clustering ◽

Adaptive Fuzzy

A Distributed Rough Set Theory Algorithm based on Locality Sensitive Hashing for an Efficient Big Data Pre-processing

2018 IEEE International Conference on Big Data (Big Data) ◽

10.1109/bigdata.2018.8622024 ◽

2018 ◽

Cited By ~ 1

Author(s):

Zaineb Chelly Dagdia ◽

Christine Zarges ◽

Gael Beck ◽

Hanene Azzag ◽

Mustapha Lebbah

Keyword(s):

Big Data ◽

Set Theory ◽

Rough Set ◽

Rough Set Theory ◽

Locality Sensitive Hashing

Approx-SMOTE: Fast SMOTE for Big Data on Apache Spark

Neurocomputing ◽

10.1016/j.neucom.2021.08.086 ◽

2021 ◽

Vol 464 ◽

pp. 432-437

Author(s):

Mario Juez-Gil ◽

Álvar Arnaiz-González ◽

Juan J. Rodríguez ◽

Carlos López-Nozal ◽

César García-Osorio

Keyword(s):

Big Data ◽

Apache Spark