A Robust Distributed Clustering of Large Data Sets on a Grid of Commodity Machines

Data ◽  
2021 ◽  
Vol 6 (7) ◽  
pp. 73
Author(s):  
Salah Taamneh ◽  
Mo’taz Al-Hami ◽  
Hani Bani-Salameh ◽  
Alaa E. Abdallah

Distributed clustering algorithms have proven to be effective in dramatically reducing execution time. However, distributed environments are characterized by a high rate of failure. Nodes can easily become unreachable. Furthermore, it is not guaranteed that messages are delivered to their destination. As a result, fault tolerance mechanisms are of paramount importance to achieve resiliency and guarantee continuous progress. In this paper, a fault-tolerant distributed k-means algorithm is proposed on a grid of commodity machines. Machines in such an environment are connected in a peer-to-peer fashion and managed by a gossip protocol with the actor model used as the concurrency model. The fact that no synchronization is needed makes it a good fit for parallel processing. Using the passive replication technique for the leader node and the active replication technique for the workers, the system exhibited robustness against failures. The results showed that the distributed k-means algorithm with no fault-tolerant mechanisms achieved up to a 34% improvement over the Hadoop-based k-means algorithm, while the robust one achieved up to a 12% improvement. The experiments also showed that the overhead, using such techniques, was negligible. Moreover, the results indicated that losing up to 10% of the messages had no real impact on the overall performance.
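The core computation the paper distributes is the standard k-means loop: assign each point to its nearest centroid, then recompute centroids as cluster means. A minimal single-node sketch of that loop is given below; the actor-based workers, gossip membership, and replication mechanisms described in the abstract are not shown, and `kmeans` is an illustrative name, not the authors' code.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means on a list of equal-length tuples.

    The paper parallelises the assignment step across worker actors,
    with a leader aggregating partial sums into new centroids.
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        # Update step: each centroid becomes the mean of its cluster
        # (empty clusters keep their previous centroid).
        centroids = [
            tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl else centroids[j]
            for j, cl in enumerate(clusters)
        ]
    return centroids
```

Because each assignment is independent, workers can process disjoint partitions of the data and report only per-cluster sums and counts, which is what makes the algorithm a good fit for the message-passing setting described above.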

Author(s):  
B. K. Tripathy ◽  
Hari Seetha ◽  
M. N. Murty

Data clustering plays a very important role in data mining, machine learning, and image processing. As modern databases carry inherent uncertainties, many uncertainty-based clustering algorithms have been developed in this direction, including fuzzy c-means, rough c-means, and intuitionistic fuzzy c-means, as well as algorithms based on hybrid models such as rough fuzzy c-means and rough intuitionistic fuzzy c-means. There are also many variants that improve these algorithms in different directions, such as their kernelised versions, possibilistic versions, and possibilistic kernelised versions. However, none of the above algorithms is effective on big data, for various reasons, so researchers have for the past few years been trying to improve them so that they can be applied to cluster big data. Such algorithms remain relatively few compared with those for data sets of moderate size. Our aim in this chapter is to present the uncertainty-based clustering algorithms developed so far and to propose a few new algorithms that can be developed further.
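The family of algorithms listed above all revolve around soft membership degrees rather than hard assignments. As one concrete instance, the fuzzy c-means membership update can be sketched as follows (an illustrative stand-alone function with fuzzifier `m`, not code from the chapter):

```python
def fcm_memberships(points, centroids, m=2.0):
    """Fuzzy c-means membership update.

    Returns U where U[i][j] is the degree to which point j belongs to
    cluster i; for each point the memberships sum to 1 across clusters.
    """
    def d2(p, c):
        # Squared distance, floored to avoid division by zero when a
        # point coincides with a centroid.
        return sum((a - b) ** 2 for a, b in zip(p, c)) or 1e-12

    U = []
    for ci in centroids:
        row = []
        for p in points:
            # Standard FCM update: u_ij = 1 / sum_k (d_ij / d_kj)^(2/(m-1))
            # (here distances are already squared, hence exponent 1/(m-1)).
            denom = sum((d2(p, ci) / d2(p, ck)) ** (1.0 / (m - 1))
                        for ck in centroids)
            row.append(1.0 / denom)
        U.append(row)
    return U
```

The rough and intuitionistic variants mentioned above replace or augment these membership degrees with lower/upper approximations and hesitation degrees, respectively, but the overall iterate-memberships-then-centroids structure is the same.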


2020 ◽  
Vol 11 (3) ◽  
pp. 42-67
Author(s):  
Soumeya Zerabi ◽  
Souham Meshoul ◽  
Samia Chikhi Boucherkha

Cluster validation aims both to evaluate the results of clustering algorithms and to predict the number of clusters. It is usually achieved using several indexes. Traditional internal clustering validation indexes (CVIs) are mainly based on computing pairwise distances, which gives the related algorithms quadratic complexity. Existing CVIs cannot handle large data sets properly and need to be revisited to account for ever-increasing data set volumes; the design of parallel and distributed implementations of these indexes is therefore required. To cope with this issue, the authors propose two parallel and distributed models for internal CVIs, namely for the Silhouette and Dunn indexes, using the MapReduce framework under Hadoop. The proposed models, termed MR_Silhouette and MR_Dunn, have been tested both on evaluating clustering results and on identifying the optimal number of clusters. The results of the experimental study are very promising and show that the proposed parallel and distributed models achieve the expected tasks successfully.
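For reference, the sequential Silhouette index that MR_Silhouette distributes is defined per point as s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the mean intra-cluster distance and b(i) the lowest mean distance to any other cluster. A plain single-machine sketch (not the authors' MapReduce model) makes the quadratic pairwise-distance cost visible:

```python
import math
from collections import defaultdict

def silhouette(points, labels):
    """Mean silhouette coefficient over all points.

    Every point is compared against every other point, which is the
    O(n^2) cost the MapReduce formulation is designed to spread out.
    """
    idx = defaultdict(list)
    for i, l in enumerate(labels):
        idx[l].append(i)

    d = lambda i, j: math.dist(points[i], points[j])
    scores = []
    for i, l in enumerate(labels):
        same = [j for j in idx[l] if j != i]
        a = sum(d(i, j) for j in same) / len(same) if same else 0.0
        # b: smallest mean distance to the points of any other cluster.
        b = min(sum(d(i, j) for j in idx[k]) / len(idx[k])
                for k in idx if k != l)
        scores.append((b - a) / max(a, b) if max(a, b) else 0.0)
    return sum(scores) / len(scores)
```

In a MapReduce setting, the inner distance sums can be computed as partial aggregates per split and combined in the reduce phase, which is presumably the kind of decomposition the proposed models exploit.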


Author(s):  
SUNG-GI LEE ◽  
DEOK-KYUN YUN

In this paper, we present a concept based on the similarity of categorical attribute values considering implicit relationships and propose a new and effective clustering procedure for mixed data. Our procedure obtains similarities between categorical values from careful analysis and maps the values in each categorical attribute into points in two-dimensional coordinate space using multidimensional scaling. These mapped values make it possible to interpret the relationships between attribute values and to directly apply categorical attributes to clustering algorithms using a Euclidean distance. After trivial modifications, our procedure for clustering mixed data uses the k-means algorithm, well known for its efficiency in clustering large data sets. We use the familiar soybean disease and adult data sets to demonstrate the performance of our clustering procedure. The satisfactory results that we have obtained demonstrate the effectiveness of our algorithm in discovering structure in data.
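The first stage of the procedure above derives similarities between categorical values from how they co-occur with values of the other attributes. A hedged sketch of one such co-occurrence-profile similarity is shown below; the exact measure and the subsequent multidimensional-scaling embedding used in the paper are not reproduced, and `value_similarity` is an illustrative name.

```python
from collections import Counter

def value_similarity(rows, attr, other_attrs):
    """Build a similarity function over the values of one categorical
    attribute, based on overlap of their co-occurrence profiles.

    rows: list of dicts mapping attribute name -> categorical value.
    Returns sim(v1, v2) in [0, 1]; values that always appear in the
    same contexts score 1.0, values with disjoint contexts score 0.0.
    """
    profiles = {}
    for r in rows:
        v = r[attr]
        profiles.setdefault(v, Counter())
        for a in other_attrs:
            # Count how often value v co-occurs with each (attribute, value)
            # pair of the remaining attributes.
            profiles[v][(a, r[a])] += 1

    def sim(v1, v2):
        p, q = profiles[v1], profiles[v2]
        shared = sum(min(p[k], q[k]) for k in p)
        total = max(sum(p.values()), sum(q.values()))
        return shared / total if total else 0.0

    return sim
```

Once such similarities are in hand, multidimensional scaling can place each categorical value at a 2-D point so that Euclidean distances approximate the dissimilarities, after which k-means applies directly, as the abstract describes.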


2017 ◽  
Vol 14 (S339) ◽  
pp. 310-313
Author(s):  
R. Kgoadi ◽  
I. Whittingham ◽  
C. Engelbrecht

Clustering algorithms constitute a multi-disciplinary analytical tool commonly used to summarise large data sets. Astronomical classifications are based on similarity, where celestial objects are assigned to a specific class according to specific physical features. The aim of this project is to obtain relevant information from high-dimensional data (at least three input variables in a data-frame) derived from stellar light-curves using a number of clustering algorithms such as K-means and Expectation Maximisation. In addition to identifying the best performing algorithm, we also identify a subset of features that best define stellar groups. Three methodologies are applied to a sample of Kepler time series in the temperature range 6500–19,000 K. In that spectral range, at least four classes of variable stars are expected to be found: δ Scuti, γ Doradus, Slowly Pulsating B (SPB), and (the still equivocal) Maia stars.


2003 ◽  
Vol 24 (4) ◽  
pp. 351-363 ◽  
Author(s):  
Chih-Ping Wei ◽  
Yen-Hsien Lee ◽  
Che-Ming Hsu

Author(s):  
Gourav Bathla ◽  
Himanshu Aggarwal ◽  
Rinkle Rani

Clustering is one of the most important applications of data mining and has attracted the attention of researchers in statistics and machine learning. It is used in many applications such as information retrieval, image processing, and social network analytics, and it helps the user understand the similarity and dissimilarity between objects. Cluster analysis lets users understand complex and large data sets more clearly. Different types of clustering algorithms have been analysed by various researchers. K-means is the most popular partitioning-based algorithm, as it provides good results through accurate calculation on numerical data; however, it works well for numerical data only. Big data is a combination of numerical and categorical data. The k-prototype algorithm handles numerical as well as categorical data by combining the distances calculated from the numeric and categorical parts. With the growth of data from social networking websites, business transactions, scientific computation, and so on, there are vast collections of structured, semi-structured, and unstructured data, so k-prototype needs to be optimised to analyse these varieties of data efficiently. In this work, the k-prototype algorithm is implemented on MapReduce. Experiments have shown that k-prototype implemented on MapReduce gives better performance on multiple nodes than on a single node; CPU execution time and speedup are used as evaluation metrics for comparison. An intelligent splitter is also proposed, which splits mixed big data into its numerical and categorical parts. Comparison with traditional algorithms shows that the proposed algorithm works better for large-scale data.
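The combined distance at the heart of k-prototype can be sketched in a few lines: squared Euclidean distance on the numeric attributes plus a weighted count of mismatches on the categorical ones. The function below is a minimal illustration of that standard k-prototypes dissimilarity, not the paper's MapReduce implementation; the weight `gamma` balances the two parts.

```python
def kprototype_distance(p, q, num_idx, cat_idx, gamma=1.0):
    """k-prototypes dissimilarity between two mixed-type records.

    p, q: tuples holding both numeric and categorical fields.
    num_idx / cat_idx: positions of the numeric / categorical fields,
    which is exactly the split an "intelligent splitter" would produce.
    """
    # Numeric part: squared Euclidean distance.
    num = sum((p[i] - q[i]) ** 2 for i in num_idx)
    # Categorical part: simple matching (count of mismatched attributes).
    cat = sum(1 for i in cat_idx if p[i] != q[i])
    return num + gamma * cat
```

Because this distance decomposes per record pair, mapping records to their nearest prototype parallelises naturally across MapReduce splits, with the reduce phase recomputing prototypes (means for numeric fields, modes for categorical ones).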


Author(s):  
Jose Ruiz-Shulcloper ◽  
Guillermo Sanchez-Diaz ◽  
Mongi A. Abidi

In this chapter, we explore the possibilities of Logical Combinatorial Pattern Recognition (LCPR) tools for clustering large and very large mixed incomplete data (MID) sets. We start from the existence of a number of complex structures in large and very large data sets. Our research is directed towards applying the methods, the techniques, and, in general, the philosophy of LCPR to the solution of supervised and unsupervised classification problems. In this chapter, we introduce the GLC and DGLC clustering algorithms and the GLC+ clustering method for processing large and very large mixed incomplete data sets.


Author(s):  
D T Pham ◽  
A A Afify

Clustering is an important data exploration technique with many applications in different areas of engineering, including engineering design, manufacturing system design, quality assurance, production planning and process planning, modelling, monitoring, and control. The clustering problem has been addressed by researchers from many disciplines. However, efforts to perform effective and efficient clustering on large data sets only started in recent years with the emergence of data mining. The current paper presents an overview of clustering algorithms from a data mining perspective. Attention is paid to techniques of scaling up these algorithms to handle large data sets. The paper also describes a number of engineering applications to illustrate the potential of clustering algorithms as a tool for handling complex real-world problems.


2019 ◽  
Vol 22 (1) ◽  
pp. 55-58
Author(s):  
Nahla Ibraheem Jabbar

Our proposed method overcomes the drawbacks of computing parameter values in the mountain algorithm for image clustering. All existing clustering algorithms require parameter values to start the clustering process, and computing these parameters is a significant problem. One well-known clustering method is the mountain algorithm, which yields an expected number of clusters. In this paper we present a new modification of mountain clustering, called spatial modification of the parameters of the mountain image-clustering algorithm. This modification uses the spatial information of the image: a window mask is taken around each centre pixel, and the distances between the pixel and its neighbourhood are computed to estimate the parameter values σ and β that give a potentially optimal number of clusters for the image segmentation process. Our experiments show the ability of the proposed algorithm in brain image segmentation, with good quality on large data sets.
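For context, the mountain function whose parameters the paper estimates scores each candidate centre by summing Gaussian-decaying contributions from all data points; the candidate with the largest value becomes a cluster centre. The sketch below uses one common form of that function with a width parameter σ; the paper's window-mask estimation of σ and β is not shown.

```python
import math

def mountain_value(candidate, points, sigma):
    """Mountain function at one candidate centre.

    Each data point contributes exp(-d^2 / (2 * sigma^2)), so candidates
    sitting in dense regions accumulate large values. The quality of the
    result depends heavily on sigma, which is why estimating it from the
    data (as the paper proposes) matters.
    """
    return sum(
        math.exp(-sum((a - b) ** 2 for a, b in zip(candidate, p))
                 / (2.0 * sigma ** 2))
        for p in points
    )
```

After the first centre is selected, the classical algorithm subtracts a scaled mountain centred on it (this is where a second parameter such as β enters) and repeats, so poorly chosen parameters directly distort how many clusters emerge.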

