Implementation of Parallelized K-means and K-Medoids++ Clustering Algorithms on Hadoop Map Reduce Framework

Electronic information from online newspapers, journals, conference proceedings, web pages, and email is growing rapidly, generating huge amounts of data. Data clustering has received considerable attention in many applications, and because data volumes grow exponentially with technological advancement, clustering very large datasets has become a challenging problem. To deal with this problem, many researchers attempt to design efficient parallel clustering algorithms on Hadoop. In this paper, we present the implementation of parallelized K-Means and parallelized K-Medoids algorithms for clustering large files of data objects, based on MapReduce. The proposed algorithms combine an initialization step with the MapReduce framework to reduce the number of iterations, and they scale well on commodity hardware for efficient large-dataset processing. The paper reports the implementation and results of each algorithm.
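The abstract describes the map and reduce phases only at a high level. Below is a minimal single-machine sketch of how K-Means decomposes into those two phases; it is not the paper's actual Hadoop implementation, just an illustration of the map/reduce split under the usual formulation (map emits nearest-centroid/point pairs, reduce recomputes centroids):

```python
from math import dist

def map_phase(points, centroids):
    """Mapper: emit (nearest-centroid-index, point) key-value pairs."""
    pairs = []
    for p in points:
        idx = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
        pairs.append((idx, p))
    return pairs

def reduce_phase(pairs, old_centroids):
    """Reducer: recompute each centroid as the mean of its assigned points."""
    groups = {i: [] for i in range(len(old_centroids))}
    for idx, p in pairs:
        groups[idx].append(p)
    new_centroids = []
    for i, old in enumerate(old_centroids):
        g = groups[i]
        # keep the old centroid if no points were assigned to it
        new_centroids.append(
            tuple(sum(c) / len(g) for c in zip(*g)) if g else old)
    return new_centroids

def mapreduce_kmeans(points, centroids, max_iter=20):
    """Iterate map + reduce rounds until the centroids stop moving."""
    for _ in range(max_iter):
        new = reduce_phase(map_phase(points, centroids), centroids)
        if new == centroids:
            break
        centroids = new
    return centroids
```

On Hadoop, the map phase would run over data splits in parallel and the reduce phase would aggregate partial sums per centroid key; the sketch folds both into plain functions for clarity.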

2017 ◽  
Vol 7 (1.3) ◽  
pp. 37
Author(s):  
Joy Christy A.

Data mining refers to the extraction of meaningful knowledge from large data sources, which may contain hidden potential facts. In general, data mining analysis is either predictive or descriptive. Predictive analysis draws inferences from existing results in order to anticipate future outputs, while descriptive analysis characterizes the intrinsic nature of the data. Clustering is a descriptive data mining technique that groups objects of similar types so that objects in a cluster are closer to each other than to objects in other clusters. K-means is the most popular and widely used clustering algorithm. It starts by selecting k random initial centroids, where k is the number of clusters given by the user. It then computes the distance between the initial centroids and the remaining data objects and assigns each data object to the cluster whose centroid is at minimum distance. This process is repeated until there is no change in the cluster centroids or cluster members. However, k-means still suffers from several issues: choosing the optimum number of clusters k, sensitivity to random initial centroids, an unknown number of iterations, lack of globally optimal solutions, and, more importantly, the creation of meaningful clusters when analyzing datasets from various domains. The accuracy of clustering should never be compromised. Thus, in this paper, a novel classification-via-clustering algorithm called Iterative Linear Regression Clustering with Percentage Split Distribution (ILRCPSD) is introduced as an alternative solution to the problems encountered in traditional clustering algorithms. The proposed algorithm is examined on an educational dataset to identify hidden groups of students having similar cognitive and competency skills.
Its performance is compared with the accuracy of traditional k-means clustering in terms of building meaningful clusters, to demonstrate its real-world usefulness.
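The k-means procedure described in the abstract (k random initial centroids, nearest-centroid assignment, recomputation until memberships stop changing) can be sketched as follows. This is the baseline the paper compares against, not the proposed ILRCPSD algorithm, which is not specified here:

```python
import random
from math import dist

def kmeans(points, k, seed=0, max_iter=100):
    """Classic k-means: pick k random initial centroids, then alternate
    nearest-centroid assignment and centroid recomputation until the
    cluster memberships no longer change."""
    rng = random.Random(seed)
    centroids = [tuple(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda i: dist(p, centroids[i]))
                      for p in points]
        if new_labels == labels:        # no membership change: converged
            break
        labels = new_labels
        for i in range(k):
            members = [p for p, l in zip(points, labels) if l == i]
            if members:                 # keep old centroid if cluster empties
                centroids[i] = tuple(sum(c) / len(members)
                                     for c in zip(*members))
    return labels, centroids
```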


Author(s):  
Amrit Pal ◽  
Manish Kumar

The size of data is increasing, creating challenges for its processing and storage. Cluster-based techniques are available for storing and processing this huge amount of data. MapReduce provides an effective programming framework for developing distributed programs that perform tasks whose results are expressed as key-value pairs. Collaborative filtering is the process of making recommendations based on users' previous ratings for particular items or services. Implementing collaborative filtering on top of these distributed models poses challenges, and several techniques are available for doing so, including cluster-based collaborative filtering and MapReduce-based collaborative filtering. This chapter addresses these techniques along with the basics of collaborative filtering.
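As a minimal illustration of the rating-based collaborative filtering the chapter builds on (not the chapter's distributed implementation), a user-based predictor with Pearson similarity might look like this; the ratings layout (user → item → rating dict) is an assumption for the sketch:

```python
from math import sqrt

def pearson(u, v, ratings):
    """Pearson correlation between two users over their co-rated items."""
    common = set(ratings[u]) & set(ratings[v])
    if len(common) < 2:
        return 0.0
    mu_u = sum(ratings[u][i] for i in common) / len(common)
    mu_v = sum(ratings[v][i] for i in common) / len(common)
    num = sum((ratings[u][i] - mu_u) * (ratings[v][i] - mu_v) for i in common)
    den = (sqrt(sum((ratings[u][i] - mu_u) ** 2 for i in common)) *
           sqrt(sum((ratings[v][i] - mu_v) ** 2 for i in common)))
    return num / den if den else 0.0

def predict(user, item, ratings):
    """Predict user's rating for item as the user's mean rating plus a
    similarity-weighted average of other users' deviations on that item."""
    mu = sum(ratings[user].values()) / len(ratings[user])
    num = den = 0.0
    for other in ratings:
        if other == user or item not in ratings[other]:
            continue
        sim = pearson(user, other, ratings)
        mu_o = sum(ratings[other].values()) / len(ratings[other])
        num += sim * (ratings[other][item] - mu_o)
        den += abs(sim)
    return mu + num / den if den else mu
```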


2020 ◽  
Vol 1 (1) ◽  
pp. 55-74
Author(s):  
Hiroaki Shiokawa ◽  
Tomohiro Matsushita ◽  
Hiroyuki Kitagawa

Affinity Propagation is one of the fundamental clustering algorithms used in various Web-based systems and applications. Although Affinity Propagation finds highly accurate clusters, it is computationally expensive to apply to a large dataset, since it requires iterative computations over all possible pairs of data objects. To address this issue, this paper presents an efficient Affinity Propagation algorithm, named C-AP. To increase clustering speed, C-AP employs a cell-based index to reduce the number of data object pairs computed during the clustering procedure. Using the cell-based index, C-AP efficiently detects unnecessary pairs that do not contribute to the clustering result. To further reduce computation time, we also present an extension of our algorithm, named Parallel C-AP, that uses thread-parallelization techniques. As a result, C-AP and Parallel C-AP detect the same clusters as Affinity Propagation in much shorter computation time. Extensive evaluations demonstrate the performance superiority of the proposed algorithms over state-of-the-art alternatives.
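The paper's cell-based index is not reproduced here; the following is only a hedged sketch of the general idea, assuming 2-D data and a distance cutoff: hash points into grid cells sized by the cutoff, then enumerate candidate pairs only within the same or neighboring cells, so distant pairs are never examined:

```python
from collections import defaultdict
from itertools import product
from math import dist, floor

def build_cell_index(points, cell_size):
    """Hash each 2-D point id into a square grid cell."""
    index = defaultdict(list)
    for pid, (x, y) in enumerate(points):
        index[(floor(x / cell_size), floor(y / cell_size))].append(pid)
    return index

def candidate_pairs(points, cutoff):
    """Return only the pairs within `cutoff`, examining same and
    neighboring cells instead of all O(n^2) pairs."""
    index = build_cell_index(points, cutoff)
    pairs = set()
    for (cx, cy), ids in index.items():
        for dx, dy in product((-1, 0, 1), repeat=2):
            for j in index.get((cx + dx, cy + dy), []):
                for i in ids:
                    if i < j and dist(points[i], points[j]) <= cutoff:
                        pairs.add((i, j))
    return pairs
```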


Author(s):  
Raymond Greenlaw ◽  
Sanpawat Kantabutra

This chapter provides the reader with an introduction to clustering algorithms and applications. A number of important, well-known clustering methods are surveyed. The authors present a brief history of the development of the field, discuss various types of clustering, and mention some of the current research directions. Algorithms are described for top-down and bottom-up hierarchical clustering, as well as for K-Means and K-Medians clustering. The technique of representative points is also presented. Given the large datasets involved, the need to apply parallel computing to clustering arises, so the authors discuss issues related to parallel clustering as well. Throughout the chapter, references are provided to works that contain a large number of experimental results. A comparison of the various clustering methods is given in tabular format. The chapter concludes with a summary and an extensive list of references.


2013 ◽  
Vol 5 (2) ◽  
pp. 136-143 ◽  
Author(s):  
Astha Mehra ◽  
Sanjay Kumar Dubey

In today's world, data is produced every day at a phenomenal rate, and we are required to store this ever-growing data on an almost daily basis. Although our ability to store this huge volume of data has grown, the problem arises when users expect sophisticated information from it. This can be achieved by uncovering the hidden information in the raw data, which is the purpose of data mining. Data mining, or knowledge discovery, is the computer-assisted process of digging through and analyzing enormous sets of data and extracting their meaning. The raw, unlabeled data in large databases can initially be classified in an unsupervised manner by means of cluster analysis. Cluster analysis is the process of finding groups of objects such that the objects in a group are similar to one another and dissimilar from the objects in other groups; these groups are known as clusters. In other words, clustering organizes data objects into groups whose members share some similarity. Applications of clustering include marketing (finding groups of customers with similar behavior), biology (classifying plants and animals by their features), data analysis, earthquake studies (observing epicenters to identify dangerous zones), and the WWW (document classification). The quality and efficiency of the clustering process are generally determined by the clustering algorithm used. The aim of this research paper is to compare two important clustering algorithms, namely centroid-based K-means and X-means. The performance of the algorithms is evaluated over different program executions on the same input dataset, and analyzed and compared on the basis of the quality of the clustering outputs, the number of iterations, and cut-off factors.
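X-means differs from K-means chiefly in choosing the number of clusters k itself, via a model-selection score such as BIC. The sketch below illustrates that selection step with a simplified BIC of the form n·log(RSS/n) + k·d·log(n); this is an assumption for illustration, not the exact score or implementation used in the paper's experiments:

```python
import random
from math import dist, log

def kmeans(points, k, seed=0, iters=50):
    """Plain k-means with random initial centroids from the data."""
    rng = random.Random(seed)
    cents = [tuple(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[min(range(k), key=lambda i: dist(p, cents[i]))].append(p)
        cents = [tuple(sum(c) / len(g) for c in zip(*g)) if g else cents[i]
                 for i, g in enumerate(groups)]
    return cents

def bic_score(points, cents):
    """Rough BIC: fit term (log mean squared error) plus a penalty that
    grows with the number of free centroid parameters."""
    n, d = len(points), len(points[0])
    rss = sum(min(dist(p, c) ** 2 for c in cents) for p in points)
    return n * log(rss / n + 1e-12) + len(cents) * d * log(n)

def choose_k(points, k_max=5):
    """X-means-style model selection: try k = 1..k_max, keep the best BIC."""
    return min(range(1, k_max + 1),
               key=lambda k: bic_score(points, kmeans(points, k)))
```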


2013 ◽  
Vol 12 (5) ◽  
pp. 3443-3451
Author(s):  
Rajesh Pasupuleti ◽  
Narsimha Gugulothu

Cluster analysis initiates a new direction in data mining that has major impact in various domains, including machine learning, pattern recognition, image processing, information retrieval, and bioinformatics. Current clustering techniques do not adequately address some of these requirements and have failed to standardize clustering algorithms that support all real applications. Many clustering methods depend on user-specified parameters, and the initial cluster seeds are selected randomly by the user. In this paper, we propose a new clustering method based on a linear approximation of the clustering function: rather than grouping data objects into clusters using distance measures, similarity measures, or statistical distributions as in traditional clustering methods, we obtain an overall idea of the behavior of the clustering function, pick the initial cluster seeds as points on the linear approximation line, and then perform the clustering operations. Experimental results on an example of business data show that clusters based on linear approximation yield good results in practice. We also explain privacy-preserving clustering of sensitive data objects.
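The abstract gives no pseudocode for its seeding step. Assuming, purely for illustration, that "points on the linear approximation line" means evenly spaced points along a least-squares fit of 2-D data, the seed selection might be sketched as:

```python
def linear_approx_seeds(points, k):
    """Fit a least-squares line y = a*x + b to 2-D data, then take k
    evenly spaced points on that line (between min and max x) as the
    initial cluster seeds. Hypothetical reading of the paper's idea."""
    n = len(points)
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    a = sxy / sxx if sxx else 0.0       # slope of the fitted line
    b = my - a * mx                     # intercept
    lo, hi = min(xs), max(xs)
    step = (hi - lo) / (k - 1) if k > 1 else 0.0
    return [(lo + i * step, a * (lo + i * step) + b) for i in range(k)]
```

The seeds would then be handed to any standard clustering loop in place of random initial centroids.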


Author(s):  
B. K. Tripathy ◽  
Hari Seetha ◽  
M. N. Murty

Data clustering plays a very important role in data mining, machine learning, and image processing. As modern databases have inherent uncertainties, many uncertainty-based data clustering algorithms have been developed in this direction. These include fuzzy c-means, rough c-means, intuitionistic fuzzy c-means, and algorithms based on hybrid models, such as rough fuzzy c-means and rough intuitionistic fuzzy c-means. There are also many variants that improve these algorithms in different directions, such as their kernelised versions, possibilistic versions, and possibilistic kernelised versions. However, for various reasons, the above algorithms are not effective on big data. Researchers have therefore been trying for the past few years to improve these algorithms so that they can be applied to cluster big data; such algorithms are still relatively few compared to those for datasets of reasonable size. Our aim in this chapter is to present the uncertainty-based clustering algorithms developed so far and to propose a few new algorithms that can be developed further.
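Of the algorithms listed above, fuzzy c-means is the common starting point: each point holds a degree of membership in every cluster rather than a hard assignment. A minimal pure-Python sketch of its standard membership and center updates (not a big-data implementation) is:

```python
import random
from math import dist

def fuzzy_c_means(points, c, m=2.0, iters=100, seed=0):
    """Fuzzy c-means: alternate membership-weighted center updates and
    distance-based membership updates (fuzzifier m > 1)."""
    rng = random.Random(seed)
    # random memberships, each point's row normalized to sum to 1
    U = []
    for _ in points:
        row = [rng.random() for _ in range(c)]
        s = sum(row)
        U.append([u / s for u in row])
    dims = len(points[0])
    for _ in range(iters):
        centers = []
        for j in range(c):
            w = [U[i][j] ** m for i in range(len(points))]
            tw = sum(w)
            centers.append(tuple(
                sum(w[i] * points[i][d] for i in range(len(points))) / tw
                for d in range(dims)))
        for i, p in enumerate(points):
            ds = [dist(p, cj) + 1e-12 for cj in centers]
            inv = [d ** (-2 / (m - 1)) for d in ds]
            s = sum(inv)
            U[i] = [v / s for v in inv]
    return U, centers
```

The rough and intuitionistic variants mentioned in the abstract replace or augment these membership degrees with lower/upper approximations or hesitation degrees.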


2020 ◽  
Vol 11 (3) ◽  
pp. 42-67
Author(s):  
Soumeya Zerabi ◽  
Souham Meshoul ◽  
Samia Chikhi Boucherkha

Cluster validation aims both to evaluate the results of clustering algorithms and to predict the number of clusters. It is usually achieved using several indexes. Traditional internal clustering validation indexes (CVIs) are mainly based on computing pairwise distances, which results in a quadratic complexity of the related algorithms. The existing CVIs cannot handle large datasets properly and need to be revisited to take account of the ever-increasing data volume; parallel and distributed implementations of these indexes are therefore required. To cope with this issue, the authors propose two parallel and distributed models for internal CVIs, namely for the Silhouette and Dunn indexes, using the MapReduce framework under Hadoop. The proposed models, termed MR_Silhouette and MR_Dunn, have been tested both for evaluating clustering results and for identifying the optimal number of clusters. The results of the experimental study are very promising and show that the proposed parallel and distributed models achieve the expected tasks successfully.
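For reference, the sequential Silhouette index that MR_Silhouette distributes computes, for each point, the mean distance a to its own cluster and the smallest mean distance b to any other cluster, then averages s = (b - a) / max(a, b). A minimal sketch (the pairwise distance loops are exactly what makes the index quadratic and worth parallelizing):

```python
from math import dist

def silhouette(points, labels):
    """Mean silhouette coefficient over all points."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    total = 0.0
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            continue  # convention: s(i) = 0 for singleton clusters
        # a: mean distance to the point's own cluster (excluding itself)
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to any other cluster
        b = min(sum(dist(p, q) for q in other) / len(other)
                for lo, other in clusters.items() if lo != l)
        total += (b - a) / max(a, b) if max(a, b) else 0.0
    return total / len(points)
```

Values lie in [-1, 1]; well-separated, compact clusterings score near 1, which is why the index doubles as a criterion for picking the number of clusters.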

