Implementation of Parallelized K-means and K-Medoids++ Clustering Algorithms on Hadoop Map Reduce Framework

Electronic information from online newspapers, journals, conference proceedings, web pages, and email is growing rapidly, generating huge amounts of data. Data clustering has received considerable attention in many applications, and because data volumes grow exponentially with technological advancement, clustering very large datasets has become a challenging problem. To deal with this problem, many researchers attempt to design efficient parallel clustering algorithms on Hadoop. In this paper, we present the implementation of parallelized K-Means and parallelized K-Medoids algorithms for clustering large files of data objects, based on MapReduce. The proposed algorithms combine an initialization step with the MapReduce framework to reduce the number of iterations, and they scale well on commodity hardware for efficient large-dataset processing. The paper reports the implementation and results of each algorithm.
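The abstract describes the map and reduce phases only at a high level. Below is a minimal single-machine sketch of how K-Means decomposes into those two phases; it is not the paper's actual Hadoop implementation, just an illustration of the map/reduce split under the usual formulation (map emits nearest-centroid/point pairs, reduce recomputes centroids):

```python
from math import dist

def map_phase(points, centroids):
    """Mapper: emit (nearest-centroid-index, point) key-value pairs."""
    pairs = []
    for p in points:
        idx = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
        pairs.append((idx, p))
    return pairs

def reduce_phase(pairs, old_centroids):
    """Reducer: recompute each centroid as the mean of its assigned points."""
    groups = {i: [] for i in range(len(old_centroids))}
    for idx, p in pairs:
        groups[idx].append(p)
    new_centroids = []
    for i, old in enumerate(old_centroids):
        g = groups[i]
        # keep the old centroid if no points were assigned to it
        new_centroids.append(
            tuple(sum(c) / len(g) for c in zip(*g)) if g else old)
    return new_centroids

def mapreduce_kmeans(points, centroids, max_iter=20):
    """Iterate map + reduce rounds until the centroids stop moving."""
    for _ in range(max_iter):
        new = reduce_phase(map_phase(points, centroids), centroids)
        if new == centroids:
            break
        centroids = new
    return centroids
```

On Hadoop, the map phase would run over data splits in parallel and the reduce phase would aggregate partial sums per centroid key; the sketch folds both into plain functions for clarity.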

2017 ◽  
Vol 7 (1.3) ◽  
pp. 37
Author(s):  
Joy Christy A.

Data mining refers to the extraction of meaningful knowledge from large data sources, which may contain hidden potential facts. In general, data mining analysis is either predictive or descriptive. Predictive analysis draws inferences from existing results in order to anticipate future outputs, while descriptive analysis characterizes the intrinsic nature of the data. Clustering is a descriptive data mining technique that groups objects of similar types so that objects in a cluster are closer to each other than to objects in other clusters. K-means is the most popular and widely used clustering algorithm. It starts by selecting k random initial centroids, where k is the number of clusters given by the user. It then computes the distance between the initial centroids and the remaining data objects and assigns each data object to the cluster whose centroid is at minimum distance. This process is repeated until there is no change in the cluster centroids or cluster members. However, k-means still suffers from several issues: choosing the optimum number of clusters k, sensitivity to random initial centroids, an unknown number of iterations, lack of globally optimal solutions, and, more importantly, the creation of meaningful clusters when analyzing datasets from various domains. The accuracy of clustering should never be compromised. Thus, in this paper, a novel classification-via-clustering algorithm called Iterative Linear Regression Clustering with Percentage Split Distribution (ILRCPSD) is introduced as an alternative solution to the problems encountered in traditional clustering algorithms. The proposed algorithm is examined on an educational dataset to identify hidden groups of students having similar cognitive and competency skills.
Its performance is compared with the accuracy of traditional k-means clustering in terms of building meaningful clusters, to demonstrate its real-world usefulness.
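The k-means procedure described in the abstract (k random initial centroids, nearest-centroid assignment, recomputation until memberships stop changing) can be sketched as follows. This is the baseline the paper compares against, not the proposed ILRCPSD algorithm, which is not specified here:

```python
import random
from math import dist

def kmeans(points, k, seed=0, max_iter=100):
    """Classic k-means: pick k random initial centroids, then alternate
    nearest-centroid assignment and centroid recomputation until the
    cluster memberships no longer change."""
    rng = random.Random(seed)
    centroids = [tuple(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda i: dist(p, centroids[i]))
                      for p in points]
        if new_labels == labels:        # no membership change: converged
            break
        labels = new_labels
        for i in range(k):
            members = [p for p, l in zip(points, labels) if l == i]
            if members:                 # keep old centroid if cluster empties
                centroids[i] = tuple(sum(c) / len(members)
                                     for c in zip(*members))
    return labels, centroids
```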


Author(s):  
Amrit Pal ◽  
Manish Kumar

The size of data is increasing, creating challenges for its processing and storage. Cluster-based techniques are available for storing and processing this huge amount of data. MapReduce provides an effective programming framework for developing distributed programs that perform tasks whose results are expressed as key-value pairs. Collaborative filtering is the process of making recommendations based on users' previous ratings for particular items or services. Implementing collaborative filtering on top of these distributed models poses challenges, and several techniques are available for doing so, including cluster-based collaborative filtering and MapReduce-based collaborative filtering. This chapter addresses these techniques along with the basics of collaborative filtering.
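As a minimal illustration of the rating-based collaborative filtering the chapter builds on (not the chapter's distributed implementation), a user-based predictor with Pearson similarity might look like this; the ratings layout (user → item → rating dict) is an assumption for the sketch:

```python
from math import sqrt

def pearson(u, v, ratings):
    """Pearson correlation between two users over their co-rated items."""
    common = set(ratings[u]) & set(ratings[v])
    if len(common) < 2:
        return 0.0
    mu_u = sum(ratings[u][i] for i in common) / len(common)
    mu_v = sum(ratings[v][i] for i in common) / len(common)
    num = sum((ratings[u][i] - mu_u) * (ratings[v][i] - mu_v) for i in common)
    den = (sqrt(sum((ratings[u][i] - mu_u) ** 2 for i in common)) *
           sqrt(sum((ratings[v][i] - mu_v) ** 2 for i in common)))
    return num / den if den else 0.0

def predict(user, item, ratings):
    """Predict user's rating for item as the user's mean rating plus a
    similarity-weighted average of other users' deviations on that item."""
    mu = sum(ratings[user].values()) / len(ratings[user])
    num = den = 0.0
    for other in ratings:
        if other == user or item not in ratings[other]:
            continue
        sim = pearson(user, other, ratings)
        mu_o = sum(ratings[other].values()) / len(ratings[other])
        num += sim * (ratings[other][item] - mu_o)
        den += abs(sim)
    return mu + num / den if den else mu
```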


2020 ◽  
Vol 1 (1) ◽  
pp. 55-74
Author(s):  
Hiroaki Shiokawa ◽  
Tomohiro Matsushita ◽  
Hiroyuki Kitagawa

Affinity Propagation is one of the fundamental clustering algorithms used in various Web-based systems and applications. Although Affinity Propagation finds highly accurate clusters, it is computationally expensive to apply to a large dataset, since it requires iterative computations over all possible pairs of data objects. To address this issue, this paper presents an efficient Affinity Propagation algorithm, named C-AP. To increase clustering speed, C-AP employs a cell-based index to reduce the number of data object pairs computed during the clustering procedure. Using the cell-based index, C-AP efficiently detects unnecessary pairs that do not contribute to the clustering result. To further reduce computation time, we also present an extension of our algorithm, named Parallel C-AP, that uses thread-parallelization techniques. As a result, C-AP and Parallel C-AP detect the same clusters as Affinity Propagation in much shorter computation time. Extensive evaluations demonstrate the performance superiority of the proposed algorithms over state-of-the-art alternatives.
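The paper's cell-based index is not reproduced here; the following is only a hedged sketch of the general idea, assuming 2-D data and a distance cutoff: hash points into grid cells sized by the cutoff, then enumerate candidate pairs only within the same or neighboring cells, so distant pairs are never examined:

```python
from collections import defaultdict
from itertools import product
from math import dist, floor

def build_cell_index(points, cell_size):
    """Hash each 2-D point id into a square grid cell."""
    index = defaultdict(list)
    for pid, (x, y) in enumerate(points):
        index[(floor(x / cell_size), floor(y / cell_size))].append(pid)
    return index

def candidate_pairs(points, cutoff):
    """Return only the pairs within `cutoff`, examining same and
    neighboring cells instead of all O(n^2) pairs."""
    index = build_cell_index(points, cutoff)
    pairs = set()
    for (cx, cy), ids in index.items():
        for dx, dy in product((-1, 0, 1), repeat=2):
            for j in index.get((cx + dx, cy + dy), []):
                for i in ids:
                    if i < j and dist(points[i], points[j]) <= cutoff:
                        pairs.add((i, j))
    return pairs
```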


Author(s):  
Raymond Greenlaw ◽  
Sanpawat Kantabutra

This chapter provides the reader with an introduction to clustering algorithms and applications. A number of important, well-known clustering methods are surveyed. The authors present a brief history of the development of the field, discuss various types of clustering, and mention some of the current research directions. Algorithms are described for top-down and bottom-up hierarchical clustering, as well as for K-Means and K-Medians clustering. The technique of representative points is also presented. Given the large datasets involved, the need to apply parallel computing to clustering arises, so the authors discuss issues related to parallel clustering as well. Throughout the chapter, references are provided to works that contain a large number of experimental results. A comparison of the various clustering methods is given in tabular format. The chapter concludes with a summary and an extensive list of references.


2013 ◽  
Vol 5 (2) ◽  
pp. 136-143 ◽  
Author(s):  
Astha Mehra ◽  
Sanjay Kumar Dubey

In today's world, data is produced every day at a phenomenal rate, and we are required to store this ever-growing data on an almost daily basis. Although our ability to store this huge volume of data has grown, the problem arises when users expect sophisticated information from it. This can be achieved by uncovering the hidden information in the raw data, which is the purpose of data mining. Data mining, or knowledge discovery, is the computer-assisted process of digging through and analyzing enormous sets of data and extracting their meaning. The raw, unlabeled data in large databases can initially be classified in an unsupervised manner by means of cluster analysis. Cluster analysis is the process of finding groups of objects such that the objects in a group are similar to one another and dissimilar from the objects in other groups; these groups are known as clusters. In other words, clustering organizes data objects into groups whose members share some similarity. Applications of clustering include marketing (finding groups of customers with similar behavior), biology (classifying plants and animals by their features), data analysis, earthquake studies (observing epicenters to identify dangerous zones), and the WWW (document classification). The quality and efficiency of the clustering process are generally determined by the clustering algorithm used. The aim of this research paper is to compare two important clustering algorithms, namely centroid-based K-means and X-means. The performance of the algorithms is evaluated over different program executions on the same input dataset, and analyzed and compared on the basis of the quality of the clustering outputs, the number of iterations, and cut-off factors.
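X-means differs from K-means chiefly in choosing the number of clusters k itself, via a model-selection score such as BIC. The sketch below illustrates that selection step with a simplified BIC of the form n·log(RSS/n) + k·d·log(n); this is an assumption for illustration, not the exact score or implementation used in the paper's experiments:

```python
import random
from math import dist, log

def kmeans(points, k, seed=0, iters=50):
    """Plain k-means with random initial centroids from the data."""
    rng = random.Random(seed)
    cents = [tuple(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[min(range(k), key=lambda i: dist(p, cents[i]))].append(p)
        cents = [tuple(sum(c) / len(g) for c in zip(*g)) if g else cents[i]
                 for i, g in enumerate(groups)]
    return cents

def bic_score(points, cents):
    """Rough BIC: fit term (log mean squared error) plus a penalty that
    grows with the number of free centroid parameters."""
    n, d = len(points), len(points[0])
    rss = sum(min(dist(p, c) ** 2 for c in cents) for p in points)
    return n * log(rss / n + 1e-12) + len(cents) * d * log(n)

def choose_k(points, k_max=5):
    """X-means-style model selection: try k = 1..k_max, keep the best BIC."""
    return min(range(1, k_max + 1),
               key=lambda k: bic_score(points, kmeans(points, k)))
```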


2013 ◽  
Vol 12 (5) ◽  
pp. 3443-3451
Author(s):  
Rajesh Pasupuleti ◽  
Narsimha Gugulothu

Cluster analysis initiates a new direction in data mining that has major impact in various domains, including machine learning, pattern recognition, image processing, information retrieval, and bioinformatics. Current clustering techniques do not adequately address some of these requirements and have failed to standardize clustering algorithms that support all real applications. Many clustering methods depend on user-specified parameters, and the initial cluster seeds are selected randomly by the user. In this paper, we propose a new clustering method based on a linear approximation of the clustering function: rather than grouping data objects into clusters using distance measures, similarity measures, or statistical distributions as in traditional clustering methods, we obtain an overall idea of the behavior of the clustering function, pick the initial cluster seeds as points on the linear approximation line, and then perform the clustering operations. Experimental results on an example of business data show that clusters based on linear approximation yield good results in practice. We also explain privacy-preserving clustering of sensitive data objects.
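The abstract gives no pseudocode for its seeding step. Assuming, purely for illustration, that "points on the linear approximation line" means evenly spaced points along a least-squares fit of 2-D data, the seed selection might be sketched as:

```python
def linear_approx_seeds(points, k):
    """Fit a least-squares line y = a*x + b to 2-D data, then take k
    evenly spaced points on that line (between min and max x) as the
    initial cluster seeds. Hypothetical reading of the paper's idea."""
    n = len(points)
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    a = sxy / sxx if sxx else 0.0       # slope of the fitted line
    b = my - a * mx                     # intercept
    lo, hi = min(xs), max(xs)
    step = (hi - lo) / (k - 1) if k > 1 else 0.0
    return [(lo + i * step, a * (lo + i * step) + b) for i in range(k)]
```

The seeds would then be handed to any standard clustering loop in place of random initial centroids.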


Author(s):  
B. K. Tripathy ◽  
Hari Seetha ◽  
M. N. Murty

Data clustering plays a very important role in data mining, machine learning, and image processing. As modern databases have inherent uncertainties, many uncertainty-based data clustering algorithms have been developed in this direction. These include fuzzy c-means, rough c-means, intuitionistic fuzzy c-means, and algorithms based on hybrid models, such as rough fuzzy c-means and rough intuitionistic fuzzy c-means. There are also many variants that improve these algorithms in different directions, such as their kernelised versions, possibilistic versions, and possibilistic kernelised versions. However, for various reasons, the above algorithms are not effective on big data. Researchers have therefore been trying for the past few years to improve these algorithms so that they can be applied to cluster big data; such algorithms are still relatively few compared to those for datasets of reasonable size. Our aim in this chapter is to present the uncertainty-based clustering algorithms developed so far and to propose a few new algorithms that can be developed further.
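Of the algorithms listed above, fuzzy c-means is the common starting point: each point holds a degree of membership in every cluster rather than a hard assignment. A minimal pure-Python sketch of its standard membership and center updates (not a big-data implementation) is:

```python
import random
from math import dist

def fuzzy_c_means(points, c, m=2.0, iters=100, seed=0):
    """Fuzzy c-means: alternate membership-weighted center updates and
    distance-based membership updates (fuzzifier m > 1)."""
    rng = random.Random(seed)
    # random memberships, each point's row normalized to sum to 1
    U = []
    for _ in points:
        row = [rng.random() for _ in range(c)]
        s = sum(row)
        U.append([u / s for u in row])
    dims = len(points[0])
    for _ in range(iters):
        centers = []
        for j in range(c):
            w = [U[i][j] ** m for i in range(len(points))]
            tw = sum(w)
            centers.append(tuple(
                sum(w[i] * points[i][d] for i in range(len(points))) / tw
                for d in range(dims)))
        for i, p in enumerate(points):
            ds = [dist(p, cj) + 1e-12 for cj in centers]
            inv = [d ** (-2 / (m - 1)) for d in ds]
            s = sum(inv)
            U[i] = [v / s for v in inv]
    return U, centers
```

The rough and intuitionistic variants mentioned in the abstract replace or augment these membership degrees with lower/upper approximations or hesitation degrees.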


2020 ◽  
Vol 11 (3) ◽  
pp. 42-67
Author(s):  
Soumeya Zerabi ◽  
Souham Meshoul ◽  
Samia Chikhi Boucherkha

Cluster validation aims both to evaluate the results of clustering algorithms and to predict the number of clusters. It is usually achieved using several indexes. Traditional internal clustering validation indexes (CVIs) are mainly based on computing pairwise distances, which results in a quadratic complexity of the related algorithms. The existing CVIs cannot handle large datasets properly and need to be revisited to take account of the ever-increasing data volume; parallel and distributed implementations of these indexes are therefore required. To cope with this issue, the authors propose two parallel and distributed models for internal CVIs, namely for the Silhouette and Dunn indexes, using the MapReduce framework under Hadoop. The proposed models, termed MR_Silhouette and MR_Dunn, have been tested both for evaluating clustering results and for identifying the optimal number of clusters. The results of the experimental study are very promising and show that the proposed parallel and distributed models achieve the expected tasks successfully.
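For reference, the sequential Silhouette index that MR_Silhouette distributes computes, for each point, the mean distance a to its own cluster and the smallest mean distance b to any other cluster, then averages s = (b - a) / max(a, b). A minimal sketch (the pairwise distance loops are exactly what makes the index quadratic and worth parallelizing):

```python
from math import dist

def silhouette(points, labels):
    """Mean silhouette coefficient over all points."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    total = 0.0
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            continue  # convention: s(i) = 0 for singleton clusters
        # a: mean distance to the point's own cluster (excluding itself)
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to any other cluster
        b = min(sum(dist(p, q) for q in other) / len(other)
                for lo, other in clusters.items() if lo != l)
        total += (b - a) / max(a, b) if max(a, b) else 0.0
    return total / len(points)
```

Values lie in [-1, 1]; well-separated, compact clusterings score near 1, which is why the index doubles as a criterion for picking the number of clusters.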

