Applying the K-Means Algorithm in Big Raw Data Sets with Hadoop and MapReduce

2013 ◽

pp. 23-46

Author(s):

Ilias K. Savvas ◽

Georgia N. Sofianidou ◽

M-Tahar Kechadi

Keyword(s):

Big Data ◽

Clustering Algorithm ◽

File System ◽

Large Data ◽

Large Data Sets ◽

Distributed File System ◽

Data Sets ◽

Raw Data ◽

Hadoop Distributed File System ◽

Access To Data

Big data refers to data sets whose size is beyond the capabilities of most current hardware and software technologies. The Apache Hadoop software library is a framework for distributed processing of large data sets, HDFS is a distributed file system that provides high-throughput access to data-driven applications, and MapReduce is a software framework for distributed computing on large data sets. Huge collections of raw data require fast and accurate mining processes in order to extract useful knowledge. One of the most popular data mining techniques is the K-means clustering algorithm. In this study, the authors develop a distributed version of the K-means algorithm using the MapReduce framework on the Hadoop Distributed File System. The theoretical and experimental results of the technique demonstrate its efficiency; thus, HDFS and MapReduce can be applied to big data with very promising results.
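
The abstract does not give the authors' implementation, so the following is only a minimal in-memory sketch of how one K-means iteration is commonly split into a map step (assign each point to its nearest centroid) and a reduce step (average the points assigned to each centroid). It does not use the Hadoop API; the function names (`kmeans_map`, `kmeans_reduce`, `kmeans_iteration`, `kmeans`) and the sample points are assumptions made for illustration.

```python
from collections import defaultdict
from math import dist


def kmeans_map(point, centroids):
    """Map step: emit (index of the nearest centroid, (point, 1))."""
    nearest = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
    return nearest, (point, 1)


def kmeans_reduce(values):
    """Reduce step: average all points emitted for one centroid index."""
    dims = len(values[0][0])
    total = [0.0] * dims
    count = 0
    for point, n in values:
        for d in range(dims):
            total[d] += point[d]
        count += n
    return tuple(t / count for t in total)


def kmeans_iteration(points, centroids):
    """One pass: map every point, group by key (the shuffle), then reduce."""
    groups = defaultdict(list)
    for p in points:
        key, value = kmeans_map(p, centroids)
        groups[key].append(value)
    # A centroid that received no points keeps its previous position.
    return [kmeans_reduce(groups[i]) if i in groups else centroids[i]
            for i in range(len(centroids))]


def kmeans(points, centroids, max_iter=20):
    """Repeat the map/reduce pass until the centroids stop changing."""
    for _ in range(max_iter):
        updated = kmeans_iteration(points, centroids)
        if updated == centroids:
            break
        centroids = updated
    return centroids


# Hypothetical sample data: two well-separated groups in the plane.
points = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (10.0, 9.0)]
final_centroids = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
print(final_centroids)
```

In a real MapReduce job the grouping by key would be done by the framework's shuffle phase, and each iteration would typically be a separate job that reads the previous centroids as side data.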

Big Data Clustering and Hadoop Distributed File System Architecture

Journal of Computational and Theoretical Nanoscience ◽

10.1166/jctn.2019.8256 ◽

2019 ◽

Vol 16 (9) ◽

pp. 3824-3829

Author(s):

Deepak Ahlawat ◽

Deepali Gupta

Keyword(s):

Big Data ◽

Clustering Algorithm ◽

File System ◽

Early Stage ◽

Large Data ◽

Data File ◽

Distributed File System ◽

Hadoop Distributed File System ◽

Data Files ◽

Technological World

Due to advances in the technological world, there is a great surge in data. The main sources of such large amounts of data are social websites, internet sites, etc. Large data files are combined to create a big data architecture. Managing data files at such a large volume is not easy, so modern techniques have been developed to manage bulk data. To arrange and utilize such big data, the Hadoop Distributed File System (HDFS) architecture from Hadoop was presented in the early stage of 2015. This architecture is used when traditional methods are insufficient to manage the data. In this paper, a novel clustering algorithm is implemented to manage a large amount of data. The concepts and frameworks of Big Data are studied. The novel algorithm is developed using K-means and cosine-based similarity clustering, and it is evaluated using the precision and recall parameters. The results obtained indicate that the algorithm successfully manages the big data issue.
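
The abstract names cosine-based similarity and the precision and recall measures without defining them. As a minimal sketch of the standard definitions only (how the paper combines them with K-means is not stated here), the helper names `cosine_similarity` and `precision_recall` and the sample values below are assumptions for illustration.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def precision_recall(predicted, relevant):
    """Precision and recall of a predicted item set against a reference set."""
    predicted, relevant = set(predicted), set(relevant)
    true_positives = len(predicted & relevant)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical values: parallel vectors score close to 1, orthogonal ones 0.
parallel = cosine_similarity((1.0, 2.0), (2.0, 4.0))
orthogonal = cosine_similarity((1.0, 0.0), (0.0, 1.0))
precision, recall = precision_recall({1, 2, 3, 4}, {2, 3, 5})
```

Because cosine similarity ignores vector length, it is often preferred over Euclidean distance when items are represented by term counts or other magnitude-sensitive features.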

Big Data Performance Analysis on a Hadoop Distributed File System Based on Modified Partitional Clustering Algorithm

Sustainable Communication Networks and Application - Lecture Notes on Data Engineering and Communications Technologies ◽

10.1007/978-3-030-34515-0_48 ◽

2019 ◽

pp. 461-468

Author(s):

V. Santhana Marichamy ◽

V. Natarajan

Keyword(s):

Big Data ◽

Performance Analysis ◽

Clustering Algorithm ◽

File System ◽

Distributed File System ◽

Partitional Clustering ◽

Hadoop Distributed File System

A Study on Security Approaches for Big Data Hadoop Distributed File System

Journal of Engineering and Applied Sciences ◽

10.36478/jeasci.2019.8266.8272 ◽

2019 ◽

Vol 14 (22) ◽

pp. 8266-8272

Author(s):

Leelavathi . ◽

M. Elshayeb

Keyword(s):

Big Data ◽

File System ◽

Distributed File System ◽

Hadoop Distributed File System

Summary of Affinity Propagation

Advanced Materials Research ◽

10.4028/www.scientific.net/amr.268-270.811 ◽

2011 ◽

Vol 268-270 ◽

pp. 811-816

Author(s):

Yong Zhou ◽

Yan Xing

Keyword(s):

Clustering Algorithm ◽

Large Data ◽

Large Data Sets ◽

Affinity Propagation ◽

Damping Factor ◽

Data Sets ◽

Similarity Matrix ◽

Data Points

Affinity Propagation (AP) is a new clustering algorithm based on the similarity matrix between pairs of data points; messages are exchanged between data points until a clustering result emerges. It is efficient and fast, and it can solve clustering on large data sets. However, traditional Affinity Propagation has many limitations. This paper introduces Affinity Propagation, analyzes its advantages and limitations in depth, and focuses on improvements to the algorithm: improving the similarity matrix, adjusting the preference and the damping factor, and combining it with other algorithms. Finally, it discusses the development of Affinity Propagation.
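
To make the roles of the similarity matrix, the preference, and the damping factor concrete, here is a minimal sketch of one round of the standard Affinity Propagation message updates (responsibilities, then availabilities). It is not taken from the paper; the function names, the one-dimensional sample points, and the preference value of -4 are assumptions for illustration.

```python
def update_responsibilities(S, A, R, damping=0.5):
    """r(i,k) = s(i,k) - max over k' != k of [a(i,k') + s(i,k')], then damped."""
    n = len(S)
    new_R = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            best_other = max(A[i][j] + S[i][j] for j in range(n) if j != k)
            raw = S[i][k] - best_other
            new_R[i][k] = damping * R[i][k] + (1 - damping) * raw
    return new_R


def update_availabilities(R, A, damping=0.5):
    """a(i,k) = min(0, r(k,k) + positive support from other points), damped.

    For the diagonal, a(k,k) is the sum of positive responsibilities sent to k.
    """
    n = len(R)
    new_A = [[0.0] * n for _ in range(n)]
    for k in range(n):
        for i in range(n):
            support = sum(max(0.0, R[j][k]) for j in range(n) if j != i and j != k)
            raw = support if i == k else min(0.0, R[k][k] + support)
            new_A[i][k] = damping * A[i][k] + (1 - damping) * raw
    return new_A


# Hypothetical data: similarity is the negative squared distance between
# 1-D points; the diagonal holds the preference (assumed to be -4 here).
xs = [0.0, 1.0, 5.0]
preference = -4.0
S = [[preference if i == k else -((xs[i] - xs[k]) ** 2) for k in range(3)]
     for i in range(3)]
zeros = [[0.0] * 3 for _ in range(3)]

R = update_responsibilities(S, zeros, zeros)
A = update_availabilities(R, zeros)
```

A full run would repeat both updates until the set of points with `R[k][k] + A[k][k] > 0` (the exemplars) stops changing. A larger damping factor slows the updates and reduces oscillation, while a higher preference tends to produce more clusters.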

Sensing Big Data: Multimodal Information Interfaces for Exploration of Large Data Sets

Big Data at Work ◽

10.4324/9781315780504-12 ◽

2015 ◽

pp. 172-192

Keyword(s):

Big Data ◽

Large Data ◽

Large Data Sets ◽

Data Sets ◽

Multimodal Information

A Detailed Study on Classification Algorithms in Big Data

Big Data Analytics for Sustainable Computing - Advances in Data Mining and Database Management ◽

10.4018/978-1-5225-9750-6.ch002 ◽

2020 ◽

pp. 30-46

Author(s):

Saranya N. ◽

Saravana Selvam

Keyword(s):

Big Data ◽

Random Forest ◽

Linear Regression ◽

Comprehensive Evaluation ◽

Large Data ◽

Large Data Sets ◽

Data Sets ◽

Classification Methods ◽

Computing Science ◽

Data Collections

After an era of managing data collection difficulties, the issue has now become how to process these vast amounts of information. Scientists and researchers think that probably the most essential topic in computing science today is Big Data. The term Big Data is used to describe a huge volume of data that can exist in any structure. This makes it difficult for standard approaches to mine the best possible data from such large data sets. Classification in Big Data is a procedure for summing up data sets based on various examples. There are distinct classification frameworks that help to classify data collections. The methods discussed in the chapter include Multi-Layer Perceptron, Linear Regression, C4.5, CART, J48, SVM, ID3, Random Forest, and KNN. The aim of this chapter is to provide a comprehensive evaluation of the classification methods that are commonly used.

Uncertainty-Based Clustering Algorithms for Large Data Sets

Modern Technologies for Big Data Classification and Clustering - Advances in Data Mining and Database Management ◽

10.4018/978-1-5225-2805-0.ch001 ◽

2018 ◽

pp. 1-33 ◽

Cited By ~ 1

Author(s):

B. K. Tripathy ◽

Hari Seetha ◽

M. N. Murty

Keyword(s):

Big Data ◽

Data Clustering ◽

Clustering Algorithms ◽

Large Data ◽

Large Data Sets ◽

Mining Machine ◽

Data Sets ◽

Fuzzy C Means ◽

Intuitionistic Fuzzy ◽

New Algorithms

Data clustering plays a very important role in data mining, machine learning, and image processing. As modern databases have inherent uncertainties, many uncertainty-based data clustering algorithms have been developed. These algorithms include fuzzy c-means, rough c-means, and intuitionistic fuzzy c-means, as well as algorithms based on hybrid models, such as rough fuzzy c-means and rough intuitionistic fuzzy c-means. There are also many variants that improve these algorithms in different directions, such as their kernelised, possibilistic, and possibilistic kernelised versions. However, none of the above algorithms is effective on big data, for various reasons. Researchers have therefore been trying for the past few years to improve these algorithms so that they can be applied to cluster big data. Such algorithms are relatively few in comparison to those for data sets of reasonable size. The aim of this chapter is to present the uncertainty-based clustering algorithms developed so far and to propose a few new algorithms that can be developed further.
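
Of the algorithms listed, fuzzy c-means is the base case: each point belongs to every cluster with a degree of membership instead of a hard assignment. The following is a minimal sketch of the two standard update rules, assuming Euclidean distance and a fuzzifier `m` greater than 1; the function names and sample values are assumptions for illustration and are not taken from the chapter.

```python
from math import dist


def fcm_memberships(point, centers, m=2.0):
    """Degrees of membership of one point in each cluster (they sum to 1)."""
    distances = [dist(point, c) for c in centers]
    # A point that coincides with one or more centers belongs fully to them.
    zero = [i for i, d in enumerate(distances) if d == 0.0]
    if zero:
        return [1.0 / len(zero) if i in zero else 0.0
                for i in range(len(centers))]
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_k) ** exponent for d_k in distances)
            for d_i in distances]


def fcm_centers(points, memberships, m=2.0):
    """New centers: means of the points weighted by membership ** m."""
    n_centers = len(memberships[0])
    dims = len(points[0])
    centers = []
    for i in range(n_centers):
        weights = [u[i] ** m for u in memberships]
        total = sum(weights)
        centers.append(tuple(
            sum(w * p[d] for w, p in zip(weights, points)) / total
            for d in range(dims)))
    return centers


# Hypothetical 1-D example: a point at 1 between centers at 0 and 4.
u = fcm_memberships((1.0,), [(0.0,), (4.0,)])
new_centers = fcm_centers([(0.0,), (2.0,)], [[0.5, 0.5], [0.5, 0.5]])
```

The full algorithm alternates these two updates until the memberships change by less than a chosen tolerance. Both updates touch every point and every center on each pass, which is one reason such methods are costly on very large data sets.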

Electronic Records Management - An Old Solution to a New Problem

Big Data ◽

10.4018/978-1-4666-9840-6.ch102 ◽

2016 ◽

pp. 2249-2274

Author(s):

Chinh Nguyen ◽

Rosemary Stockdale ◽

Helana Scheepers ◽

Jason Sargent

Keyword(s):

Big Data ◽

Rapid Development ◽

Large Data ◽

Large Data Sets ◽

Electronic Records ◽

Future Research ◽

Records Management ◽

Data Sets ◽

Interactive Nature ◽

Electronic Records Management

The rapid development of technology and interactive nature of Government 2.0 (Gov 2.0) is generating large data sets for Government, resulting in a struggle to control, manage, and extract the right information. Therefore, research into these large data sets (termed Big Data) has become necessary. Governments are now spending significant finances on storing and processing vast amounts of information because of the huge proliferation and complexity of Big Data and a lack of effective records management. On the other hand, there is a method called Electronic Records Management (ERM), for controlling and governing the important data of an organisation. This paper investigates the challenges identified from reviewing the literature for Gov 2.0, Big Data, and ERM in order to develop a better understanding of the application of ERM to Big Data to extract useable information in the context of Gov 2.0. The paper suggests that a key building block in providing useable information to stakeholders could potentially be ERM with its well established governance policies. A framework is constructed to illustrate how ERM can play a role in the context of Gov 2.0. Future research is necessary to address the specific constraints and expectations placed on governments in terms of data retention and use.

Radar Emission Sources Identification Based on Hierarchical Agglomerative Clustering for Large Data Sets

Journal of Sensors ◽

10.1155/2016/1879327 ◽

2016 ◽

Vol 2016 ◽

pp. 1-9 ◽

Cited By ~ 21

Author(s):

Janusz Dudczyk

Keyword(s):

Clustering Algorithm ◽

Large Data ◽

Large Data Sets ◽

Emission Sources ◽

Data Sets ◽

Agglomerative Clustering ◽

Distinctive Features ◽

Identification Process ◽

Hierarchical Agglomerative Clustering ◽

Repetition Interval

More advanced recognition methods, which may recognize particular copies of radars of the same type, are called identification. The identification process of radar devices is a more specialized task which requires methods based on the analysis of distinctive features. These features are distinguished from the signals coming from the identified devices. Such a process is called Specific Emitter Identification (SEI). The identification of radar emission sources with the use of classic techniques based on the statistical analysis of basic measurable parameters of a signal such as Radio Frequency, Amplitude, Pulse Width, or Pulse Repetition Interval is not sufficient for SEI problems. This paper presents the method of hierarchical data clustering which is used in the process of radar identification. The Hierarchical Agglomerative Clustering Algorithm (HACA) based on Generalized Agglomerative Scheme (GAS) implemented and used in the research method is parameterized; therefore, it is possible to compare the results. The results of clustering are presented in dendrograms in this paper. The received results of grouping and identification based on HACA are compared with other SEI methods in order to assess the degree of their usefulness and effectiveness for systems of ESM/ELINT class.
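
The abstract does not give the details of HACA, so the following is only a minimal sketch of a generic agglomerative scheme: every point starts as its own cluster, and the two closest clusters are merged repeatedly, with the linkage rule as a parameter. The function names, the two linkage options, and the sample points are assumptions for illustration.

```python
from itertools import combinations
from math import dist


def linkage_distance(a, b, linkage="single"):
    """Distance between two clusters: closest pair (single) or farthest (complete)."""
    pairwise = [dist(p, q) for p in a for q in b]
    return min(pairwise) if linkage == "single" else max(pairwise)


def agglomerate(points, n_clusters, linkage="single"):
    """Merge the two closest clusters until n_clusters remain.

    Returns the final clusters and the distance of each merge, in order;
    those distances are the heights at which a dendrogram joins its branches.
    """
    clusters = [[p] for p in points]
    heights = []
    while len(clusters) > n_clusters:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage_distance(clusters[ij[0]], clusters[ij[1]], linkage))
        heights.append(linkage_distance(clusters[i], clusters[j], linkage))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters, heights


# Hypothetical 1-D feature values forming two groups.
samples = [(0.0,), (1.0,), (10.0,), (12.0,)]
clusters, heights = agglomerate(samples, n_clusters=2)
result = sorted(sorted(c) for c in clusters)
```

This naive version recomputes all pairwise cluster distances on every merge, so it is only suitable for small inputs; implementations aimed at large data sets usually cache and update a distance matrix instead.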