Big Data Clustering and Hadoop Distributed File System Architecture

Applying the K-Means Algorithm in Big Raw Data Sets with Hadoop and MapReduce

Big Data Management, Technologies, and Applications - Advances in Data Mining and Database Management ◽ 10.4018/978-1-4666-4699-5.ch002 ◽ 2013 ◽ pp. 23-46

Author(s): Ilias K. Savvas ◽ Georgia N. Sofianidou ◽ M-Tahar Kechadi

Keyword(s): Big Data ◽ Clustering Algorithm ◽ File System ◽ Large Data ◽ Large Data Sets ◽ Distributed File System ◽ Data Sets ◽ Raw Data ◽ Hadoop Distributed File System ◽ Access To Data

Big data refers to data sets whose size is beyond the capabilities of most current hardware and software technologies. The Apache Hadoop software library is a framework for distributed processing of large data sets; HDFS is a distributed file system that provides high-throughput data access for data-driven applications, and MapReduce is a software framework for distributed computation over large data sets. Huge collections of raw data require fast and accurate mining processes in order to extract useful knowledge. One of the most popular data mining techniques is the K-means clustering algorithm. In this study, the authors develop a distributed version of the K-means algorithm using the MapReduce framework on the Hadoop Distributed File System. The theoretical and experimental results demonstrate the efficiency of the technique; thus, HDFS and MapReduce can be applied to big data with very promising results.
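
The abstract does not give the authors' implementation, so the following is only a minimal single-process sketch of how one K-means iteration is commonly split into a map step (assign each point to its nearest centroid) and a reduce step (average the points per centroid). The function names `kmeans_map`, `kmeans_reduce`, `kmeans_iteration`, and `kmeans` are hypothetical, and the dictionary grouping merely stands in for the shuffle phase that Hadoop would perform across nodes.

```python
from collections import defaultdict
from math import dist


def kmeans_map(point, centroids):
    """Map step: emit (index of the nearest centroid, (point, 1))."""
    nearest = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
    return nearest, (point, 1)


def kmeans_reduce(values):
    """Reduce step: average all points that were assigned to one centroid."""
    total, count = None, 0
    for point, n in values:
        total = list(point) if total is None else [a + b for a, b in zip(total, point)]
        count += n
    return tuple(s / count for s in total)


def kmeans_iteration(points, centroids):
    """Run one map/shuffle/reduce round and return the updated centroids."""
    grouped = defaultdict(list)  # stands in for the shuffle phase
    for p in points:
        key, value = kmeans_map(p, centroids)
        grouped[key].append(value)
    # A centroid with no assigned points is left where it was.
    return [kmeans_reduce(grouped[i]) if i in grouped else centroids[i]
            for i in range(len(centroids))]


def kmeans(points, centroids, max_iter=20):
    """Repeat iterations until the centroids stop changing or max_iter is hit."""
    for _ in range(max_iter):
        updated = kmeans_iteration(points, centroids)
        if updated == centroids:
            break
        centroids = updated
    return centroids
```

On a real cluster each iteration would typically be a separate MapReduce job, with the current centroids distributed to every mapper; that orchestration is outside the scope of this sketch.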

Big Data Performance Analysis on a Hadoop Distributed File System Based on Modified Partitional Clustering Algorithm

Sustainable Communication Networks and Application - Lecture Notes on Data Engineering and Communications Technologies ◽ 10.1007/978-3-030-34515-0_48 ◽ 2019 ◽ pp. 461-468

Author(s): V. Santhana Marichamy ◽ V. Natarajan

Keyword(s): Big Data ◽ Performance Analysis ◽ Clustering Algorithm ◽ File System ◽ Distributed File System ◽ Partitional Clustering ◽ Hadoop Distributed File System

A Study on Security Approaches for Big Data Hadoop Distributed File System

Journal of Engineering and Applied Sciences ◽ 10.36478/jeasci.2019.8266.8272 ◽ 2019 ◽ Vol 14 (22) ◽ pp. 8266-8272

Author(s): Leelavathi ◽ M. Elshayeb

Keyword(s): Big Data ◽ File System ◽ Distributed File System ◽ Hadoop Distributed File System

Ensure Security for Mapreduce-Hadoop Distributed File System using Encryption Method Over Big Data

International Journal of Innovative Technology and Exploring Engineering - Special Issue ◽ 10.35940/ijitee.j8883.0881019 ◽ 2019 ◽ Vol 8 (10) ◽ pp. 733-740

Keyword(s): Big Data ◽ File System ◽ Data Encryption ◽ Previous Method ◽ Distributed File System ◽ Sensitive Data ◽ Encryption And Decryption ◽ Hadoop Distributed File System ◽ Encryption Method ◽ Mapreduce Paradigm

Big data security is currently one of the most active research issues because of the growing size of the data and the complexity involved in handling such large volumes. Ensuring security in big data handling is made more difficult by its 4V characteristics. With the aim of providing security and flexible encryption on big data with reduced computation overhead, this work presents a framework with encryption (MRS) built on the Hadoop Distributed File System (HDFS). The MapReduce paradigm requires network-attached storage in addition to parallel processing, and HDFS is extensively used for storing and handling big data. The proposed method creates a framework that obtains data from the client, examines the received data, extracts the privacy policy, and then identifies the sensitive data. Security is provided by a key rotation algorithm, described as an efficient encryption and decryption technique for safeguarding big data. Data encryption protects data in storage, with the encryption key saved and accessible so that the data can be reused when required. The results indicate that the method provides greater security for very large amounts of data and gives useful information to the relevant clients, and the authors conclude that the proposed method is superior to the previous method. Finally, this research can be applied in various domains, such as health care, education, and social networking, that require stronger security and handle growing volumes of data.

Peer Review #1 of "Attribute based honey encryption algorithm for securing big data: Hadoop distributed file system perspective (v0.3)"

10.7287/peerj-cs.259v0.3/reviews/1 ◽ 2020

Author(s): P Derbeko

Keyword(s): Big Data ◽ Peer Review ◽ File System ◽ Encryption Algorithm ◽ Distributed File System ◽ System Perspective ◽ Hadoop Distributed File System

Discussion on Big Data: TDFS Vs HDFS

International Journal of Recent Technology and Engineering - 2 ◽ 10.35940/ijrte.d9497.118419 ◽ 2019 ◽ Vol 8 (4) ◽ pp. 10051-10056

Keyword(s): Big Data ◽ Cost Reduction ◽ File System ◽ Block Size ◽ Distributed File System ◽ Huge Amount ◽ Memory Block ◽ The People ◽ Hadoop Distributed File System ◽ Better Than

Big data refers to huge amounts of data that are analyzed to uncover hidden attributes. Today's technologies make it possible to analyze such data and obtain results almost immediately. Big data matters because, with Hadoop, it enables cost reduction and faster, better decision making. For example, social media platforms such as Twitter, LinkedIn, and Facebook, which are organizations in the people-to-people communication area, generate terabytes of data daily. Big data poses three main challenges: volume, variety, and velocity. This paper studies the performance of the Traditional Distributed File System (TDFS) and the Hadoop Distributed File System (HDFS). One benefit of HDFS compared with TDFS is its support for the Flume tool in Hadoop. Memory block size, data retrieval time, and security are used as metrics for evaluating the performance of TDFS and HDFS. The results show that HDFS performs better than TDFS on these metrics and is more suitable than TDFS for big data analysis.

Attribute based honey encryption algorithm for securing big data: Hadoop distributed file system perspective

PeerJ Computer Science ◽ 10.7717/peerj-cs.259 ◽ 2020 ◽ Vol 6 ◽ pp. e259

Author(s): Gayatri Kapil ◽ Alka Agrawal ◽ Abdulaziz Attaallah ◽ Abdullah Algarni ◽ Rajeev Kumar ◽ ...

Keyword(s): Big Data ◽ File System ◽ Low Cost ◽ Distributed File System ◽ File Size ◽ System Perspective ◽ Huge Data ◽ Attribute Based Encryption ◽ Hadoop Distributed File System ◽ Encryption Decryption

Hadoop has become a promising platform for reliably processing and storing big data. It provides flexible, low-cost services for huge data through Hadoop Distributed File System (HDFS) storage. Unfortunately, the absence of any inherent security mechanism in Hadoop increases the possibility of malicious attacks on the data processed or stored through Hadoop. In this scenario, securing the data stored in HDFS becomes a challenging task. Hence, researchers and practitioners have intensified their efforts to develop mechanisms that protect users' information collated in HDFS. This has led to the development of numerous encryption-decryption algorithms, but their performance decreases as the file size increases. In the present study, the authors present a methodology to address the issue of data security in Hadoop storage. The authors have integrated Attribute Based Encryption with honey encryption on Hadoop, i.e., Attribute Based Honey Encryption (ABHE). This approach works on files that are encoded inside the HDFS and decoded inside the Mapper. In addition, the authors have evaluated the proposed ABHE algorithm by performing encryption-decryption on files of different sizes and have compared it with existing algorithms, including AES and AES with OTP. The ABHE algorithm shows considerable improvement in performance during the encryption-decryption of files.

Hadoop Distributed File System (HDFS)

Advances in Data Mining and Database Management - Big Data Processing With Hadoop ◽ 10.4018/978-1-5225-3790-8.ch005 ◽ 2018 ◽ pp. 63-89

Keyword(s): Big Data ◽ File System ◽ Distributed File System ◽ Minimum Amount ◽ Master Node ◽ Cache Size ◽ Hadoop Distributed File System ◽ Work Done

Hadoop Distributed File System, popularly known as HDFS, is a Java-based distributed file system running on commodity machines. HDFS is meant for storing Big Data over distributed commodity machines and getting work done faster by processing the data in a distributed manner. HDFS has one name node (master node) and a cluster of data nodes (slave nodes). HDFS files are divided into blocks. The block is the minimum amount of data (64 MB) that can be read or written. The functions of the name node are to manage the slave nodes, maintain the file system, control client access, and control replication. To ensure the availability of the name node, a standby name node is deployed through failover control, and fencing is done to prevent the primary name node from becoming active during failover. The functions of the data nodes are to store the data, serve read and write requests, replicate the blocks, maintain the liveness of the node, enforce the storage policy, and maintain the block cache size. The data nodes also ensure the availability of data.
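
The block and replication ideas in this abstract can be illustrated with a small sketch. It assumes the 64 MB block size quoted above and a replication factor of 3; the functions `split_into_blocks` and `place_replicas` and the node names are hypothetical, and the round-robin placement is a deliberate simplification, not the placement policy a real name node applies.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the figure quoted in the abstract


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the byte length of each block for a file of file_size bytes."""
    full, remainder = divmod(file_size, block_size)
    # Every block is full-sized except possibly the last one.
    return [block_size] * full + ([remainder] if remainder else [])


def place_replicas(num_blocks, data_nodes, replication=3):
    """Toy round-robin placement: map each block index to distinct data nodes."""
    if replication > len(data_nodes):
        raise ValueError("replication factor exceeds number of data nodes")
    return {
        i: [data_nodes[(i + r) % len(data_nodes)] for r in range(replication)]
        for i in range(num_blocks)
    }


# A 200 MB file needs three full 64 MB blocks plus one 8 MB block.
blocks = split_into_blocks(200 * 1024 * 1024)
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
```

Here `placement` plays the role of the block-to-node mapping that the name node keeps as metadata, while the data nodes hold the block contents.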

Research on parallel data processing of data mining platform in the background of cloud computing

Journal of Intelligent Systems ◽ 10.1515/jisys-2020-0113 ◽ 2021 ◽ Vol 30 (1) ◽ pp. 479-486

Author(s): Lingrui Bu ◽ Hui Zhang ◽ Haiyan Xing ◽ Lijun Wu

Keyword(s): Data Mining ◽ Data Processing ◽ Parallel Algorithm ◽ Large Scale ◽ File System ◽ Large Data ◽ Distributed File System ◽ Data Set ◽ Traditional Algorithm ◽ Hadoop Distributed File System

The efficient processing of large-scale data has very important practical value. In this study, a data mining platform based on the Hadoop Distributed File System was designed, and the K-means algorithm was then improved using the idea of max-min distance. On the Hadoop Distributed File System platform, parallelization was realized with MapReduce. Finally, the data processing performance of the algorithm was analyzed with the Iris data set. The results showed that the parallel algorithm classified more samples correctly than the traditional algorithm; in a single-machine environment, the parallel algorithm ran longer; on large data sets, the traditional algorithm ran out of memory, whereas the parallel algorithm completed the calculation task; and the speedup ratio of the parallel algorithm rose as the cluster size and data set size grew, showing a good parallel effect. The experimental results verify the reliability of the parallel algorithm in big data processing and contribute to further improving the efficiency of data mining.
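
The abstract does not spell out how max-min distance is used, so the sketch below shows one common reading only: choosing initial K-means centers by starting from an arbitrary point and repeatedly adding the point whose distance to its nearest already-chosen center is largest. The function name `max_min_centers` and the choice of the first point as the seed are assumptions for illustration, not the authors' method.

```python
from math import dist


def max_min_centers(points, k):
    """Pick k initial centers using the max-min distance heuristic."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    centers = [points[0]]  # assumed seed: the first point in the data set
    while len(centers) < k:
        # For each point, find the distance to its nearest chosen center,
        # then take the point for which that distance is largest.
        farthest = max(points, key=lambda p: min(dist(p, c) for c in centers))
        centers.append(farthest)
    return centers


sample = [(0, 0), (1, 0), (10, 0), (10, 1), (5, 8)]
initial = max_min_centers(sample, 3)
```

Spreading the initial centers apart in this way tends to reduce the sensitivity of K-means to a poor random start, which is consistent with the accuracy improvement the abstract reports, although the abstract itself gives no further detail.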
