Hadoop History and Architecture

As the name indicates, this chapter explains the evolution of Hadoop. Doug Cutting started a text search library called Lucene; after it joined the Apache Software Foundation, he built a web crawler called Apache Nutch on top of it. The Google File System was then taken as the reference for the Nutch Distributed File System, Google's MapReduce model was integrated, and the result became Hadoop. The whole path from Lucene to Apache Hadoop is illustrated in this chapter. The different versions of Hadoop are also explained, along with the procedure to download the software and the mechanism to verify the downloaded packages. The architecture of Hadoop is then detailed: a Hadoop cluster is a set of commodity machines grouped together, and the arrangement of these machines across racks is shown. After reading this chapter, the reader will understand how Hadoop evolved and its overall architecture.

2019 ◽  
Author(s):  
Rhauani W. Fazul ◽  
Patricia Pitthan Barcelos

Data replication is a fundamental mechanism of the Hadoop Distributed File System (HDFS). However, the way data is spread across the cluster directly affects replication balancing. The HDFS Balancer is a tool integrated into Hadoop that balances the storage load across machines by moving data between nodes, although its operation does not address the specific needs of applications while performing block rearrangement. This paper proposes a customized balancing policy for the HDFS Balancer based on a system of priorities, which can be adapted and configured according to usage demands. The priorities define whether HDFS parameters or the cluster topology should be considered during the operation, making the balancing more flexible.
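The paper's policy implementation is not reproduced here; the following is only a minimal Java sketch of the general idea of priority-driven balancing, in which over-utilized DataNodes are ordered for rebalancing either purely by storage utilization or with cluster topology (rack placement) weighed first. All class and field names (NodeUsage, BalancingPriority, the placeholder hosts and racks) are hypothetical and are not part of Hadoop's Balancer API.

```java
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of priority-based selection of over-utilized nodes.
// Names are illustrative only and do not correspond to the paper's code
// or to Hadoop's Balancer internals.
public class PriorityBalancingSketch {

    // Usage summary for a single DataNode.
    record NodeUsage(String host, String rack, double utilization) {}

    // Priorities a policy could weigh when ordering move candidates.
    enum BalancingPriority { STORAGE_UTILIZATION, RACK_LOCALITY }

    // Orders nodes so the most over-utilized are rebalanced first; when
    // topology is prioritized, nodes on the target rack are preferred.
    static Comparator<NodeUsage> comparatorFor(BalancingPriority priority, String targetRack) {
        Comparator<NodeUsage> byUtilization =
                Comparator.comparingDouble(NodeUsage::utilization).reversed();
        if (priority == BalancingPriority.RACK_LOCALITY) {
            return Comparator.<NodeUsage, Boolean>comparing(n -> !n.rack().equals(targetRack))
                    .thenComparing(byUtilization);
        }
        return byUtilization;
    }

    public static void main(String[] args) {
        List<NodeUsage> nodes = List.of(
                new NodeUsage("dn1", "/rack1", 0.92),
                new NodeUsage("dn2", "/rack2", 0.75),
                new NodeUsage("dn3", "/rack1", 0.60));
        nodes.stream()
                .sorted(comparatorFor(BalancingPriority.RACK_LOCALITY, "/rack1"))
                .forEach(n -> System.out.println(n.host() + " " + n.utilization()));
    }
}
```

A configurable policy of this kind would sit in front of the block-move step, deciding which source nodes (and which blocks) to rearrange first according to the configured priority.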


Author(s):  
Mariam J. AlKandari ◽  
Huda F. Al Rasheedi ◽  
Ayed A. Salman

Abstract—Cloud computing has been the trending model for storing, accessing, and modifying data over the Internet in recent years. The rising use of the cloud has generated a related concept: cloud forensics. Cloud forensics can be defined as investigating for evidence in the cloud, so it can be viewed as a combination of cloud computing and digital forensics. Many issues of applying forensics in the cloud have been addressed. Isolating the location of the incident has become an essential part of the forensic process, done to ensure that evidence is not modified or tampered with. Isolating an instance in cloud computing is even more challenging because of the nature of the cloud environment: the same storage or virtual machine may be used by many users, so evidence is likely to be overwritten and lost. The solution proposed in this paper is to isolate a cloud instance by marking the instance residing on the servers as "Under Investigation". To do so, the cloud file system must be studied. One of the well-known file systems used in the cloud is the Apache Hadoop Distributed File System (HDFS). Thus, in this paper the methodology for isolating a cloud instance is based on the HDFS architecture.
Keywords: cloud computing; digital forensics; cloud forensics
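The paper does not publish code for the "Under Investigation" marker; the sketch below only illustrates one way such a flag could be attached to instance data in HDFS, using Hadoop's standard FileSystem extended-attribute API (setXAttr/getXAttr). The NameNode URI, the instance path, and the attribute name are placeholders assumed for illustration.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Illustrative sketch: mark an HDFS directory holding a cloud instance's data
// as "Under Investigation" via an extended attribute in the user namespace,
// so other tooling can check the flag before touching the data.
public class IsolationMarkerSketch {

    private static final String MARKER = "user.forensics.status";  // placeholder name

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
        Path suspectInstance = new Path("/cloud/instances/vm-1234");  // placeholder path

        // Tag the instance data so it is treated as evidence.
        fs.setXAttr(suspectInstance, MARKER,
                "Under Investigation".getBytes(StandardCharsets.UTF_8));

        // Later, an investigator or a policy hook can read the flag back.
        byte[] status = fs.getXAttr(suspectInstance, MARKER);
        System.out.println(new String(status, StandardCharsets.UTF_8));

        fs.close();
    }
}
```

In practice such a flag would be combined with access controls (for example, read-only permissions on the marked path) so the isolated data cannot be altered while the investigation is in progress.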


2014 ◽  
Vol 36 (5) ◽  
pp. 1047-1064 ◽  
Author(s):  
Bin LIAO ◽  
Jiong YU ◽  
Tao ZHANG ◽  
Xing-Yao YANG

2010 ◽  
Vol 33 (10) ◽  
pp. 1873-1880 ◽  
Author(s):  
Chun-Cong XU ◽  
Xiao-Meng HUANG ◽  
Nuo WU ◽  
Ning-Wei SUN ◽  
Guang-Wen YANG

2010 ◽  
Vol 30 (8) ◽  
pp. 2060-2065 ◽  
Author(s):  
Ning CAO ◽  
Zhong-hai WU ◽  
Hong-zhi LIU ◽  
Qi-xun ZHANG

2020 ◽  
Vol 1444 ◽  
pp. 012012
Author(s):  
Meisuchi Naisuty ◽  
Achmad Nizar Hidayanto ◽  
Nabila Clydea Harahap ◽  
Ahmad Rosyiq ◽  
Agus Suhanto ◽  
...  
