Hadoop Setup

Apache Hadoop is an open-source framework for storing and processing massive amounts of data. At its core, Hadoop distributes computation across a cluster of computers. This chapter covers the single-node and multi-node setup of a Hadoop environment, along with the Hadoop user commands and administration commands. Hadoop processes data on a cluster of machines built from commodity hardware. It has two main components: the Hadoop Distributed File System (HDFS) for storage and MapReduce/YARN for processing. Single-node processing can be done in standalone or pseudo-distributed mode, whereas multi-node processing uses cluster mode. The execution procedure for each environment is briefly stated. The chapter then explores the Hadoop user commands for operations such as copying files to and from the distributed file system, running jars, creating archives, checking the version and classpath, and so on. Finally, Hadoop administration commands manage the configuration, including functions such as balancing the cluster, running dfsadmin and MapReduce admin, and managing the namenode and secondary namenode.
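As a minimal illustration of the user commands mentioned above, the following Java sketch uses the org.apache.hadoop.fs.FileSystem API to perform the programmatic equivalent of the hadoop fs -put and hadoop fs -get copy operations; the paths and class name are placeholders chosen for this example.

```java
// Minimal sketch: copying a file to and from HDFS with the FileSystem API.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCopyExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);       // connects to the configured NameNode

        // Equivalent of: hadoop fs -put localfile.txt /user/demo/localfile.txt
        fs.copyFromLocalFile(new Path("localfile.txt"), new Path("/user/demo/localfile.txt"));

        // Equivalent of: hadoop fs -get /user/demo/localfile.txt copy.txt
        fs.copyToLocalFile(new Path("/user/demo/localfile.txt"), new Path("copy.txt"));

        fs.close();
    }
}
```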

With the advent of IoT, numerous IoT devices are deployed across cities to acquire data. These devices generate enormous amounts of data, and analyzing it would normally require configuring new hardware to scale up existing servers and developing applications on a purpose-built framework. This work recommends an adapted scale-out approach in which huge multi-dimensional datasets can be processed using existing commodity hardware. In this approach, the Hadoop Distributed File System (HDFS) holds the huge multi-dimensional data, which is then processed and analyzed using the MapReduce (MR) framework. In the proposed approach, we implemented an optimized repartitioned K-Means centroid-based partitioning clustering algorithm using the MR framework on a Smart City dataset. This dataset contains 10 million objects, each with six attributes. The results show that the proposed approach scales well and computes intra-cluster density and inter-cluster density effectively.
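To make the MapReduce formulation concrete, the sketch below shows a single, simplified K-Means iteration as a Hadoop job: the mapper assigns each point to its nearest centroid and the reducer recomputes each centroid as the mean of its assigned points. This is not the authors' optimized repartitioned variant; the configuration key kmeans.centroids, the semicolon-separated centroid encoding, and the class names are assumptions made for illustration.

```java
// Simplified sketch of one K-Means iteration on Hadoop MapReduce.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class KMeansIteration {

    /** Map: assign each input point (comma-separated attributes) to its nearest centroid. */
    public static class AssignMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
        private final List<double[]> centroids = new ArrayList<>();

        @Override
        protected void setup(Context context) {
            Configuration conf = context.getConfiguration();
            // Centroids passed as "x1,x2,...;y1,y2,...;..." in the job configuration.
            for (String c : conf.get("kmeans.centroids").split(";")) {
                String[] parts = c.split(",");
                double[] v = new double[parts.length];
                for (int i = 0; i < parts.length; i++) v[i] = Double.parseDouble(parts[i]);
                centroids.add(v);
            }
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split(",");
            double[] p = new double[parts.length];
            for (int i = 0; i < parts.length; i++) p[i] = Double.parseDouble(parts[i]);

            int best = 0;
            double bestDist = Double.MAX_VALUE;
            for (int c = 0; c < centroids.size(); c++) {
                double d = 0;
                for (int i = 0; i < p.length; i++) {
                    double diff = p[i] - centroids.get(c)[i];
                    d += diff * diff;       // squared Euclidean distance
                }
                if (d < bestDist) { bestDist = d; best = c; }
            }
            context.write(new IntWritable(best), value);
        }
    }

    /** Reduce: recompute each centroid as the mean of the points assigned to it. */
    public static class RecomputeReducer extends Reducer<IntWritable, Text, IntWritable, Text> {
        @Override
        protected void reduce(IntWritable key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            double[] sum = null;
            long count = 0;
            for (Text v : values) {
                String[] parts = v.toString().split(",");
                if (sum == null) sum = new double[parts.length];
                for (int i = 0; i < parts.length; i++) sum[i] += Double.parseDouble(parts[i]);
                count++;
            }
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < sum.length; i++) {
                if (i > 0) sb.append(',');
                sb.append(sum[i] / count);
            }
            context.write(key, new Text(sb.toString()));
        }
    }
}
```

A driver would run this job repeatedly, feeding each iteration's reducer output back in as the next set of centroids until they stop moving.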


2019 ◽  
Vol 15 (S367) ◽  
pp. 464-466
Author(s):  
Paul Bartus

Abstract: During the last years, the amount of data has skyrocketed. As a consequence, data has become more expensive to store than to generate. The storage needs for astronomical data are also following this trend. Storage systems in astronomy contain redundant copies of data, such as identical files or identical sub-file regions. We propose the use of the Hadoop Distributed and Deduplicated File System (HD2FS) in astronomy. HD2FS is a deduplication storage system that was created to improve data storage capacity and efficiency in distributed file systems without compromising input/output performance. HD2FS can be developed by modifying existing storage system environments such as the Hadoop Distributed File System. By taking advantage of deduplication technology, we can better manage the underlying redundancy of astronomical data and reduce the space needed to store these files, thus allowing for more capacity per volume.
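The sketch below illustrates the general fixed-size-chunk deduplication idea that systems such as HD2FS build on, not HD2FS itself: chunks are fingerprinted with SHA-256 and a chunk is kept only once per distinct digest. The 4 MiB chunk size and the in-memory store are illustrative simplifications.

```java
// Illustrative chunk-level deduplication by content fingerprint (not HD2FS code).
import java.io.FileInputStream;
import java.io.IOException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

public class ChunkDeduplicator {
    private static final int CHUNK_SIZE = 4 * 1024 * 1024;     // 4 MiB chunks (illustrative)
    private final Map<String, byte[]> store = new HashMap<>(); // digest -> unique chunk

    public void addFile(String path) throws IOException, NoSuchAlgorithmException {
        MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
        try (FileInputStream in = new FileInputStream(path)) {
            byte[] buf = new byte[CHUNK_SIZE];
            int read;
            while ((read = in.read(buf)) > 0) {
                byte[] chunk = java.util.Arrays.copyOf(buf, read);
                String digest = java.util.Base64.getEncoder().encodeToString(sha256.digest(chunk));
                store.putIfAbsent(digest, chunk);               // duplicate chunks stored once
            }
        }
    }

    public int uniqueChunks() { return store.size(); }
}
```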


2018 ◽  
Vol 7 (2.6) ◽  
pp. 221
Author(s):  
O Achandair ◽  
S Bourekkadi ◽  
E Elmahouti ◽  
S Khoulji ◽  
M L. Kerkeb

The Hadoop Distributed File System (HDFS) is designed to reliably store very large files across the machines of a large cluster. It is one of the most widely used distributed file systems and offers high availability and scalability on low-cost hardware. Every Hadoop framework has HDFS as its storage component. Coupled with MapReduce, the processing component, HDFS has become the standard platform for big data management today. By design, HDFS handles huge numbers of large files well, but it is much less effective when deployed to handle large numbers of small files. This paper puts forward a new strategy for managing small files. The approach consists of two principal phases. The first phase consolidates a client's input files, storing them contiguously in a single allocated block in SequenceFile format, and continuing into subsequent blocks as needed. In this way we avoid allocating separate blocks for different streams, which reduces requests for available blocks and also reduces the metadata memory on the NameNode: a group of small files packaged in a SequenceFile on the same block requires one entry instead of one entry per small file. The second phase analyzes the attributes of the stored small files so that they can be distributed in such a way that the most frequently accessed files are referenced by an additional index, in MapFile format, to improve read throughput during random access.
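As a rough sketch of the first phase, the following Java program packs a list of small local files into one HDFS SequenceFile, using the original file name as the key and the raw bytes as the value. The output path and class name are placeholders, and the MapFile indexing of the second phase is not shown.

```java
// Sketch: consolidating many small files into a single SequenceFile on HDFS.
import java.io.File;
import java.nio.file.Files;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFilePacker {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path target = new Path("/user/demo/packed.seq");        // illustrative output path

        SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(target),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class));
        try {
            for (String name : args) {                           // each arg is a small local file
                byte[] content = Files.readAllBytes(new File(name).toPath());
                writer.append(new Text(name), new BytesWritable(content));
            }
        } finally {
            IOUtils.closeStream(writer);
        }
    }
}
```

The packed file can later be read back with SequenceFile.Reader, and the second phase's MapFile would add a sorted, indexed view over the keys of the most frequently accessed entries.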


Author(s):  
Mariam J. AlKandari ◽  
Huda F. Al Rasheedi ◽  
Ayed A. Salman

Abstract—Cloud computing has been the trending model for storing, accessing, and modifying data over the Internet in recent years. The rising use of the cloud has generated a new concept: cloud forensics. Cloud forensics can be defined as investigating for evidence in the cloud, so it can be viewed as a combination of cloud computing and digital forensics. Many issues with applying forensics in the cloud have been addressed. Isolating the location of an incident has become an essential part of the forensic process; this is done to ensure that evidence will not be modified or changed. Isolating an instance in cloud computing is even more challenging due to the nature of the cloud environment, where the same storage or virtual machine is shared by many users. Hence, the evidence will most likely be overwritten and lost. The solution proposed in this paper is to isolate a cloud instance by marking the instance residing on the servers as "Under Investigation". To do so, the cloud file system must be studied. One of the well-known file systems used in the cloud is the Apache Hadoop Distributed File System (HDFS). Thus, the methodology used in this paper for isolating a cloud instance is based on the HDFS architecture.
Keywords: cloud computing; digital forensics; cloud forensics
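One conceivable way to realize such an "Under Investigation" marking on HDFS, offered here only as an illustration and not as the paper's implementation, is to attach an extended attribute to the instance's directory via FileSystem.setXAttr. The attribute name and path below are hypothetical.

```java
// Sketch: tagging an HDFS path with a forensic marker via extended attributes.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ForensicMarker {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path instanceRoot = new Path("/cloud/instances/vm-042");   // hypothetical instance data

        // Mark the instance directory; forensic tooling can later filter on this attribute.
        fs.setXAttr(instanceRoot, "user.forensics.status",
                "UNDER_INVESTIGATION".getBytes("UTF-8"));

        // Read the marker back to verify it.
        byte[] status = fs.getXAttr(instanceRoot, "user.forensics.status");
        System.out.println(new String(status, "UTF-8"));
        fs.close();
    }
}
```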


2019 ◽  
Author(s):  
Rhauani W. Fazul ◽  
Patricia Pitthan Barcelos

Data replication is a fundamental mechanism of the Hadoop Distributed File System (HDFS). However, the way data is spread across the cluster directly affects replication balancing. The HDFS Balancer is a tool integrated into Hadoop that can balance the storage load on each machine by moving data between nodes, but its operation does not address the specific needs of applications while performing block rearrangement. This paper proposes a customized balancing policy for the HDFS Balancer based on a system of priorities that can be adapted and configured according to usage demands. The priorities define whether HDFS parameters or the cluster topology should be considered during the operation, thus making the balancing more flexible.
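The snippet below is a purely hypothetical sketch of how a priority-weighted ordering of candidate nodes might look, with operator-chosen weights for disk utilization and rack locality. The DataNodeInfo type and the weights are inventions for illustration and are not part of the HDFS Balancer API or the authors' policy.

```java
// Hypothetical priority scoring for choosing which nodes to move blocks from first.
import java.util.Comparator;
import java.util.List;

public class PriorityBalancingPolicy {
    /** Minimal stand-in for the per-node statistics a balancer would consult. */
    public static class DataNodeInfo {
        final String host;
        final double utilization;       // fraction of DFS capacity used, 0.0 - 1.0
        final boolean sameRackAsTarget;
        public DataNodeInfo(String host, double utilization, boolean sameRackAsTarget) {
            this.host = host;
            this.utilization = utilization;
            this.sameRackAsTarget = sameRackAsTarget;
        }
    }

    /** Higher score = more urgent to move blocks away from this node. */
    public static double score(DataNodeInfo n, double utilWeight, double topologyWeight) {
        double topologyBonus = n.sameRackAsTarget ? 1.0 : 0.0;  // prefer cheap intra-rack moves
        return utilWeight * n.utilization + topologyWeight * topologyBonus;
    }

    public static void sortByPriority(List<DataNodeInfo> nodes,
                                      double utilWeight, double topologyWeight) {
        nodes.sort(Comparator.comparingDouble(
                (DataNodeInfo n) -> score(n, utilWeight, topologyWeight)).reversed());
    }
}
```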


2018 ◽  
Vol 210 ◽  
pp. 04042
Author(s):  
Ammar Alhaj Ali ◽  
Pavel Varacha ◽  
Said Krayem ◽  
Roman Jasek ◽  
Petr Zacek ◽  
...  

Nowadays, a wide range of systems and applications, especially in high-performance computing, depend on distributed environments to process and analyze huge amounts of data. Since the amount of data is increasing enormously, providing efficient, scalable, and reliable storage solutions has become one of the major issues in scientific computing. The storage solution used by big data systems is the Distributed File System (DFS), which builds a hierarchical and unified view of multiple file servers and shares on the network. In this paper we present the Hadoop Distributed File System (HDFS) as the DFS of big data systems and Event-B as a formal method that can be used to model it. Event-B is a mature formal method that has been widely used in industrial projects across domains such as automotive, transportation, space, business information, and medical devices. We propose using Rodin as the modeling tool for Event-B; the Rodin platform integrates modeling and proving, is open source, and supports a large number of plug-in tools.


2020 ◽  
Vol 7 (1) ◽  
Author(s):  
Gousiya Begum ◽  
S. Zahoor Ul Huq ◽  
A. P. Siva Kumar

Abstract Extensive usage of Internet-based applications in day-to-day life has led to the generation of huge amounts of data every minute. Apart from humans, data is generated by machines such as sensors, satellites, CCTV, etc. This huge collection of heterogeneous data is often referred to as Big Data, which can be processed to draw useful insights. Apache Hadoop has emerged as a widely used open-source software framework for Big Data processing; it runs on a cluster of cooperating computers, enabling distributed parallel processing. The Hadoop Distributed File System is used to store data blocks, replicated and spanned across different nodes. HDFS uses AES-based cryptographic techniques at the block level, which are transparent and end-to-end in nature. However, while cryptography protects the data blocks from unauthorized access, a legitimate user can still harm the data. One such example is the execution of malicious MapReduce jar files by a legitimate user, which can harm the data in HDFS. We developed a mechanism in which every MapReduce jar is tested by our sandbox security layer to ensure that the jar is not malicious, and suspicious jar files are not allowed to process the data in HDFS. This feature is not present in the existing Apache Hadoop framework, and our work is made available on GitHub for consideration and inclusion in future versions of Apache Hadoop.
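As a deliberately simplified illustration of pre-submission vetting, and not the authors' sandbox, the following check scans the entries of a submitted jar and rejects it if it contains class files outside an operator-approved package whitelist. The allowed prefixes are placeholders.

```java
// Hypothetical pre-submission jar check: reject jars with classes outside a whitelist.
import java.io.IOException;
import java.util.Enumeration;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

public class JarVetter {
    private static final String[] ALLOWED_PREFIXES = {     // illustrative whitelist
            "com/example/analytics/", "org/apache/hadoop/"
    };

    public static boolean isAllowed(String jarPath) throws IOException {
        try (JarFile jar = new JarFile(jarPath)) {
            Enumeration<JarEntry> entries = jar.entries();
            while (entries.hasMoreElements()) {
                JarEntry entry = entries.nextElement();
                if (!entry.getName().endsWith(".class")) continue;
                boolean ok = false;
                for (String prefix : ALLOWED_PREFIXES) {
                    if (entry.getName().startsWith(prefix)) { ok = true; break; }
                }
                if (!ok) return false;                      // unknown class: reject the jar
            }
        }
        return true;
    }
}
```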


2014 ◽  
Vol 602-605 ◽  
pp. 3282-3284
Author(s):  
Fa Gui Liu ◽  
Xiao Jie Zhang

Distributed file systems such as HDFS face the threat of Advanced Persistent Threats (APT). Although security mechanisms such as Kerberos and ACLs are implemented in distributed file systems, most of them are not sufficient to counter the threats posed by APT. Based on observations of APT traits, we propose a trusted distributed file system based on HDFS, which provides a further layer of security against APT compared to the current security mechanisms.


2014 ◽  
Vol 998-999 ◽  
pp. 1362-1365
Author(s):  
Wei Feng Gao ◽  
Tie Zhu Zhao ◽  
Ming Bin Lin

Distributed file systems are emerging as a key component of large-scale cloud storage platforms due to the continuous growth in the amount of application data. Performance modeling and analysis are important concerns in the distributed file system area. This paper focuses on performance prediction and modeling issues. An adaptive prediction model (APModel) is proposed to predict the performance of distributed file systems by capturing the correlations among different performance factors. We perform a series of experiments to validate the proposed prediction model. The experimental results indicate that the proposed approach achieves better prediction accuracy; it is practical and enables sufficient performance analysis of distributed file systems.
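APModel itself is not reproduced here; the sketch below shows only the generic idea of factor-based prediction, fitting a least-squares line between one measured factor and observed throughput and then extrapolating. The class name, the single-factor simplification, and the choice of predictor are assumptions made for illustration.

```java
// Generic illustration of factor-based performance prediction via simple least squares.
public class ThroughputPredictor {
    private double slope;
    private double intercept;

    /** Ordinary least squares over paired observations (factor value, measured throughput). */
    public void fit(double[] factor, double[] throughput) {
        int n = factor.length;
        double meanX = 0, meanY = 0;
        for (int i = 0; i < n; i++) { meanX += factor[i]; meanY += throughput[i]; }
        meanX /= n;
        meanY /= n;
        double num = 0, den = 0;
        for (int i = 0; i < n; i++) {
            num += (factor[i] - meanX) * (throughput[i] - meanY);
            den += (factor[i] - meanX) * (factor[i] - meanX);
        }
        slope = num / den;
        intercept = meanY - slope * meanX;
    }

    /** Predict throughput for an unseen factor value, e.g. a larger client count. */
    public double predict(double factorValue) {
        return intercept + slope * factorValue;
    }
}
```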

