solution for the future:  small file management   by optimizing Hadoop

Hadoop Distributed File System (HDFS) is designed to reliably store very large files across machines in a large cluster. It is one of the most used distributed file systems and offer a high availability and scalability on low-cost hardware. All Hadoopframework have HDFS as their storage component. Coupled with map reduce, which is the processing component, HDFS and Map Reduce (a processing component) have become the standard platforms for any management of big data in these days. HDFS however, in terms of design has the ability to handle huge numbers of large files, but when it comes to its deployments to handle large amounts of small files it might not be very effective. This paper puts forward a new strategy of managing small files. The approach will consists of two principal phases. The first phase will deal with the consolidating of aaclients input files, storing it continuously in a particular allocated block, that is a SequenceFile format, and so on into the next blocks. In this way we avoid the use of multiple block allocations for different streams, this reduces calls for available blocks and also reduces the metadata memory on the NameNode. Note the reason for this is that groups of small files packaged in a SequenceFile on the same block require one entry instead of one of each small file. The second phase will involve analyzing the attributes of stored small files so they can be distributed them in a way that the most called files will be referenced by an additional index as a MapFile format to reduce the read throughput during random access.

Download Full-text

Hadoop Setup

Advances in Data Mining and Database Management - Big Data Processing With Hadoop ◽

10.4018/978-1-5225-3790-8.ch004 ◽

2018 ◽

pp. 45-62

Keyword(s):

Distributed Computing ◽

File Systems ◽

Distributed File System ◽

Map Reduce ◽

Distributed File Systems ◽

Single Node ◽

Apache Hadoop ◽

Commodity Hardware ◽

Open Source Framework ◽

Hadoop Distributed File System

Apache Hadoop is an open source framework for storage and processing massive amounts of data. The skeleton of Hadoop can be viewed as distributed computing across a cluster of computers. This chapter deals with the single node, multinode setup of Hadoop environment along with the Hadoop user commands and administration commands. Hadoop processes the data on a cluster of machines with commodity hardware. It has two components, Hadoop Distributed File System for storage and Map Reduce/YARN for processing. Single node processing can be done through standalone or pseudo-distributed mode whereas multinode is through cluster mode. The execution procedure for each environment is briefly stated. Then the chapter explores the Hadoop user commands for operations like copying to and from files in distributed file systems, running jar, creating archive, setting version, classpath, etc. Further, Hadoop administration manages the configuration including functions like cluster balance, running the dfs, MapReduce admin, namenode, secondary namenode, etc.

Download Full-text

Semi-join computation on distributed file systems using map-reduce-merge model

Proceedings of the 2010 ACM Symposium on Applied Computing - SAC '10 ◽

10.1145/1774088.1774174 ◽

2010 ◽

Cited By ~ 7

Author(s):

M. Al Hajj Hassan ◽

M. Bamha

Keyword(s):

File Systems ◽

Map Reduce ◽

Distributed File Systems

Download Full-text

A Comprehensive Survey for Hadoop Distributed File System

Asian Journal of Research in Computer Science ◽

10.9734/ajrcos/2021/v11i230260 ◽

2021 ◽

pp. 46-57

Author(s):

Karwan Jameel Merceedi ◽

Nareen Abdulla Sabry

Keyword(s):

Distributed Systems ◽

Data Storage ◽

File System ◽

Low Cost ◽

File Systems ◽

Cost Effective ◽

Distributed File System ◽

Software Frameworks ◽

Hadoop Distributed File System ◽

Basic Ideas

In the last few days, data and the internet have become increasingly growing, occurring in big data. For these problems, there are many software frameworks used to increase the performance of the distributed system. This software is used for available ample data storage. One of the most beneficial software frameworks used to utilize data in distributed systems is Hadoop. This software creates machine clustering and formatting the work between them. Hadoop consists of two major components: Hadoop Distributed File System (HDFS) and Map Reduce (MR). By Hadoop, we can process, count, and distribute each word in a large file and know the number of affecting for each of them. The HDFS is designed to effectively store and transmit colossal data sets to high-bandwidth user applications. The differences between this and other file systems provided are relevant. HDFS is intended for low-cost hardware and is exceptionally tolerant to defects. Thousands of computers in a vast cluster both have directly associated storage functions and user programmers. The resource scales with demand while being cost-effective in all sizes by distributing storage and calculation through numerous servers. Depending on the above characteristics of the HDFS, many researchers worked in this field trying to enhance the performance and efficiency of the addressed file system to be one of the most active cloud systems. This paper offers an adequate study to review the essential investigations as a trend beneficial for researchers wishing to operate in such a system. The basic ideas and features of the investigated experiments were taken into account to have a robust comparison, which simplifies the selection for future researchers in this subject. According to many authors, this paper will explain what Hadoop is and its architectures, how it works, and its performance analysis in a distributed systems. In addition, assessing each Writing and compare with each other.

Download Full-text

Comparative analysis of various distributed file systems & performance evaluation using map reduce implementation

2016 International Conference on Recent Advances and Innovations in Engineering (ICRAIE) ◽

10.1109/icraie.2016.7939473 ◽

2016 ◽

Cited By ~ 1

Author(s):

Madhavi Vaidya ◽

Shrinivas Deshpande

Keyword(s):

Performance Evaluation ◽

Comparative Analysis ◽

File Systems ◽

Map Reduce ◽

Distributed File Systems

Download Full-text

Using Hadoop Distributed and Deduplicated File System (HD2FS) in Astronomy

Proceedings of the International Astronomical Union ◽

10.1017/s1743921321000387 ◽

2019 ◽

Vol 15 (S367) ◽

pp. 464-466

Author(s):

Paul Bartus

Keyword(s):

Data Storage ◽

Storage Capacity ◽

File System ◽

Storage Systems ◽

Storage System ◽

File Systems ◽

Distributed File System ◽

Distributed File Systems ◽

Output Performance ◽

Hadoop Distributed File System

AbstractDuring the last years, the amount of data has skyrocketed. As a consequence, the data has become more expensive to store than to generate. The storage needs for astronomical data are also following this trend. Storage systems in Astronomy contain redundant copies of data such as identical files or within sub-file regions. We propose the use of the Hadoop Distributed and Deduplicated File System (HD2FS) in Astronomy. HD2FS is a deduplication storage system that was created to improve data storage capacity and efficiency in distributed file systems without compromising Input/Output performance. HD2FS can be developed by modifying existing storage system environments such as the Hadoop Distributed File System. By taking advantage of deduplication technology, we can better manage the underlying redundancy of data in astronomy and reduce the space needed to store these files in the file systems, thus allowing for more capacity per volume.

Download Full-text

RRAM Random Number Generator Based on Train of Pulses

Electronics ◽

10.3390/electronics10151831 ◽

2021 ◽

Vol 10 (15) ◽

pp. 1831

Author(s):

Binbin Yang ◽

Daniel Arumí ◽

Salvador Manich ◽

Álvaro Gómez-Pau ◽

Rosa Rodríguez-Montañés ◽

...

Keyword(s):

Random Number ◽

Random Number Generator ◽

Random Access ◽

Pulse Amplitude ◽

Resistive Random Access Memory ◽

Random Numbers ◽

Number Generator ◽

Security Applications ◽

Device Resistance ◽

New Strategy

In this paper, the modulation of the conductance levels of resistive random access memory (RRAM) devices is used for the generation of random numbers by applying a train of RESET pulses. The influence of the pulse amplitude and width on the device resistance is also analyzed. For each pulse characteristic, the number of pulses required to drive the device to a particular resistance threshold is variable, and it is exploited to extract random numbers. Based on this behavior, a random number generator (RNG) circuit is proposed. To assess the performance of the circuit, the National Institute of Standards and Technology (NIST) randomness tests are applied to evaluate the randomness of the bitstreams obtained. The experimental results show that four random bits are simultaneously obtained, passing all the applied tests without the need for post-processing. The presented method provides a new strategy to generate random numbers based on RRAMs for hardware security applications.

Download Full-text