Available techniques in Hadoop small file issue

Author(s):  
M. B. Masadeh ◽  
M. S. Azmi ◽  
S. S. S. Ahmad

Hadoop has been an optimal solution for storing and processing big data since its release in late 2006. Hadoop processes data in a master-slave manner [1]: a large job is split into many small tasks that are processed separately, a technique adopted instead of pushing one large file into a costly supermachine to extract useful information. Hadoop performs very well with large files, but when big data arrives as small files it can suffer performance problems: slow processing, delayed data access, high latency, and even a complete cluster shutdown [2]. In this paper we highlight one of Hadoop's limitations that affects data processing performance, known as "big data in small files", which occurs when a massive number of small files is pushed into a Hadoop cluster and can drive the cluster to a total shutdown. The paper also highlights native and proposed solutions for big data in small files, how they reduce the negative effects on a Hadoop cluster, and how they add extra performance to the storage and access mechanisms.
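One of the native mechanisms such surveys typically cover is the SequenceFile, which packs many small files into one large container file so the NameNode tracks a single entry instead of thousands. The sketch below is a minimal illustration of that idea, assuming a local source directory passed as an argument and a hypothetical target path; it uses the standard Hadoop SequenceFile writer API.

```java
import java.io.File;
import java.nio.file.Files;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

/**
 * Packs many local small files into one HDFS SequenceFile
 * (file name as key, raw bytes as value), so the NameNode
 * holds metadata for one large file instead of many small ones.
 */
public class SmallFilePacker {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path target = new Path("hdfs:///user/demo/packed.seq"); // assumed target path

        SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(target),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class));
        try {
            // Assumes args[0] names an existing local directory of small files.
            for (File f : new File(args[0]).listFiles()) {
                byte[] data = Files.readAllBytes(f.toPath());
                writer.append(new Text(f.getName()), new BytesWritable(data));
            }
        } finally {
            IOUtils.closeStream(writer);
        }
    }
}
```

Another native option is the Hadoop Archive tool, invoked from the command line as, for example, `hadoop archive -archiveName data.har -p /user/demo/input /user/demo/output`, which layers an index over the packed files at the cost of an extra index lookup per read.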

2018 ◽  
Vol 7 (2.31) ◽  
pp. 19 ◽  
Author(s):  
K S. Shraddha Bollamma ◽  
S Manishankar ◽  
M V. Vishnu

Processing huge volumes of data has become a critical task in the Internet age; even though data processing has evolved to a next-generation level, data processing and information extraction still pose many unsolved problems. As data sizes grow, retrieving useful information within a given span of time is a herculean task. The most widely adopted solution is a distributed computing environment that supports data processing with a suitable model architecture for large, complex structures. Although processing has improved considerably, efficiency, energy utilization and accuracy have been compromised. This research proposes an efficient environment for data processing with optimized energy utilization and increased performance. The Hadoop environment, common and popular among big data processing platforms, was chosen as the base for enhancement: a multi-node Hadoop cluster is created, on top of which an efficient cluster monitor is set up and an algorithm to manage the efficiency of the cluster is formulated. The cluster monitor incorporates ZooKeeper and YARN (node and resource managers). ZooKeeper monitors the cluster nodes of the distributed system and identifies critical performance problems, while YARN manages resources efficiently and controls the nodes with the help of a hybrid scheduler algorithm. This integrated platform thus helps in monitoring the distributed cluster as well as improving the performance of overall big data processing.
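The monitoring side of such a design rests on ZooKeeper's watch mechanism: each worker keeps an ephemeral znode alive, and the monitor is notified when membership changes. The following is a rough sketch of that pattern only, not the paper's implementation; the znode path /cluster/nodes and the connection string are assumptions.

```java
import java.util.List;

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

/**
 * Watches an assumed /cluster/nodes path where each worker holds an
 * ephemeral znode; when a node fails, its znode vanishes and the
 * watcher fires, revealing the membership change to the monitor.
 */
public class ClusterMonitor implements Watcher {
    private final ZooKeeper zk;

    public ClusterMonitor(String connectString) throws Exception {
        this.zk = new ZooKeeper(connectString, 3000, this);
    }

    public void watchNodes() throws Exception {
        // ZooKeeper watches are one-shot, so re-register on every call.
        List<String> alive = zk.getChildren("/cluster/nodes", true);
        System.out.println("Live nodes: " + alive);
    }

    @Override
    public void process(WatchedEvent event) {
        if (event.getType() == Event.EventType.NodeChildrenChanged) {
            try {
                watchNodes(); // membership changed: a node joined or failed
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

    public static void main(String[] args) throws Exception {
        new ClusterMonitor("localhost:2181").watchNodes();
        Thread.sleep(Long.MAX_VALUE); // stay alive to receive watch events
    }
}
```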


Author(s):  
Lucas M. Ponce ◽  
Walter dos Santos ◽  
Wagner Meira ◽  
Dorgival Guedes ◽  
Daniele Lezzi ◽  
...  

High-performance computing (HPC) and massive data processing (Big Data) are two trends that are beginning to converge. In that process, aspects of hardware architectures, systems support and programming paradigms are being revisited from both perspectives. This paper presents our experience on this path of convergence with the proposal of a framework that addresses some of the programming issues derived from such integration. Our contribution is the development of an environment that integrates (i) COMPSs, a programming framework for the development and execution of parallel applications for distributed infrastructures; (ii) Lemonade, a data mining and analysis tool; and (iii) HDFS, the most widely used distributed file system for Big Data systems. To validate our framework, we used Lemonade to create COMPSs applications that access data through HDFS, and compared them with equivalent applications built with Spark, a popular Big Data framework. The results show that the HDFS integration benefits COMPSs by simplifying data access and by rearranging data transfers, reducing execution time. The integration with Lemonade facilitates COMPSs's use and may help its popularization in the Data Science community, by providing efficient algorithm implementations for experts from the data domain who want to develop applications with a higher level of abstraction.
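The paper does not spell out its API, but the integration point it describes, COMPSs tasks reading data through HDFS, rests on the standard HDFS Java client. A minimal sketch of that access pattern follows; the NameNode address and input path are assumptions for illustration.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Minimal HDFS read via the standard Java client, the kind of data
 * access a COMPSs/HDFS integration layer would wrap for task code.
 */
public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf); // assumed address
        try (FSDataInputStream in = fs.open(new Path("/data/input.txt")); // assumed path
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```

The "rearranging data transfers" benefit plausibly builds on HDFS exposing block locations to the scheduler (via the FileSystem getFileBlockLocations call), which lets a runtime place tasks near their data, though the paper's exact mechanism is not shown here.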


2018 ◽  
Vol 173 ◽  
pp. 03047
Author(s):  
Zhao Li ◽  
Shuiyuan Huan

New big data processing scenarios raise many security threats, such as data confidentiality and privacy protection, while current research on big data access control suffers from problems such as coarse granularity and low sharing capability. A new model supporting fine-grained access control and flexible attribute change is therefore proposed. Based on the CP-ABE method, a multi-level attribute-based encryption scheme is designed to solve the fine-grained access control problem, and to solve the problem of attribute revocation, re-encryption and version-number tagging are integrated into the scheme. The analysis shows that the proposed scheme meets the security requirements of access control in big data processing environments and has an advantage in computational overhead compared with previous schemes.
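The revocation idea can be pictured as follows. This is a purely hypothetical sketch of version-number tagging; none of these types come from the paper or from a real CP-ABE library, and the actual scheme operates on pairing-based ciphertext components rather than plain objects.

```java
/**
 * Hypothetical illustration of version-number based attribute
 * revocation: on revocation the authority bumps the attribute's
 * version and re-encrypts the affected ciphertext component, so
 * user keys tagged with the old version no longer decrypt.
 */
public class VersionedAttribute {
    private final String name;
    private int version = 1;

    public VersionedAttribute(String name) { this.name = name; }

    /** Called by the authority when some user's attribute is revoked. */
    public void revoke(Ciphertext ct) {
        version++;                   // invalidates keys tagged with the old version
        ct.reEncrypt(name, version); // only the affected component is re-encrypted
    }

    /** Decryption proceeds only if the key's version tag is current. */
    public boolean keyIsCurrent(UserKey key) {
        return key.versionOf(name) == version;
    }
}

/** Placeholder interfaces standing in for real CP-ABE structures. */
interface Ciphertext { void reEncrypt(String attribute, int newVersion); }
interface UserKey   { int versionOf(String attribute); }
```

The appeal of this design, as the abstract suggests, is that revoking one user requires re-encrypting only the components tied to the changed attribute instead of re-issuing every key in the system.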


2021 ◽  
Author(s):  
Vijay Shankar Sharma ◽  
N.C Barwar

Nowadays, data is increasing exponentially with advances in data science. Every digital footprint generates an enormous amount of data, which is further processed to extract useful information for different end-user applications. A number of technologies are available to handle such enormous amounts of data; Hadoop/HDFS is one of the big data handling technologies. HDFS can easily handle large files, but when it must deal with a massive number of small files, its performance degrades. In this paper we propose a novel technique, Hash Based Archive File (HBAF), that can solve the small file problem of HDFS. The proposed technique is capable of reading the final index files partly, which reduces the memory load on the NameNode, and it offers file-appending capability after creation of the archive.
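The key mechanism, reading the index only partly, can be pictured as a hash-partitioned index. The sketch below is purely illustrative and is not the paper's on-disk format; the bucket count, bucket size and layout are all assumptions. It shows why a lookup touches a single bucket instead of loading the whole index into memory.

```java
import java.io.RandomAccessFile;

/**
 * Hypothetical hash-partitioned archive index: the index file is
 * divided into fixed-size buckets, a lookup hashes the file name to
 * pick a bucket, and only that bucket is read from disk.
 */
public class HashedArchiveIndex {
    private static final int BUCKETS = 1024;       // assumed bucket count
    private static final int BUCKET_BYTES = 4096;  // assumed fixed bucket size

    private final RandomAccessFile index;

    public HashedArchiveIndex(String indexPath) throws Exception {
        this.index = new RandomAccessFile(indexPath, "r");
    }

    /** Reads only the one bucket that can contain the entry for fileName. */
    public byte[] lookupBucket(String fileName) throws Exception {
        int bucket = Math.floorMod(fileName.hashCode(), BUCKETS);
        byte[] raw = new byte[BUCKET_BYTES];
        index.seek((long) bucket * BUCKET_BYTES);
        index.readFully(raw);
        return raw; // entries inside the bucket are then scanned for fileName
    }
}
```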


2019 ◽  
Vol 12 (1) ◽  
pp. 42 ◽  
Author(s):  
Andrey I. Vlasov ◽  
Konstantin A. Muraviev ◽  
Alexandra A. Prudius ◽  
Demid A. Uzenkov
