Available techniques in Hadoop small file issue

Author(s):  
M. B. Masadeh ◽  
M. S. Azmi ◽  
S. S. S. Ahmad

Hadoop has been an optimal solution for storing and processing big data since its release in late 2006. Hadoop processes data in a master-slave manner [1]: a large job is split into many small tasks that are processed separately, a technique adopted instead of pushing one large file into a costly supermachine to extract useful information. Hadoop performs very well with large files, but when big data arrives as small files it can suffer performance problems: slow processing, delayed data access, high latency, and even a complete cluster shutdown [2]. In this paper we highlight one of Hadoop's limitations that affects data processing performance, known as "big data in small files", which occurs when a massive number of small files is pushed into a Hadoop cluster and can drive the cluster to a total shutdown. The paper also highlights native and proposed solutions for big data in small files, how they reduce the negative effects on a Hadoop cluster, and how they add extra performance to the storage and access mechanisms.
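One of the native mechanisms such surveys typically cover is the SequenceFile, which packs many small files into one large container file so the NameNode tracks a single entry instead of thousands. The sketch below is a minimal illustration of that idea, assuming a local source directory passed as an argument and a hypothetical target path; it uses the standard Hadoop SequenceFile writer API.

```java
import java.io.File;
import java.nio.file.Files;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

/**
 * Packs many local small files into one HDFS SequenceFile
 * (file name as key, raw bytes as value), so the NameNode
 * holds metadata for one large file instead of many small ones.
 */
public class SmallFilePacker {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path target = new Path("hdfs:///user/demo/packed.seq"); // assumed target path

        SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(target),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class));
        try {
            // Assumes args[0] names an existing local directory of small files.
            for (File f : new File(args[0]).listFiles()) {
                byte[] data = Files.readAllBytes(f.toPath());
                writer.append(new Text(f.getName()), new BytesWritable(data));
            }
        } finally {
            IOUtils.closeStream(writer);
        }
    }
}
```

Another native option is the Hadoop Archive tool, invoked from the command line as, for example, `hadoop archive -archiveName data.har -p /user/demo/input /user/demo/output`, which layers an index over the packed files at the cost of an extra index lookup per read.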

2018 ◽  
Vol 7 (2.31) ◽  
pp. 19 ◽  
Author(s):  
K S. Shraddha Bollamma ◽  
S Manishankar ◽  
M V. Vishnu

Processing huge volumes of data has become a critical task in the Internet age; even though data processing has evolved to a next-generation level, data processing and information extraction still pose many unsolved problems. As data sizes grow, retrieving useful information within a given span of time is a herculean task. The most widely adopted solution is a distributed computing environment that supports data processing with a suitable model architecture for large, complex structures. Although processing has improved considerably, efficiency, energy utilization and accuracy have been compromised. This research proposes an efficient environment for data processing with optimized energy utilization and increased performance. The Hadoop environment, common and popular among big data processing platforms, was chosen as the base for enhancement: a multi-node Hadoop cluster is created, on top of which an efficient cluster monitor is set up and an algorithm to manage the efficiency of the cluster is formulated. The cluster monitor incorporates ZooKeeper and YARN (node and resource managers). ZooKeeper monitors the cluster nodes of the distributed system and identifies critical performance problems, while YARN manages resources efficiently and controls the nodes with the help of a hybrid scheduler algorithm. This integrated platform thus helps in monitoring the distributed cluster as well as improving the performance of overall big data processing.
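The monitoring side of such a design rests on ZooKeeper's watch mechanism: each worker keeps an ephemeral znode alive, and the monitor is notified when membership changes. The following is a rough sketch of that pattern only, not the paper's implementation; the znode path /cluster/nodes and the connection string are assumptions.

```java
import java.util.List;

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

/**
 * Watches an assumed /cluster/nodes path where each worker holds an
 * ephemeral znode; when a node fails, its znode vanishes and the
 * watcher fires, revealing the membership change to the monitor.
 */
public class ClusterMonitor implements Watcher {
    private final ZooKeeper zk;

    public ClusterMonitor(String connectString) throws Exception {
        this.zk = new ZooKeeper(connectString, 3000, this);
    }

    public void watchNodes() throws Exception {
        // ZooKeeper watches are one-shot, so re-register on every call.
        List<String> alive = zk.getChildren("/cluster/nodes", true);
        System.out.println("Live nodes: " + alive);
    }

    @Override
    public void process(WatchedEvent event) {
        if (event.getType() == Event.EventType.NodeChildrenChanged) {
            try {
                watchNodes(); // membership changed: a node joined or failed
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

    public static void main(String[] args) throws Exception {
        new ClusterMonitor("localhost:2181").watchNodes();
        Thread.sleep(Long.MAX_VALUE); // stay alive to receive watch events
    }
}
```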


Author(s):  
Lucas M. Ponce ◽  
Walter dos Santos ◽  
Wagner Meira ◽  
Dorgival Guedes ◽  
Daniele Lezzi ◽  
...  

High-performance computing (HPC) and massive data processing (Big Data) are two trends that are beginning to converge. In that process, aspects of hardware architectures, systems support and programming paradigms are being revisited from both perspectives. This paper presents our experience on this path of convergence with the proposal of a framework that addresses some of the programming issues derived from such integration. Our contribution is the development of an environment that integrates (i) COMPSs, a programming framework for the development and execution of parallel applications for distributed infrastructures; (ii) Lemonade, a data mining and analysis tool; and (iii) HDFS, the most widely used distributed file system for Big Data systems. To validate our framework, we used Lemonade to create COMPSs applications that access data through HDFS, and compared them with equivalent applications built with Spark, a popular Big Data framework. The results show that the HDFS integration benefits COMPSs by simplifying data access and by rearranging data transfers, reducing execution time. The integration with Lemonade facilitates COMPSs's use and may help its popularization in the Data Science community, by providing efficient algorithm implementations for experts from the data domain who want to develop applications with a higher level of abstraction.
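The paper does not spell out its API, but the integration point it describes, COMPSs tasks reading data through HDFS, rests on the standard HDFS Java client. A minimal sketch of that access pattern follows; the NameNode address and input path are assumptions for illustration.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Minimal HDFS read via the standard Java client, the kind of data
 * access a COMPSs/HDFS integration layer would wrap for task code.
 */
public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf); // assumed address
        try (FSDataInputStream in = fs.open(new Path("/data/input.txt")); // assumed path
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```

The "rearranging data transfers" benefit plausibly builds on HDFS exposing block locations to the scheduler (via the FileSystem getFileBlockLocations call), which lets a runtime place tasks near their data, though the paper's exact mechanism is not shown here.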


2018 ◽  
Vol 173 ◽  
pp. 03047
Author(s):  
Zhao Li ◽  
Shuiyuan Huan

New big data processing scenarios raise many security threats, such as data confidentiality and privacy protection, while current research on big data access control suffers from problems such as coarse granularity and low sharing capability. A new model supporting fine-grained access control and flexible attribute change is therefore proposed. Based on the CP-ABE method, a multi-level attribute-based encryption scheme is designed to solve the fine-grained access control problem, and to solve the problem of attribute revocation, re-encryption and version-number tagging are integrated into the scheme. The analysis shows that the proposed scheme meets the security requirements of access control in big data processing environments and has an advantage in computational overhead compared with previous schemes.
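The revocation idea can be pictured as follows. This is a purely hypothetical sketch of version-number tagging; none of these types come from the paper or from a real CP-ABE library, and the actual scheme operates on pairing-based ciphertext components rather than plain objects.

```java
/**
 * Hypothetical illustration of version-number based attribute
 * revocation: on revocation the authority bumps the attribute's
 * version and re-encrypts the affected ciphertext component, so
 * user keys tagged with the old version no longer decrypt.
 */
public class VersionedAttribute {
    private final String name;
    private int version = 1;

    public VersionedAttribute(String name) { this.name = name; }

    /** Called by the authority when some user's attribute is revoked. */
    public void revoke(Ciphertext ct) {
        version++;                   // invalidates keys tagged with the old version
        ct.reEncrypt(name, version); // only the affected component is re-encrypted
    }

    /** Decryption proceeds only if the key's version tag is current. */
    public boolean keyIsCurrent(UserKey key) {
        return key.versionOf(name) == version;
    }
}

/** Placeholder interfaces standing in for real CP-ABE structures. */
interface Ciphertext { void reEncrypt(String attribute, int newVersion); }
interface UserKey   { int versionOf(String attribute); }
```

The appeal of this design, as the abstract suggests, is that revoking one user requires re-encrypting only the components tied to the changed attribute instead of re-issuing every key in the system.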


2021 ◽  
Author(s):  
Vijay Shankar Sharma ◽  
N.C Barwar

Nowadays, data is increasing exponentially with advances in data science. Every digital footprint generates an enormous amount of data, which is further processed to extract useful information for different end-user applications. A number of technologies are available to handle such enormous amounts of data; Hadoop/HDFS is one of the big data handling technologies. HDFS can easily handle large files, but when it must deal with a massive number of small files, its performance degrades. In this paper we propose a novel technique, Hash Based Archive File (HBAF), that can solve the small file problem of HDFS. The proposed technique is capable of reading the final index files partly, which reduces the memory load on the NameNode, and it offers file-appending capability after creation of the archive.
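The key mechanism, reading the index only partly, can be pictured as a hash-partitioned index. The sketch below is purely illustrative and is not the paper's on-disk format; the bucket count, bucket size and layout are all assumptions. It shows why a lookup touches a single bucket instead of loading the whole index into memory.

```java
import java.io.RandomAccessFile;

/**
 * Hypothetical hash-partitioned archive index: the index file is
 * divided into fixed-size buckets, a lookup hashes the file name to
 * pick a bucket, and only that bucket is read from disk.
 */
public class HashedArchiveIndex {
    private static final int BUCKETS = 1024;       // assumed bucket count
    private static final int BUCKET_BYTES = 4096;  // assumed fixed bucket size

    private final RandomAccessFile index;

    public HashedArchiveIndex(String indexPath) throws Exception {
        this.index = new RandomAccessFile(indexPath, "r");
    }

    /** Reads only the one bucket that can contain the entry for fileName. */
    public byte[] lookupBucket(String fileName) throws Exception {
        int bucket = Math.floorMod(fileName.hashCode(), BUCKETS);
        byte[] raw = new byte[BUCKET_BYTES];
        index.seek((long) bucket * BUCKET_BYTES);
        index.readFully(raw);
        return raw; // entries inside the bucket are then scanned for fileName
    }
}
```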


2019 ◽  
Vol 12 (1) ◽  
pp. 42 ◽  
Author(s):  
Andrey I. Vlasov ◽  
Konstantin A. Muraviev ◽  
Alexandra A. Prudius ◽  
Demid A. Uzenkov
