Big Data Hadoop MapReduce Job Scheduling: A Short Survey

Author(s):  
N. Deshai ◽  
B. V. D. S. Sekhar ◽  
S. Venkataramana ◽  
K. Srinivas ◽  
G. P. S. Varma
2019 ◽  
Vol 6 (1) ◽  
Author(s):  
Mahdi Torabzadehkashi ◽  
Siavash Rezaei ◽  
Ali HeydariGorji ◽  
Hosein Bobarshad ◽  
Vladimir Alves ◽  
...  

Abstract: In the era of big data applications, the demand for more sophisticated data centers and high-performance data processing mechanisms is increasing drastically. Data are originally stored in storage systems. To process data, application servers need to fetch them from storage devices, which imposes the cost of moving data to the system. This cost is directly related to the distance between the processing engines and the data. This is the key motivation for the emergence of distributed processing platforms such as Hadoop, which move processing closer to the data. Computational storage devices (CSDs) push the “move process to data” paradigm to its limit by deploying embedded processing engines inside storage devices to process data. In this paper, we introduce Catalina, an efficient and flexible computational storage platform that provides a seamless environment to process data in place. Catalina is the first CSD equipped with a dedicated application processor running a full-fledged operating system that provides filesystem-level data access for applications. Thus, a vast spectrum of applications can be ported to run on Catalina CSDs. Due to these unique features, to the best of our knowledge, Catalina CSD is the only in-storage processing platform that can be seamlessly deployed in clusters to run distributed applications such as Hadoop MapReduce and HPC applications in place without any modifications to the underlying distributed processing framework. As a proof of concept, we build a fully functional Catalina prototype and a CSD-equipped platform using 16 Catalina CSDs to run Intel HiBench Hadoop and HPC benchmarks and investigate the benefits of deploying Catalina CSDs in distributed processing environments. The experimental results show up to a 2.2× improvement in performance and a 4.3× reduction in energy consumption for running Hadoop MapReduce benchmarks. Additionally, thanks to the Neon SIMD engines, the performance and energy efficiency of DFT algorithms are improved by up to 5.4× and 8.9×, respectively.


Author(s):  
Seethalakshmi V ◽  
Govindasamy V ◽  
Akila V
Keyword(s):  
Big Data ◽  

2016 ◽  
Vol 16 (3) ◽  
pp. 35-51 ◽  
Author(s):  
M. Senthilkumar ◽  
P. Ilango

Abstract: Scheduling for Big Data applications has become an active research area in the last three years. The Hadoop framework has become one of the most popular and widely used frameworks for distributed data processing. Hadoop is also open-source software that allows the user to utilize hardware effectively. The various scheduling algorithms for the MapReduce model in Hadoop differ in design and behavior, and address issues such as data locality, resource awareness, energy, and time. This paper gives an outline of job scheduling, a classification of schedulers, and a comparison of existing algorithms with their advantages, drawbacks, and limitations. We also discuss various tools and frameworks used for monitoring, and ways to improve performance in MapReduce. This paper helps beginners and researchers understand the scheduling mechanisms used in Big Data.
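
To make the data-locality concern mentioned in this abstract concrete, the following is a minimal sketch of a greedy locality-aware task assignment. It is not taken from the paper and is not Hadoop's scheduler; the function name `assign_tasks`, the node names, and the one-slot-per-node model are all assumptions made for illustration.

```python
def assign_tasks(tasks, free_nodes):
    """Greedy, locality-aware task placement (illustrative only).

    tasks: list of (task_id, replicas), where replicas is the set of
           nodes that hold a copy of the task's input block.
    free_nodes: list of node names, each assumed to have one free slot.
    Returns a dict mapping task_id -> (node, is_local).
    """
    available = list(free_nodes)
    assignment = {}
    remaining = []

    # Pass 1: try to run each task on a node that already stores its
    # input. Tasks with fewer replicas go first, since they have fewer
    # chances of finding a local slot.
    for task_id, replicas in sorted(tasks, key=lambda t: len(t[1])):
        local = next((n for n in available if n in replicas), None)
        if local is not None:
            available.remove(local)
            assignment[task_id] = (local, True)
        else:
            remaining.append(task_id)

    # Pass 2: place leftover tasks on any free node; these must read
    # their input over the network.
    for task_id in remaining:
        if not available:
            break  # no slots left; the task stays unscheduled
        assignment[task_id] = (available.pop(0), False)

    return assignment
```

For example, `assign_tasks([("t1", {"n1", "n2"}), ("t2", {"n1"}), ("t3", {"n4"})], ["n1", "n2", "n3"])` places `t1` and `t2` on nodes that hold their data and falls back to a remote node for `t3`. Real schedulers additionally weigh fairness, deadlines, and resource usage, which this sketch ignores.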


2018 ◽  
Vol 7 (2.26) ◽  
pp. 80
Author(s):  
Dr E. Laxmi Lydia ◽  
M Srinivasa Rao

Big Data is currently one of the most prominent subjects in cloud research; its main characteristics are volume, velocity, and variety. These characteristics are difficult to manage with traditional software and methodologies. Data arising from the various big data domains is handled with Hadoop, an open-source framework developed to provide solutions for this problem. Big data analytics is handled by the Hadoop MapReduce framework, the key engine of a Hadoop cluster, which is used extensively today and works as a batch processing system. Apache developed an engine named "Tez", which supports interactive queries and does not write temporary data to the Hadoop Distributed File System (HDFS). This paper focuses on a performance comparison of MapReduce and Tez; the performance of the two engines is examined with compression of the input files and the map output files. To compare the two engines, we used the Bzip compression algorithm for the input files and Snappy for the map output files. The Word Count and TeraSort benchmarks are used in our experiments. For the Word Count benchmark, the results show that the Tez engine has a better execution time than the Hadoop MapReduce engine for both compressed and non-compressed data, reducing execution time by nearly 39% compared to the Hadoop MapReduce engine. For the TeraSort benchmark, in contrast, the Tez engine has a higher execution time than the Hadoop MapReduce engine.
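
As a rough illustration of the Word Count workload over compressed input described in this abstract, the sketch below runs the map, shuffle, and reduce phases in a single Python process on bzip2-compressed bytes. It is not the authors' setup and uses neither Hadoop nor Tez; the function names are hypothetical, and Snappy compression of the map output is omitted because it is not part of the Python standard library.

```python
import bz2
from collections import defaultdict


def map_phase(line):
    # Emit a (word, 1) pair for every whitespace-separated token.
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    # Group all emitted values by key, as the framework would do
    # between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    # Sum the partial counts for one word.
    return word, sum(counts)


def word_count(compressed: bytes) -> dict:
    # Decompress the bzip2 input, then run map -> shuffle -> reduce.
    text = bz2.decompress(compressed).decode("utf-8")
    pairs = [p for line in text.splitlines() for p in map_phase(line)]
    return dict(reduce_phase(w, c) for w, c in shuffle(pairs).items())
```

Calling `word_count(bz2.compress(b"big data big\nData hadoop\n"))` counts words case-insensitively across both lines. In a real cluster, each phase would run in parallel across nodes, and the choice of codec affects both I/O volume and CPU cost.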


2012 ◽  
Vol 5 (12) ◽  
pp. 2014-2015 ◽  
Author(s):  
Jens Dittrich ◽  
Jorge-Arnulfo Quiané-Ruiz
