scholarly journals A parallelization model for performance characterization of Spark Big Data jobs on Hadoop clusters

2021 ◽  
Vol 8 (1) ◽  
Author(s):  
N. Ahmed ◽  
Andre L. C. Barczak ◽  
Mohammad A. Rashid ◽  
Teo Susnjak

AbstractThis article proposes a new parallel performance model for different workloads of Spark Big Data applications running on Hadoop clusters. The proposed model can predict the runtime for generic workloads as a function of the number of executors, without necessarily knowing how the algorithms were implemented. For a certain problem size, it is shown that a model based on serial boundaries for a 2D arrangement of executors can fit the empirical data for various workloads. The empirical data was obtained from a real Hadoop cluster, using Spark and HiBench. The workloads used in this work were included WordCount, SVM, Kmeans, PageRank and Graph (Nweight). A particular runtime pattern emerged when adding more executors to run a job. For some workloads, the runtime was longer with more executors added. This phenomenon is predicted with the new model of parallelisation. The resulting equation from the model explains certain performance patterns that do not fit Amdahl’s law predictions, nor Gustafson’s equation. The results show that the proposed model achieved the best fit with all workloads and most of the data sizes, using the R-squared metric for the accuracy of the fitting of empirical data. The proposed model has advantages over machine learning models due to its simplicity, requiring a smaller number of experiments to fit the data. This is very useful to practitioners in the area of Big Data because they can predict runtime of specific applications by analysing the logs. In this work, the model is limited to changes in the number of executors for a fixed problem size.

2014 ◽  
Vol 621 ◽  
pp. 235-240
Author(s):  
Yue Gao Tang ◽  
Li Miao ◽  
Feng Ping Chen

In the age of big data, MapReduce is developed as an important tool to process massive datasets in a parallel way on cluster and Hadoop is an open-source implementation of it. However, with the increasing size of clusters, it becomes more and more difficult to identify and diagnose faulty nodes, especially those continuing running but with degraded performance. Then, based on an observation that the behaviors of all nodes in the cluster are relatively similar, we propose a peer-comparison approach that can automatically diagnose performance problems in Hadoop cluster through extracting, analyzing both Hadoop logs and OS-level performance metrics on each node. Compared with previous works, our approach is more scalable and effective and can pinpoint the underlying bug of faulty node in Hadoop clusters.


Author(s):  
Ganesh Panatula ◽  
K Sailaja Kumar ◽  
D Evangelin Geetha ◽  
T V Suresh Kumar

<p>In the era of rapid growth of cloud computing, performance calculation of cloud service is an essential criterion to assure quality of service. Nevertheless, it is a perplexing task to effectively analyze the performance of cloud service due to the complexity of cloud resources and the diversity of Big Data applications. Hence, we propose to examine the performance of Big Data applications with Hadoop and thus to figure out the performance in cloud cluster. Hadoop is built based on MapReduce, one of the widely used programming models in Big Data. In this paper, the performance analysis of Hadoop MapReduce WordCount application for Twitter data is presented. A 4-node in-house Hadoop cluster was setup and experiment was carried out for analyzing the performance. Through this work, it was concluded that Hadoop is efficient for BigData applications with 3 or more nodes with replication factor 3. Also, it was observed that system time was relatively more compared to user time for BigData applications beyond 80GB. This experiment had also thrown certain pattern on actual data blocks used to process the WordCount application. </p>


2020 ◽  
Vol 13 (4) ◽  
pp. 790-797
Author(s):  
Gurjit Singh Bhathal ◽  
Amardeep Singh Dhiman

Background: In current scenario of internet, large amounts of data are generated and processed. Hadoop framework is widely used to store and process big data in a highly distributed manner. It is argued that Hadoop Framework is not mature enough to deal with the current cyberattacks on the data. Objective: The main objective of the proposed work is to provide a complete security approach comprising of authorisation and authentication for the user and the Hadoop cluster nodes and to secure the data at rest as well as in transit. Methods: The proposed algorithm uses Kerberos network authentication protocol for authorisation and authentication and to validate the users and the cluster nodes. The Ciphertext-Policy Attribute- Based Encryption (CP-ABE) is used for data at rest and data in transit. User encrypts the file with their own set of attributes and stores on Hadoop Distributed File System. Only intended users can decrypt that file with matching parameters. Results: The proposed algorithm was implemented with data sets of different sizes. The data was processed with and without encryption. The results show little difference in processing time. The performance was affected in range of 0.8% to 3.1%, which includes impact of other factors also, like system configuration, the number of parallel jobs running and virtual environment. Conclusion: The solutions available for handling the big data security problems faced in Hadoop framework are inefficient or incomplete. A complete security framework is proposed for Hadoop Environment. The solution is experimentally proven to have little effect on the performance of the system for datasets of different sizes.


2020 ◽  
Author(s):  
Anusha Ampavathi ◽  
Vijaya Saradhi T

UNSTRUCTURED Big data and its approaches are generally helpful for healthcare and biomedical sectors for predicting the disease. For trivial symptoms, the difficulty is to meet the doctors at any time in the hospital. Thus, big data provides essential data regarding the diseases on the basis of the patient’s symptoms. For several medical organizations, disease prediction is important for making the best feasible health care decisions. Conversely, the conventional medical care model offers input as structured that requires more accurate and consistent prediction. This paper is planned to develop the multi-disease prediction using the improvised deep learning concept. Here, the different datasets pertain to “Diabetes, Hepatitis, lung cancer, liver tumor, heart disease, Parkinson’s disease, and Alzheimer’s disease”, from the benchmark UCI repository is gathered for conducting the experiment. The proposed model involves three phases (a) Data normalization (b) Weighted normalized feature extraction, and (c) prediction. Initially, the dataset is normalized in order to make the attribute's range at a certain level. Further, weighted feature extraction is performed, in which a weight function is multiplied with each attribute value for making large scale deviation. Here, the weight function is optimized using the combination of two meta-heuristic algorithms termed as Jaya Algorithm-based Multi-Verse Optimization algorithm (JA-MVO). The optimally extracted features are subjected to the hybrid deep learning algorithms like “Deep Belief Network (DBN) and Recurrent Neural Network (RNN)”. As a modification to hybrid deep learning architecture, the weight of both DBN and RNN is optimized using the same hybrid optimization algorithm. Further, the comparative evaluation of the proposed prediction over the existing models certifies its effectiveness through various performance measures.


Author(s):  
Jonatan Enes ◽  
Guillaume Fieni ◽  
Roberto R. Exposito ◽  
Romain Rouvoy ◽  
Juan Tourino

2019 ◽  
Vol 6 (1) ◽  
Author(s):  
Mahdi Torabzadehkashi ◽  
Siavash Rezaei ◽  
Ali HeydariGorji ◽  
Hosein Bobarshad ◽  
Vladimir Alves ◽  
...  

AbstractIn the era of big data applications, the demand for more sophisticated data centers and high-performance data processing mechanisms is increasing drastically. Data are originally stored in storage systems. To process data, application servers need to fetch them from storage devices, which imposes the cost of moving data to the system. This cost has a direct relation with the distance of processing engines from the data. This is the key motivation for the emergence of distributed processing platforms such as Hadoop, which move process closer to data. Computational storage devices (CSDs) push the “move process to data” paradigm to its ultimate boundaries by deploying embedded processing engines inside storage devices to process data. In this paper, we introduce Catalina, an efficient and flexible computational storage platform, that provides a seamless environment to process data in-place. Catalina is the first CSD equipped with a dedicated application processor running a full-fledged operating system that provides filesystem-level data access for the applications. Thus, a vast spectrum of applications can be ported for running on Catalina CSDs. Due to these unique features, to the best of our knowledge, Catalina CSD is the only in-storage processing platform that can be seamlessly deployed in clusters to run distributed applications such as Hadoop MapReduce and HPC applications in-place without any modifications on the underlying distributed processing framework. For the proof of concept, we build a fully functional Catalina prototype and a CSD-equipped platform using 16 Catalina CSDs to run Intel HiBench Hadoop and HPC benchmarks to investigate the benefits of deploying Catalina CSDs in the distributed processing environments. The experimental results show up to 2.2× improvement in performance and 4.3× reduction in energy consumption, respectively, for running Hadoop MapReduce benchmarks. Additionally, thanks to the Neon SIMD engines, the performance and energy efficiency of DFT algorithms are improved up to 5.4× and 8.9×, respectively.


2017 ◽  
Vol 117 (9) ◽  
pp. 1866-1889 ◽  
Author(s):  
Vahid Shokri Kahi ◽  
Saeed Yousefi ◽  
Hadi Shabanpour ◽  
Reza Farzipoor Saen

Purpose The purpose of this paper is to develop a novel network and dynamic data envelopment analysis (DEA) model for evaluating sustainability of supply chains. In the proposed model, all links can be considered in calculation of efficiency score. Design/methodology/approach A dynamic DEA model to evaluate sustainable supply chains in which networks have series structure is proposed. Nature of free links is defined and subsequently applied in calculating relative efficiency of supply chains. An additive network DEA model is developed to evaluate sustainability of supply chains in several periods. A case study demonstrates applicability of proposed approach. Findings This paper assists managers to identify inefficient supply chains and take proper remedial actions for performance optimization. Besides, overall efficiency scores of supply chains have less fluctuation. By utilizing the proposed model and determining dual-role factors, managers can plan their supply chains properly and more accurately. Research limitations/implications In real world, managers face with big data. Therefore, we need to develop an approach to deal with big data. Practical implications The proposed model offers useful managerial implications along with means for managers to monitor and measure efficiency of their production processes. The proposed model can be applied in real world problems in which decision makers are faced with multi-stage processes such as supply chains, production systems, etc. Originality/value For the first time, the authors present additive model of network-dynamic DEA. For the first time, the authors outline the links in a way that carry-overs of networks are connected in different periods and not in different stages.


Sign in / Sign up

Export Citation Format

Share Document