hadoop cluster Latest Research Papers

An Enhanced Parallelisation Model for Performance Prediction of Apache Spark on a Multinode Hadoop Cluster

Big Data and Cognitive Computing ◽

10.3390/bdcc5040065 ◽

2021 ◽

Vol 5 (4) ◽

pp. 65

Author(s):

Nasim Ahmed ◽

Andre L. C. Barczak ◽

Mohammad A. Rashid ◽

Teo Susnjak

Keyword(s):

Big Data ◽

Performance Prediction ◽

Vital Role ◽

Large Datasets ◽

Apache Spark ◽

Planning System ◽

Hadoop Cluster ◽

Experimental Findings ◽

Simple Equations ◽

Efficiency And Reliability

Big data frameworks play a vital role in storing, processing, and analysing large datasets. Apache Spark has been established as one of the most popular big data engines for its efficiency and reliability. However, one of the significant problems of the Spark system is performance prediction. Spark has more than 150 configurable parameters, and configuration of so many parameters is challenging task when determining the suitable parameters for the system. In this paper, we proposed two distinct parallelisation models for performance prediction. Our insight is that each node in a Hadoop cluster can communicate with identical nodes, and a certain function of the non-parallelisable runtime can be estimated accordingly. Both models use simple equations that allows us to predict the runtime when the size of the job and the number of executables are known. The proposed models were evaluated based on five HiBench workloads, Kmeans, PageRank, Graph (NWeight), SVM, and WordCount. The workload’s empirical data were fitted with one of the two models meeting the accuracy requirements. Finally, the experimental findings show that the model can be a handy and helpful tool for scheduling and planning system deployment.

Intelligent Performance Prediction: The Use Case of a Hadoop Cluster

Electronics ◽

10.3390/electronics10212690 ◽

2021 ◽

Vol 10 (21) ◽

pp. 2690

Author(s):

Dimitris Uzunidis ◽

Panagiotis Karkazis ◽

Chara Roussou ◽

Charalampos Patrikakis ◽

Helen C. Leligou

Keyword(s):

Service Providers ◽

Single Layer ◽

Machine Learning Algorithms ◽

Service Characteristics ◽

Potential System ◽

Break Points ◽

Network Functions ◽

Hadoop Cluster ◽

Service Life Cycle

The optimum utilization of infrastructural resources is a highly desired yet cumbersome task for service providers to achieve. This is because the optimal amount of such resources is a function of various parameters, such as the desired/agreed quality of service (QoS), the service characteristics/profile, workload and service life-cycle. The advent of frameworks that foresee the dynamic establishment and placement of service and network functions further contributes to a decrease in the effectiveness of traditional resource allocation methods. In this work, we address this problem by developing a mechanism which first performs service profiling and then a prediction of the resources that would lead to the desired QoS for each newly deployed service. The main elements of our approach are as follows: a) the collection of data from all three layers of the deployed infrastructure (hardware, virtual and service), instead of a single layer of the deployed infrastructure, to provide a clearer picture on the potential system break points, b) the study of well-known container based implementations following that microservice paradigm and c) the use of a data analysis routine that employs a set of machine learning algorithms and performs accurate predictions of the required resources for any future service requests. We investigate the performance of the proposed framework using our open-source implementation to examine the case of a Hadoop cluster. The results show that running a small number of tests is adequate to assess the main system break points and at the same time to attain accurate resource predictions for any future request.

A Novel Node Management in Hadoop Cluster by Using DNA

International Journal of Information Technology Project Management ◽

10.4018/ijitpm.2021100104 ◽

2021 ◽

Vol 12 (4) ◽

pp. 38-46

Author(s):

Balaraju J. ◽

P.V.R.D. Prasada Rao

Keyword(s):

Distributed System ◽

Data Hiding ◽

Dna Sequences ◽

Sensitive Data ◽

Hadoop Cluster ◽

Physical Address

This paper proposes a novel node management for the distributed system using DNA hiding and generating a unique key by combing a unique physical address (MAC) of node and hostname. This mechanism provides better node management for the Hadoop cluster providing adding and deletion node mechanism by using limited computations and providing better node security from hackers. The objective of this paper is to design an algorithm to implement node-sensitive data hiding using DNA sequences and provide security to the node and its data from hackers.

A parallelization model for performance characterization of Spark Big Data jobs on Hadoop clusters

Journal Of Big Data ◽

10.1186/s40537-021-00499-7 ◽

2021 ◽

Vol 8 (1) ◽

Author(s):

N. Ahmed ◽

Andre L. C. Barczak ◽

Mohammad A. Rashid ◽

Teo Susnjak

Keyword(s):

Big Data ◽

Empirical Data ◽

Performance Model ◽

Problem Size ◽

Parallel Performance ◽

Big Data Applications ◽

Proposed Model ◽

Performance Patterns ◽

Hadoop Clusters ◽

Hadoop Cluster

AbstractThis article proposes a new parallel performance model for different workloads of Spark Big Data applications running on Hadoop clusters. The proposed model can predict the runtime for generic workloads as a function of the number of executors, without necessarily knowing how the algorithms were implemented. For a certain problem size, it is shown that a model based on serial boundaries for a 2D arrangement of executors can fit the empirical data for various workloads. The empirical data was obtained from a real Hadoop cluster, using Spark and HiBench. The workloads used in this work were included WordCount, SVM, Kmeans, PageRank and Graph (Nweight). A particular runtime pattern emerged when adding more executors to run a job. For some workloads, the runtime was longer with more executors added. This phenomenon is predicted with the new model of parallelisation. The resulting equation from the model explains certain performance patterns that do not fit Amdahl’s law predictions, nor Gustafson’s equation. The results show that the proposed model achieved the best fit with all workloads and most of the data sizes, using the R-squared metric for the accuracy of the fitting of empirical data. The proposed model has advantages over machine learning models due to its simplicity, requiring a smaller number of experiments to fit the data. This is very useful to practitioners in the area of Big Data because they can predict runtime of specific applications by analysing the logs. In this work, the model is limited to changes in the number of executors for a fixed problem size.

Analisa Performa Klastering Data Besar pada Hadoop

Infotek : Jurnal Informatika dan Teknologi ◽

10.29408/jit.v4i2.3565 ◽

2021 ◽

Vol 4 (2) ◽

pp. 174-183

Author(s):

Hadian Mandala Putra ◽

◽

Taufik Akbar ◽

Ahwan Ahmadi ◽

Muhammad Iman Darmawan ◽

...

Keyword(s):

Big Data ◽

Relational Model ◽

Data Types ◽

Analysis Process ◽

A Value ◽

Silhouette Coefficient ◽

Data Partitions ◽

Hadoop Cluster ◽

Map Function ◽

Cluster Quality

Big Data is a collection of data with a large and complex size, consisting of various data types and obtained from various sources, overgrowing quickly. Some of the problems that will arise when processing big data, among others, are related to the storage and access of big data, which consists of various types of data with high complexity that are not able to be handled by the relational model. One technology that can solve the problem of storing and accessing big data is Hadoop. Hadoop is a technology that can store and process big data by distributing big data into several data partitions (data blocks). Problems arise when an analysis process requires all data spread out into one data entity, for example, in the data clustering process. One alternative solution is to do a parallel and scattered analysis, then perform a centralized analysis of the results of the scattered analysis. This study examines and analyzes two methods, namely K-Medoids Mapreduce and K-Modes without Mapreduce. The dataset used is a dataset about cars consisting of 3.5 million rows of data with 400MB distributed in a Hadoop Cluster (consisting of more than one engine). Hadoop has a MapReduce feature, consisting of 2 functions, namely map and reduce. The map function performs a selection to retrieve a key, value pairs, and returns a value in the form of a collection of key value pairs, and then the reduce function combines all key value pairs from several map functions. The results of the cluster quality evaluation are tested using the Silhouette Coefficient testing metric. The K-Medoids MapReduce algorithm for the car dataset gives a silhouette value of 0.99 with a total of 2 clusters.

A Novel Hadoop Security Model for Addressing Malicious Collusive Workers

Computational Intelligence and Neuroscience ◽

10.1155/2021/5753948 ◽

2021 ◽

Vol 2021 ◽

pp. 1-10

Author(s):

Amr M. Sauber ◽

Ahmed Awad ◽

Amr F. Shawish ◽

Passent M. El-Kafrawy

Keyword(s):

Security Model ◽

Cloud Environment ◽

Master Node ◽

Computing Model ◽

System A ◽

Proposed Model ◽

Data Production ◽

Hadoop Cluster ◽

High Level ◽

Novel Model

With the daily increase of data production and collection, Hadoop is a platform for processing big data on a distributed system. A master node globally manages running jobs, whereas worker nodes process partitions of the data locally. Hadoop uses MapReduce as an effective computing model. However, Hadoop experiences a high level of security vulnerability over hybrid and public clouds. Specially, several workers can fake results without actually processing their portions of the data. Several redundancy-based approaches have been proposed to counteract this risk. A replication mechanism is used to duplicate all or some of the tasks over multiple workers (nodes). A drawback of such approaches is that they generate a high overhead over the cluster. Additionally, malicious workers can behave well for a long period of time and attack later. This paper presents a novel model to enhance the security of the cloud environment against untrusted workers. A new component called malicious workers’ trap (MWT) is developed to run on the master node to detect malicious (noncollusive and collusive) workers as they convert and attack the system. An implementation to test the proposed model and to analyze the performance of the system shows that the proposed model can accurately detect malicious workers with minor processing overhead compared to vanilla MapReduce and Verifiable MapReduce (V-MR) model [1]. In addition, MWT maintains a balance between the security and usability of the Hadoop cluster.

An Enhanced Security Measure for Multimedia Images Using Hadoop Cluster

International Journal of Operations Research and Information Systems ◽

10.4018/ijoris.20210701.oa4 ◽

2021 ◽

Vol 12 (3) ◽

pp. 1-7

Author(s):

Prakash Mohan ◽

Balasaravanan Kuppuraj ◽

Saravanakumar Chellai

Keyword(s):

Image Retrieval ◽

The Internet ◽

Secret Image ◽

Secret Information ◽

Rsa Algorithm ◽

Security Measure ◽

Multimedia Image ◽

Hadoop Cluster

Information are generated over the internet for every second. These information are not fully secured. To increase the security of these information send over the internet there are two methods Cryptography and Steganography are combined to encrypt the data using RSA algorithm as well as to hide the data in multimedia image in Hadoop Cluster. Features of the resultant image such as color are extracted and stored separately in Hadoop cluster to enhance security. Then combining features of the Stenographic image for secret image retrieval, which has been then split into image and secret information. At last, decrypting the secret information, we retrieve the actual information. Application of this system in Hadoop will increase the speed of execution of the process.

An Overview of Milestones of Big Data Analytics in Clinical and Medical Analysis

International Journal of Engineering and Advanced Technology - Regular Issue ◽

10.35940/ijeat.e2899.0610521 ◽

2021 ◽

Vol 10 (5) ◽

pp. 416-421

Author(s):

Manu M R ◽

B Balamurugan

Keyword(s):

Big Data ◽

Data Analytics ◽

Big Data Analytics ◽

Large Data ◽

Human Interaction ◽

Time Data ◽

Effective Decision ◽

Hadoop Cluster ◽

And Behavior ◽

Medical Analysis

The technological advancements make changes during availability of knowledge in a huge way. As the volume of data is increasing exponentially, there is a need for better management of data to research and industry. This data, referred to as Big Data, is now employed by various organizations to extract valuable information which may reanalyzed computationally to reveal patterns, trends and associations revealing the human interaction and behavior for making various industrial decisions But the data must be optimized, integrated, secured and visualized to make any effective decision. Analyzing of the large volume of data is not beneficial always unless it is analyzed properly. The existing techniques are insufficient to analyze the large Data and identify the frequent services accessed by the cloud users. Various services can be integrated to provide a better environment to work in emergency cases pretty earlier. Using these services, people become widely vulnerable to exposure. The data is large and provides an insight in to future predictions, which could definitely prevent maximum medical cases from happening. But without big data analytics techniques and therefore the Hadoop cluster, this data remains useless. Through this paper, we'll explain how real time data may be useful to research and predict severe

Experimental Analysis in Hadoop MapReduce: A Closer Look at Fault Detection and Recovery Techniques

Sensors ◽

10.3390/s21113799 ◽

2021 ◽

Vol 21 (11) ◽

pp. 3799

Author(s):

Muntadher Saadoon ◽

Siti Hafizah Ab Hamid ◽

Hazrina Sofian ◽

Hamza Altarturi ◽

Nur Nasuha ◽

...

Keyword(s):

Response Time ◽

Experimental Analysis ◽

Data Locality ◽

Critical Conditions ◽

Hadoop Mapreduce ◽

Recovery Techniques ◽

Recovery Times ◽

Hadoop Cluster ◽

The Impact ◽

The Relationship

Hadoop MapReduce reactively detects and recovers faults after they occur based on the static heartbeat detection and the re-execution from scratch techniques. However, these techniques lead to excessive response time penalties and inefficient resource consumption during detection and recovery. Existing fault-tolerance solutions intend to mitigate the limitations without considering critical conditions such as fail-slow faults, the impact of faults at various infrastructure levels and the relationship between the detection and recovery stages. This paper analyses the response time under two main conditions: fail-stop and fail-slow, when they manifest with node, service, and the task at runtime. In addition, we focus on the relationship between the time for detecting and recovering faults. The experimental analysis is conducted on a real Hadoop cluster comprising MapReduce, YARN and HDFS frameworks. Our analysis shows that the recovery of a single fault leads to an average of 67.6% response time penalty. Even though the detection and recovery times are well-turned, data locality and resource availability must also be considered to obtain the optimum tolerance time and the lowest penalties.

Infrastructure for Time Critical Applications of Big Data Systems Using Light Weight YARN Architecture

10.21203/rs.3.rs-436282/v1 ◽

2021 ◽

Author(s):

Saravanan A.M. ◽

K. Loheswaran ◽

G. Naga Rama Devi ◽

Karuppathal R ◽

C Balakrishnan ◽

...

Keyword(s):

Big Data ◽

Data Analytics ◽

Big Data Analytics ◽

Data Systems ◽

Computational Framework ◽

Internet Resources ◽

Big Data Systems ◽

Hadoop Cluster ◽

Almost All ◽

Time Critical

Abstract Increasing of humanity and development of Internet resources, storage size is growing with each day, whereby digital records are accessible in clouds of an exploratory format. The immediate future of Big Data is coming shortly for almost all other sectors. Big data can aid in the metamorphosis of significant company operations by offering a recommended and reliable overview of available data. Big data has also figured prominently in the detection of violence. Present framework for designing Big data implementations is capable of processing vast quantities of data through Big data analytics using collections of computing devices together to execute complex processing. Furthermore, existing technologies have not been built to fulfil the specifications of time-critical application areas and are far more oriented on real applications than on time-critical ones. This paper proposes the lightweight architecture called Yet Another Resource Negotiator (YARN), which focuses on the concept of a time-critical big-data system from the perspective of specifications and analyses the essential principles of several common big-data implementations. YARN as the normal computational framework to help MapReduce and another application instances within that Hadoop cluster. YARN requires multiple programs to execute concurrently on a constitutive common server and assent programs to delegate services depending on need. The final evaluation is accompanied by problems stemming from infrastructure and services that serve applications, recommend frameworkand provide preliminary efficiency behaviours that often contribute system impacts to implementation reliability.

hadoop cluster
Recently Published Documents

TOTAL DOCUMENTS

H-INDEX

An Enhanced Parallelisation Model for Performance Prediction of Apache Spark on a Multinode Hadoop Cluster

Intelligent Performance Prediction: The Use Case of a Hadoop Cluster

A Novel Node Management in Hadoop Cluster by Using DNA

A parallelization model for performance characterization of Spark Big Data jobs on Hadoop clusters

Analisa Performa Klastering Data Besar pada Hadoop

A Novel Hadoop Security Model for Addressing Malicious Collusive Workers

An Enhanced Security Measure for Multimedia Images Using Hadoop Cluster

An Overview of Milestones of Big Data Analytics in Clinical and Medical Analysis

Experimental Analysis in Hadoop MapReduce: A Closer Look at Fault Detection and Recovery Techniques

Infrastructure for Time Critical Applications of Big Data Systems Using Light Weight YARN Architecture

Export Citation Format

hadoop clusterRecently Published Documents

TOTAL DOCUMENTS

H-INDEX

An Enhanced Parallelisation Model for Performance Prediction of Apache Spark on a Multinode Hadoop Cluster

Intelligent Performance Prediction: The Use Case of a Hadoop Cluster

A Novel Node Management in Hadoop Cluster by Using DNA

A parallelization model for performance characterization of Spark Big Data jobs on Hadoop clusters

Analisa Performa Klastering Data Besar pada Hadoop

A Novel Hadoop Security Model for Addressing Malicious Collusive Workers

An Enhanced Security Measure for Multimedia Images Using Hadoop Cluster

An Overview of Milestones of Big Data Analytics in Clinical and Medical Analysis

Experimental Analysis in Hadoop MapReduce: A Closer Look at Fault Detection and Recovery Techniques

Infrastructure for Time Critical Applications of Big Data Systems Using Light Weight YARN Architecture

hadoop cluster
Recently Published Documents