A Survey on Job Scheduling in Big Data

2016 ◽  
Vol 16 (3) ◽  
pp. 35-51 ◽  
Author(s):  
M. Senthilkumar ◽  
P. Ilango

Abstract Scheduling for Big Data applications has become an active research area over the last three years. The Hadoop framework has become one of the most popular and widely used frameworks for distributed data processing. Hadoop is open-source software that allows users to utilize hardware effectively. The various scheduling algorithms for the MapReduce model in Hadoop differ in design and behavior and address issues such as data locality, resource awareness, energy, and time. This paper gives an outline of job scheduling, a classification of schedulers, and a comparison of existing algorithms along with their advantages, drawbacks, and limitations. We also discuss various tools and frameworks used for monitoring and ways to improve MapReduce performance. This paper helps beginners and researchers understand the scheduling mechanisms used in Big Data.
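As a minimal illustration of the scheduler classification such surveys cover, the sketch below contrasts Hadoop's FIFO policy (strict submission order) with a fair, round-robin-over-pools policy. The job names and pool labels are invented for the example; real Hadoop schedulers also weigh resources, not just ordering.

```python
from collections import deque

# Hypothetical jobs: (name, pool). FIFO runs jobs in submission order;
# a fair scheduler interleaves jobs across pools so no pool starves.
jobs = [("j1", "etl"), ("j2", "etl"), ("j3", "analytics"), ("j4", "etl")]

def fifo_order(jobs):
    # FIFO: strict submission order, regardless of pool.
    return [name for name, _ in jobs]

def fair_order(jobs):
    # Fair (round-robin over pools): each pool gets a turn in rotation.
    pools = {}
    for name, pool in jobs:
        pools.setdefault(pool, deque()).append(name)
    order = []
    while any(pools.values()):
        for pool in list(pools):
            if pools[pool]:
                order.append(pools[pool].popleft())
    return order

print(fifo_order(jobs))  # ['j1', 'j2', 'j3', 'j4']
print(fair_order(jobs))  # ['j1', 'j3', 'j2', 'j4']
```

Under FIFO the `analytics` job waits behind every earlier `etl` job; under the fair policy it runs second, which is the intuition behind fairness-oriented schedulers discussed in the survey.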

2018 ◽  
Vol 11 (1) ◽  
pp. 90
Author(s):  
Sara Alomari ◽  
Mona Alghamdi ◽  
Fahd S. Alotaibi

Auditing services for outsourced data, especially big data, have been an active research area recently, and many remote data auditing (RDA) schemes have been proposed. The two categories of RDA, Provable Data Possession (PDP) and Proof of Retrievability (PoR), represent the core schemes from which most researchers derive new schemes supporting additional capabilities such as batch and dynamic auditing. In this paper, we investigate the most popular PDP schemes, since many PDP techniques have been further refined to achieve efficient integrity verification. We first review the literature to establish the required background on auditing services and related schemes. Second, we specify a methodology for attaining the research goals. We then define each selected PDP scheme and the auditing properties used to compare the chosen schemes. Finally, we determine, where possible, which scheme is optimal for handling big data auditing.
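The core idea behind PDP-style auditing can be sketched as a challenge-response protocol over randomly sampled blocks. The sketch below is deliberately simplified: real PDP schemes use homomorphic authenticators so the server returns a constant-size proof rather than the blocks themselves; plain per-block HMAC tags here are only for illustration.

```python
import hashlib
import hmac
import os
import random

# Simplified PDP-style sketch: the data owner keeps per-block MAC tags;
# the auditor challenges random block indices and verifies the server's
# response against the stored tags. (Illustrative only: real PDP uses
# homomorphic tags and constant-size proofs.)

key = os.urandom(32)
blocks = [b"block-%d" % i for i in range(100)]   # the outsourced data
tags = [hmac.new(key, b, hashlib.sha256).digest() for b in blocks]

def challenge(n_blocks, sample=5):
    # Random spot-check: probabilistic possession guarantee.
    return random.sample(range(n_blocks), sample)

def server_respond(indices):
    # An honest server returns the challenged blocks.
    return [blocks[i] for i in indices]

def verify(indices, returned):
    return all(hmac.compare_digest(
                   hmac.new(key, blk, hashlib.sha256).digest(), tags[i])
               for i, blk in zip(indices, returned))

idx = challenge(len(blocks))
print(verify(idx, server_respond(idx)))          # True while data is intact
```

A server that silently dropped or corrupted a challenged block would fail `verify`, which is the possession guarantee PDP formalizes.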


2018 ◽  
Vol 60 (5-6) ◽  
pp. 321-326 ◽  
Author(s):  
Christoph Boden ◽  
Tilmann Rabl ◽  
Volker Markl

Abstract The last decade has been characterized by the collection and availability of unprecedented amounts of data, due to rapidly decreasing storage costs and the omnipresence of sensors and data-producing global online services. In order to process and analyze this data deluge, novel distributed data processing systems resting on the paradigm of data flow, such as Apache Hadoop, Apache Spark, or Apache Flink, were built and have been scaled to tens of thousands of machines. However, writing efficient implementations of data analysis programs on these systems requires a deep understanding of systems programming, preventing large groups of data scientists and analysts from using this technology efficiently. In this article, we present some of the main achievements of the research carried out by the Berlin Big Data Center (BBDC). We introduce the two domain-specific languages Emma and LARA, which are deeply embedded in Scala and enable declarative specification and automatic parallelization of data analysis programs; the PEEL framework for transparent and reproducible benchmark experiments on distributed data processing systems; and approaches to foster the interpretability of machine learning models. Finally, we provide an overview of the challenges to be addressed in the second phase of the BBDC.


2017 ◽  
Vol 28 (06) ◽  
pp. 661-682
Author(s):  
Rashed Mazumder ◽  
Atsuko Miyaji ◽  
Chunhua Su

Security, privacy, and data integrity are critical issues in Big Data applications for IoT-enabled environments and cloud-based services, and many challenges remain in establishing secure computation for Big Data applications. Authenticated encryption (AE) plays a core role in ensuring Big Data's confidentiality, integrity, and real-time security, and many AE proposals exist in the literature. Generally, the security notion of AE distinguishes two settings: nonce respect and nonce reuse. Recent studies show that nonce reuse sacrifices the security bound of the AE. In this paper, we propose a nonce-respecting scheme and a probabilistic encryption scheme that are more efficient and suitable for big data applications; both are based on a keyed function. Our first scheme (FS) operates in parallel mode, its security relies on nonce respect, it supports associated data, and it requires fewer calls to functions/block ciphers. Our second scheme, by contrast, is based on probabilistic encryption and is expected to be a lightweight solution because of its weaker security-model construction. Moreover, both schemes satisfy a reasonable privacy security bound.
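To make the nonce-respect notion concrete, the sketch below builds a generic nonce-respecting AE from a keyed function (HMAC used as a PRF), in encrypt-then-MAC form. This is a textbook-style illustration under stated assumptions, not the paper's FS construction; the security contract is that each (key, nonce) pair is used at most once.

```python
import hashlib
import hmac
import os

# Generic nonce-respecting AE sketch built from a keyed function (HMAC as
# a PRF) in encrypt-then-MAC form. Illustrates the nonce-respect setting
# only; it is NOT the FS scheme proposed in the paper.

def _keystream(key, nonce, length):
    # Counter-mode keystream derived from the keyed function.
    out, ctr = b"", 0
    while len(out) < length:
        out += hmac.new(key, nonce + ctr.to_bytes(4, "big"),
                        hashlib.sha256).digest()
        ctr += 1
    return out[:length]

def encrypt(enc_key, mac_key, nonce, ad, plaintext):
    # Nonce respect: the caller must never reuse (key, nonce).
    ct = bytes(p ^ k for p, k in
               zip(plaintext, _keystream(enc_key, nonce, len(plaintext))))
    tag = hmac.new(mac_key, nonce + ad + ct, hashlib.sha256).digest()
    return ct, tag

def decrypt(enc_key, mac_key, nonce, ad, ct, tag):
    # Verify the tag (covering nonce, associated data, and ciphertext)
    # before decrypting.
    expect = hmac.new(mac_key, nonce + ad + ct, hashlib.sha256).digest()
    if not hmac.compare_digest(expect, tag):
        raise ValueError("authentication failed")
    return bytes(c ^ k for c, k in
                 zip(ct, _keystream(enc_key, nonce, len(ct))))

ek, mk, nonce = os.urandom(32), os.urandom(32), os.urandom(12)
ct, tag = encrypt(ek, mk, nonce, b"header", b"big data record")
print(decrypt(ek, mk, nonce, b"header", ct, tag))  # b'big data record'
```

Reusing the nonce with the same key would leak the XOR of two plaintexts, which is precisely why nonce reuse degrades the security bound the abstract mentions.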


Author(s):  
Uttama Garg

The amount of data in today's world is increasing exponentially, and effectively analyzing Big Data is a very complex task. The MapReduce programming model, created by Google in 2004, revolutionized the big-data computing market, and it is now used for scientific and research analysis as well as for commercial purposes. The MapReduce model, however, is quite a low-level programming model and has many limitations. Active research is being undertaken to build models that overcome these limitations. In this paper we study some popular data-analytic models that redress some of the limitations of MapReduce, namely ASTERIX and Pregel (Giraph). We discuss these models briefly and, through the discussion, highlight how they overcome MapReduce's limitations.
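The MapReduce model the paragraph refers to can be sketched in-memory in a few lines: map emits key/value pairs, the framework shuffles them by key, and reduce aggregates each group. The word-count example below is the model's canonical illustration; a real Hadoop job would express the same logic as Mapper and Reducer classes distributed across a cluster.

```python
from collections import defaultdict

# Minimal in-memory sketch of the MapReduce model: map emits key/value
# pairs, shuffle groups them by key, reduce aggregates each group.

def map_phase(docs):
    for doc in docs:
        for word in doc.split():
            yield word, 1                      # emit (word, 1)

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)              # group values by key
    return groups

def reduce_phase(groups):
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data big compute", "data locality"]
print(reduce_phase(shuffle(map_phase(docs))))
# {'big': 2, 'data': 2, 'compute': 1, 'locality': 1}
```

The "low-level" character criticized in the paper is visible even here: joins, iteration, and graph traversals (Pregel's domain) must all be hand-encoded as chains of such map/shuffle/reduce rounds.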


Author(s):  
Rajni Aron ◽  
Deepak Kumar Aggarwal

Cloud Computing has become a buzzword in the IT industry. Providing inexpensive computing resources on a pay-as-you-go basis, Cloud Computing is rapidly gaining momentum as a substitute for traditional Information Technology (IT) based organizations. The increased utilization of Clouds therefore makes the execution of Big Data processing jobs a vital research area. As more and more users store and process their real-time data in Cloud environments, resource provisioning and scheduling of Big Data processing jobs become key considerations for the efficient execution of Big Data applications. This chapter discusses the fundamental concepts underlying Cloud Computing and Big Data and the relationship between them. It will help researchers identify the important characteristics of Cloud resource management systems for handling Big Data processing jobs, and will also help them select the most suitable technique for processing Big Data jobs in a Cloud Computing environment.


Author(s):  
Richard Earl

Topology remains a large, active research area in mathematics. Unsurprisingly, its character has changed over the last century: there is considerably less current interest in general topology, but whole new areas have emerged, such as topological data analysis, which helps analyze big data sets. The Epilogue concludes that the interfaces of topology with other areas have remained rich and numerous, and it can be hard to tell where topology stops and geometry, algebra, analysis, or physics begins. Often that richness comes from studying structures with interconnected flavours of algebra, geometry, and topology, but sometimes a result seemingly of an entirely algebraic nature, say, can be proved by purely topological means.


2020 ◽  
Vol 9 (1) ◽  
pp. 1151-1155

In both industry and research, big data applications consume most of the storage space in use. Among the sources of big data, video streams from CCTV cameras are as important as other sources such as medical data and social media data. CCTV cameras are deployed for security purposes in all places where security matters greatly. Security can be defined in different ways, such as theft identification and violence detection, and in most highly secured areas it plays a major role in real-time environments. This paper discusses detecting and recognising the facial features of persons using deep learning concepts, covering object detection, action detection, and identification. The issues found in existing methods are identified and summarized.


Big Data ◽  
2016 ◽  
pp. 1110-1128
Author(s):  
Ruben C. Huacarpuma ◽  
Daniel da C. Rodrigues ◽  
Antonio M. Rubio Serrano ◽  
João Paulo C. Lustosa da Costa ◽  
Rafael T. de Sousa Júnior ◽  
...  

The Brazilian Ministry of Planning, Budget, and Management (MP) manages enormous amounts of data generated on a daily basis. Processing all of this data more efficiently can reduce operating costs, thereby making better use of public resources. In this chapter, the authors construct a Big Data framework to deal with data loading and querying problems in distributed data processing. They evaluate the proposed Big Data processes by comparing them with the current centralized process used by MP in its Integrated System for Human Resources Management (in Portuguese: Sistema Integrado de Administração de Pessoal – SIAPE). This study focuses primarily on a NoSQL solution using HBase and Cassandra, which is compared to the relational PostgreSQL implementation used as a baseline. The inclusion of Big Data technologies in the proposed solution noticeably improves loading and querying performance.

