Big Data Workflows: Locality-Aware Orchestration Using Software Containers

Sensors, 2021, Vol 21 (24), pp. 8212
Author(s): Andrei-Alin Corodescu, Nikolay Nikolov, Akif Quddus Khan, Ahmet Soylu, Mihhail Matskin, ...

The emergence of the edge computing paradigm has shifted data processing from centralised infrastructures to heterogeneous and geographically distributed infrastructures. Data processing solutions must therefore consider data locality to reduce the performance penalties of data transfers between remote data centres. Existing big data processing solutions provide limited support for handling data locality and are inefficient at processing the small, frequent events typical of edge environments. This article proposes a novel architecture and a proof-of-concept implementation for software container-centric big data workflow orchestration that puts data locality at the forefront. The proposed solution takes available data locality information into account, leverages long-lived containers to execute workflow steps, and handles interaction with different data sources through containers. We compare the proposed solution with Argo Workflows and demonstrate a significant improvement in execution speed when processing the same data units. Finally, we carry out experiments with the proposed solution under different configurations and analyse the individual aspects affecting the performance of the overall solution.
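
The locality-first placement idea can be summarised in a few lines. The following is a minimal Python sketch, not the authors' implementation: the names (`Worker`, `WorkflowStep`, `pick_worker`) and the transfer-cost estimate are illustrative assumptions. It greedily assigns each workflow step to the worker whose local store already holds the step's input, falling back to the cheapest transfer otherwise.

```python
# Illustrative sketch of locality-aware step placement (not the paper's code).
# Assumption: each worker reports which datasets it already stores locally.

from dataclasses import dataclass, field

@dataclass
class Worker:
    name: str
    local_datasets: set = field(default_factory=set)
    bandwidth_mbps: float = 1000.0  # link to remote data centres

@dataclass
class WorkflowStep:
    name: str
    input_dataset: str
    input_size_mb: float

def transfer_cost(step: WorkflowStep, worker: Worker) -> float:
    """Estimated seconds to stage the step's input on this worker."""
    if step.input_dataset in worker.local_datasets:
        return 0.0  # data-local: no network transfer needed
    return step.input_size_mb * 8 / worker.bandwidth_mbps

def pick_worker(step: WorkflowStep, workers: list) -> Worker:
    """Prefer the worker that already holds the input (locality first)."""
    return min(workers, key=lambda w: transfer_cost(step, w))

workers = [Worker("dc1-node", {"sensor-logs"}), Worker("dc2-node", set(), 100.0)]
step = WorkflowStep("aggregate", "sensor-logs", 4096)
chosen = pick_worker(step, workers)
chosen.local_datasets.add(step.input_dataset)  # outputs stay local for reuse
print(chosen.name)  # -> dc1-node
```

Keeping step outputs registered as local datasets on the chosen worker mirrors the role the long-lived containers play: subsequent steps can be placed where intermediate data already resides.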

2018, Vol 8 (11), pp. 2216
Author(s): Jiahui Jin, Qi An, Wei Zhou, Jiakai Tang, Runqun Xiong

Network bandwidth is a scarce resource in big data environments, so data locality is a fundamental problem for data-parallel frameworks such as Hadoop and Spark. The problem is exacerbated in clusters of multicore servers, where multiple tasks running on the same server compete for that server's network bandwidth. Existing approaches address it by scheduling computational tasks near their input data, taking into account the server's free time, data placements, and data transfer costs. However, such approaches usually assign identical values to data transfer costs, even though a multicore server's data transfer cost grows with the number of data-remote tasks it runs; as a result, they minimize data-processing time ineffectively. As a solution, we propose DynDL (Dynamic Data Locality), a novel data-locality-aware task-scheduling model that handles dynamic data transfer costs for multicore servers. DynDL offers greater flexibility than existing approaches by using a set of non-decreasing functions to evaluate dynamic data transfer costs. We also propose online and offline algorithms (based on DynDL) that minimize data-processing time and adaptively adjust data locality. Although the scheduling problem underlying DynDL is NP-complete (nondeterministic polynomial-complete), we prove that the offline algorithm runs in quadratic time and produces optimal results for DynDL's specific uses. Using a series of simulations and real-world executions, we show that our algorithms reduce data-processing time by 30% compared with algorithms that ignore dynamic data transfer costs. Moreover, they can adaptively adjust data locality based on the server's free time, data placement, and network bandwidth, and can schedule tens of thousands of tasks within seconds or even subseconds.
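
The key modelling choice, a per-server non-decreasing transfer-cost function, can be illustrated with a short sketch. This is a simplified Python outline under assumed cost shapes, not the DynDL algorithm itself: `remote_cost`, its linear growth, and the greedy placement loop are illustrative stand-ins.

```python
# Sketch: dynamic (non-decreasing) data transfer costs on multicore servers.
# Each extra data-remote task on a server raises that server's transfer cost,
# because all of its tasks share the server's network bandwidth.

def remote_cost(n_remote_tasks: int, base: float = 1.0) -> float:
    """Cost of adding one more data-remote task to a server.
    Illustrative linear shape; the model only requires non-decreasing cost."""
    return base * (1 + n_remote_tasks)

def schedule(tasks, servers):
    """Greedy sketch: place each task where its marginal cost is lowest.
    tasks: list of (task_id, server_holding_the_task's_input_data)
    servers: dict server -> count of data-remote tasks already assigned."""
    assignment = {}
    for task_id, data_home in tasks:
        best, best_cost = None, float("inf")
        for server, n_remote in servers.items():
            # Local execution pays no transfer; remote pays the dynamic cost.
            cost = 0.0 if server == data_home else remote_cost(n_remote)
            if cost < best_cost:
                best, best_cost = server, cost
        if best != data_home:
            servers[best] += 1  # one more remote task now shares the link
        assignment[task_id] = best
    return assignment

tasks = [("t1", "s1"), ("t2", "s1"), ("t3", "s2")]
print(schedule(tasks, {"s1": 0, "s2": 0}))
# -> {'t1': 's1', 't2': 's1', 't3': 's2'}
```

A flat-cost model would treat every remote placement identically; charging the marginal cost per remote task is what lets the scheduler spread data-remote work away from already congested servers.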


2021, Vol 12 (2), pp. 53-72
Author(s): Rojalina Priyadarshini, Rabindra Kumar Barik, Harish Chandra Dubey, Brojo Kishore Mishra

The growing use of wearables within the Internet of Things (IoT) creates ever-increasing multi-modal data from various smart health applications. The enormous volume of generated data creates new challenges in transmission, storage, and processing, and processing medical big data in a cloud backend raises further challenges such as communication latency and data security. Fog computing (FC) is an emerging distributed computing paradigm that addresses these problems by leveraging local data processing, storage, filtering, and machine intelligence within an intermediate fog layer that resides between the cloud and the wearable devices. This paper surveys two major aspects of deploying fog computing for smart and connected health. First, it investigates the role of machine-learning-based edge intelligence in the fog layer for data processing. The survey provides a comprehensive analysis, highlighting the strengths of, and possible improvements to, the existing literature. The paper ends with open challenges and future research areas in the domain of fog-based healthcare.
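
A common pattern in this line of work, filtering wearable data in the fog layer so that only salient events reach the cloud, might look like the following sketch. The threshold rule and function names are illustrative assumptions, standing in for the machine-learning models a real fog node would run.

```python
# Sketch of fog-layer filtering for wearable health data (illustrative only).
# The fog node processes readings locally and forwards only anomalies,
# cutting cloud-bound traffic and round-trip latency.

def is_anomalous(heart_rate_bpm: float) -> bool:
    """Stand-in for an edge ML model; here, a simple range check."""
    return not (40.0 <= heart_rate_bpm <= 120.0)

def fog_filter(readings):
    """Yield only the readings that warrant escalation to the cloud."""
    for timestamp, bpm in readings:
        if is_anomalous(bpm):
            yield timestamp, bpm  # would be sent upstream, e.g., over MQTT

stream = [(0, 72.0), (1, 160.0), (2, 75.0), (3, 38.0)]
print(list(fog_filter(stream)))  # -> [(1, 160.0), (3, 38.0)]
```

The design point is that normal readings never leave the fog layer, which is where the latency and bandwidth savings over a cloud-only pipeline come from.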


2019, Vol 5 (1), pp. 60-80
Author(s): Shlomi Dolev, Patricia Florissi, Ehud Gudes, Shantanu Sharma, Ido Singer

Author(s): Roman Čerešňák, Karol Matiaško, Adam Dudáš

The growth of the big data processing market has led to increasing overload of computation data centers and to changes in the methods used to store data, in the communication between computing units, and in the computational time needed to process or edit the data. Methods of distributed and parallel data processing have brought new problems related to data computation that need to be examined. Unlike conventional cloud services, a tight connection between the data and the computations is one of the main characteristics of big data services: computational tasks can be carried out only if the relevant data are available. Three factors influence the speed and efficiency of data processing: data duplication, data integrity, and data security. We are motivated to study the problems related to the growing time needed for data processing by optimizing these three factors in geographically distributed data centers.
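
Two of the three factors can be made concrete in one sketch: content hashing provides deduplication (identical blocks are stored once) and an integrity check at the same time. This is an illustrative Python outline, not the authors' method; the `BlockStore` class, the block granularity, and the replication targets are assumptions, and encryption (the security factor) is omitted.

```python
# Sketch: content-addressed blocks give deduplication and integrity checks
# before replication across geographically distributed data centers.
import hashlib

class BlockStore:
    def __init__(self):
        self.blocks = {}  # sha256 digest -> block bytes

    def put(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        self.blocks.setdefault(digest, data)  # duplicate blocks stored once
        return digest

    def verify(self, digest: str) -> bool:
        """Integrity check: stored bytes must still match their digest."""
        data = self.blocks.get(digest)
        return data is not None and hashlib.sha256(data).hexdigest() == digest

def replicate(digest: str, source: BlockStore, targets: list) -> None:
    """Copy a block to remote data centers only if it passes verification."""
    if not source.verify(digest):
        raise ValueError("corrupt block, refusing to replicate")
    for dc in targets:
        dc.put(source.blocks[digest])

dc_eu, dc_us = BlockStore(), BlockStore()
d = dc_eu.put(b"sensor batch 42")
dc_eu.put(b"sensor batch 42")        # deduplicated: stored only once
replicate(d, dc_eu, [dc_us])
print(dc_us.verify(d))  # -> True
```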


2019, Vol 12 (1), pp. 42
Author(s): Andrey I. Vlasov, Konstantin A. Muraviev, Alexandra A. Prudius, Demid A. Uzenkov
