Author(s):  
Rosa Filgueira ◽
Amrey Krause ◽  
Malcolm Atkinson ◽  
Iraklis Klampanos ◽  
Alexander Moreno

This paper presents dispel4py, a new Python framework for describing abstract stream-based workflows for distributed data-intensive applications. These combine the familiarity of Python programming with the scalability of workflows. Data streaming is used to gain performance, enable rapid prototyping and support application to live observations. dispel4py enables scientists to focus on their scientific goals, avoiding distracting details and retaining flexibility over the computing infrastructure they use. The implementation, therefore, has to map dispel4py abstract workflows optimally onto target platforms chosen dynamically. We present four dispel4py mappings: Apache Storm, message-passing interface (MPI), multi-threading and sequential, showing two major benefits: (a) smooth transitions from local development on a laptop to scalable execution for production work, and (b) scalable enactment on significantly different distributed computing infrastructures. Three application domains are reported, and measurements on multiple infrastructures show the optimisations achieved; these domains have provided demanding real applications and helped us develop effective training. dispel4py.org is an open-source project to which we invite participation. The effective mapping of dispel4py onto multiple target infrastructures demonstrates exploitation of data-intensive and high-performance computing (HPC) architectures and consistent scalability.
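To make the stream-based workflow idea concrete, below is a minimal, self-contained Python sketch in the spirit of dispel4py's processing elements and workflow composition. The names here (StreamPE, run_pipeline) are illustrative assumptions for this sketch, not dispel4py's actual API, and the sequential driver stands in for only one of the four mappings described above:

# Hypothetical sketch of stream-based workflow composition (not dispel4py's
# actual classes): each processing element (PE) consumes an input stream
# lazily and yields an output stream, so PEs can be chained and later mapped
# onto different execution backends.
from typing import Callable, Iterable, Iterator

class StreamPE:
    """A processing element wrapping a function applied to each stream item."""
    def __init__(self, fn: Callable):
        self.fn = fn

    def process(self, stream: Iterable) -> Iterator:
        for item in stream:
            result = self.fn(item)
            if result is not None:  # None means "filtered out"
                yield result

def run_pipeline(source: Iterable, pes: list) -> Iterator:
    """Sequential enactment: chain the PEs over the source stream."""
    stream = iter(source)
    for pe in pes:
        stream = pe.process(stream)
    return stream

if __name__ == "__main__":
    # A toy workflow: scale readings, then drop those below a threshold.
    scale = StreamPE(lambda x: x * 2.0)
    threshold = StreamPE(lambda x: x if x >= 4.0 else None)
    for value in run_pipeline([1.0, 2.5, 3.0], [scale, threshold]):
        print(value)  # prints 5.0 and 6.0

Because each PE only reads from an iterator and yields results, the same graph could in principle be enacted by threads or MPI ranks exchanging messages instead of this sequential loop, which is the portability the paper exploits.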


Author(s):  
J. Geetha ◽  
D. S. Jayalakshmi ◽  
Riya R. Ganiga ◽  
Shaguftha Zuveria Kottur ◽  
Tallapalli Surabhi

2018 ◽  
Vol 19 (3) ◽  
pp. 223-244
Author(s):  
Sonia Ikken ◽  
Eric Renault ◽  
Abdelkamel Tari ◽  
Tahar Kechadi

Several big data-driven applications are currently carried out collaboratively on distributed infrastructures. These applications usually deal with experiments at massive scale. The data generated by such experiments are huge and are stored at multiple geographic locations for reuse. Workflow systems, composed of jobs using collaborative task-based models, present new dependency and data-exchange needs. This gives rise to new issues when selecting distributed data and storage resources, so that applications execute on time and resource usage is cost-efficient. In this paper, we present an efficient data placement approach to improve the performance of workflow processing in distributed data centres. The proposed approach handles two types of intermediate data: splittable and unsplittable. Moreover, we place intermediate data by considering not only their source location but also their dependencies. The main objective is to minimise the total storage cost, including the effort for transferring, storing, and moving that data according to the applications' needs. We first propose an exact algorithm which takes into account the intra-job dependencies, and we show that the optimal fractional intermediate data placement problem is NP-hard. To solve the problem of unsplittable intermediate data placement, we propose a greedy heuristic algorithm based on a network flow optimisation framework. The experimental results show that the performance of our approach is very promising. We also show that, even under divergent conditions, the cost ratio of the heuristic approach is close to the optimal solution.
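As an illustration of the greedy idea for the unsplittable case, the Python sketch below assigns each unsplittable intermediate dataset to the data centre with the lowest combined storage-plus-transfer cost, subject to capacity. The cost model and all names here are simplifying assumptions for illustration; the paper's heuristic is built on a network-flow optimisation framework rather than this direct scan:

# Hypothetical greedy heuristic for unsplittable intermediate data placement:
# place each dataset at the feasible data centre minimising storage cost plus
# transfer cost from its source site. A simplification of the paper's
# network-flow-based heuristic, for illustration only.

def greedy_placement(datasets, centres, storage_cost, transfer_cost):
    """
    datasets: list of (name, size_gb, source_site)
    centres: dict centre -> remaining capacity in GB
    storage_cost: dict centre -> cost per GB stored
    transfer_cost: dict (source_site, centre) -> cost per GB moved
    Returns dict name -> chosen centre.
    """
    placement = {}
    # Placing large datasets first tends to avoid capacity dead-ends.
    for name, size, source in sorted(datasets, key=lambda d: -d[1]):
        feasible = [c for c, cap in centres.items() if cap >= size]
        if not feasible:
            raise RuntimeError(f"no centre can hold {name}")
        best = min(
            feasible,
            key=lambda c: size * (storage_cost[c] + transfer_cost[(source, c)]),
        )
        centres[best] -= size
        placement[name] = best
    return placement

if __name__ == "__main__":
    datasets = [("d1", 40, "s1"), ("d2", 10, "s2")]
    centres = {"dcA": 50, "dcB": 30}
    storage = {"dcA": 0.02, "dcB": 0.01}
    transfer = {("s1", "dcA"): 0.05, ("s1", "dcB"): 0.10,
                ("s2", "dcA"): 0.08, ("s2", "dcB"): 0.02}
    print(greedy_placement(datasets, centres, storage, transfer))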


2016 ◽  
Vol 12 (11) ◽  
pp. 22
Author(s):  
Yue-jie Li

The sensor data in wireless sensor networks arrive continuously in multiple, rapid, time-varying, possibly unpredictable and unbounded streams, and no record of historical information is kept. These characteristics make conventional Database Management Systems and their evolutions unsuitable for streams. There is therefore a need for a complete Data Streaming Management System (DSMS) that can process streams and perform dynamic continuous query processing. In this paper, a framework for an Adaptive Distributed Data Streaming Management System (ADDSMS) is presented, which operates as a stream-control interface between arrays of distributed data stream sources and the end-user clients who access and analyse these streams. Simulation results show that the proposed method can improve overall system performance substantially.
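To illustrate the kind of dynamic continuous query a DSMS evaluates over unbounded streams, here is a small generic Python sketch of a sliding-window aggregate with an anomaly test applied as each tuple arrives; it illustrates continuous query processing in general, not the ADDSMS interface:

# Generic illustration of continuous query processing over an unbounded
# stream (not the ADDSMS interface): maintain a sliding window of the most
# recent readings and emit an aggregate as each new tuple arrives.
from collections import deque
from typing import Iterable, Iterator, Tuple

def sliding_avg(stream: Iterable[float], window: int = 5) -> Iterator[Tuple[float, float]]:
    """Continuous query: for each reading, yield (reading, window average)."""
    buf = deque(maxlen=window)  # old tuples expire automatically
    for reading in stream:
        buf.append(reading)
        yield reading, sum(buf) / len(buf)

if __name__ == "__main__":
    sensor = [21.0, 21.5, 22.0, 35.0, 22.5]  # stand-in for a live feed
    for reading, avg in sliding_avg(sensor, window=3):
        if reading > 1.3 * avg:              # flag readings far above trend
            print(f"anomaly: {reading} (window avg {avg:.2f})")

Because the window expires old tuples as new ones arrive, the query runs in bounded memory no matter how long the stream is, which is exactly the property that makes conventional store-then-query DBMS processing unsuitable here.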


Author(s):  
Yashvi Barot

Abstract: The fundamental goal of this thesis is to present models for single as well as multiple query processing in distributed database systems which result in lower query-processing cost. One of the major issues in the design and implementation of Distributed Database Management Systems (DDBMS) is efficient query processing. The objective of distributed query optimisation reduces to minimising the amount of data to be transmitted among sites for processing a given query. The problem of query processing in DDBS (1 1) has been studied extensively in the literature. In most algorithms, the qualification of the query contains a sequence of operations. In such cases, when operations are executed from right to left, in the order they appear in the sequence, the result of one operation may be an operand of the next. Because the operations depend on one another, only one operation at one site executes at any instant, even though the environment is distributed; the systems at all other sites remain idle for that query. A new model, the Completely Reducible Relation Model (CRK Model), which permits parallelism and processes multiple operations simultaneously at all relevant sites, is presented. The operations are assumed to be in the form of conjunctions, so each operation can be processed independently. In this model, at any instant, the relations at all relevant sites are completely reduced by the corresponding sets of all applicable operations (selections, semijoins and joins) simultaneously. Hence, each relation needs to be scanned only once to process all applicable operations, reducing I/O cost.
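Since the model reduces relations using selections, semijoins and joins, the following Python sketch shows the classical semijoin reduction that such models apply at each site: only the join-attribute values travel between sites, and the remote relation is reduced to the tuples that can possibly join. This is a textbook illustration, not the CRK Model's implementation:

# Classical semijoin reduction (textbook illustration, not the CRK Model's
# implementation): site A sends only the join-attribute values of R to site B,
# and B keeps only the tuples of S that can possibly join, cutting both the
# data shipped between sites and the I/O needed for the final join.

def semijoin(s_tuples, r_join_values, attr):
    """Return the tuples of S whose value on `attr` appears in R's values."""
    return [t for t in s_tuples if t[attr] in r_join_values]

if __name__ == "__main__":
    # R at site A; only its projection on the join attribute travels.
    r = [{"id": 1, "x": "a"}, {"id": 3, "x": "b"}]
    r_ids = {t["id"] for t in r}          # small message: {1, 3}

    # S at site B is reduced before any full tuples are shipped.
    s = [{"id": 1, "y": 10}, {"id": 2, "y": 20}, {"id": 3, "y": 30}]
    print(semijoin(s, r_ids, "id"))       # keeps the tuples with id 1 and 3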

