Data-Intensive Workflow Management: For Clouds and Data-Intensive and Scalable Computing Environments

Daniel C. M. de Oliveira; Ji Liu; Esther Pacitti

doi:10.2200/s00915ed1v01y201904dtm060

Scientific workflows applied to the coupling of a continuum (Elmer v8.3) and a discrete element (HiDEM v1.0) ice dynamic model

Geoscientific Model Development ◽

10.5194/gmd-12-3001-2019 ◽

2019 ◽

Vol 12 (7) ◽

pp. 3001-3015 ◽

Cited By ~ 2

Author(s):

Shahbaz Memon ◽

Dorothée Vallot ◽

Thomas Zwinger ◽

Jan Åström ◽

Helmut Neukirchen ◽

...

Keyword(s):

Management System ◽

High Performance ◽

Heterogeneous Computing ◽

Workflow Management ◽

Scientific Workflow ◽

Workflow Management System ◽

Data Intensive ◽

Cpu Utilization ◽

Computing Environments ◽

High Level

Abstract. Scientific computing applications involving complex simulations and data-intensive processing are often composed of multiple tasks forming a workflow of computing jobs. Scientific communities running such applications on computing resources often find it cumbersome to manage and monitor the execution of these tasks and their associated data. These workflow implementations usually add overhead by introducing unnecessary input/output (I/O) for coupling the models and can lead to sub-optimal CPU utilization. Furthermore, running these workflow implementations in different environments requires significant adaptation efforts, which can hinder the reproducibility of the underlying science. High-level scientific workflow management systems (WMS) can be used to automate and simplify complex task structures by providing tooling for the composition and execution of workflows – even across distributed and heterogeneous computing environments. The WMS approach allows users to focus on the underlying high-level workflow and avoid low-level pitfalls that would lead to non-optimal resource usage while still allowing the workflow to remain portable between different computing environments. As a case study, we apply the UNICORE workflow management system to enable the coupling of a glacier flow model and calving model which contain many tasks and dependencies, ranging from pre-processing and data management to repetitive executions in heterogeneous high-performance computing (HPC) resource environments. Using the UNICORE workflow management system, the composition, management, and execution of the glacier modelling workflow becomes easier with respect to usage, monitoring, maintenance, reusability, portability, and reproducibility in different environments and by different user groups. Last but not least, the workflow helps to speed the runs up by reducing model coupling I/O overhead and it optimizes CPU utilization by avoiding idle CPU cores and running the models in a distributed way on the HPC cluster that best fits the characteristics of each model.

Download Full-text

Performance Evaluations of Distributed File Systems for Scientific Big Data in FUSE Environment

Electronics ◽

10.3390/electronics10121471 ◽

2021 ◽

Vol 10 (12) ◽

pp. 1471

Author(s):

Jun-Yeong Lee ◽

Moon-Hyun Kim ◽

Syed Asif Raza Raza Shah ◽

Sang-Un Ahn ◽

Heejun Yoon ◽

...

Keyword(s):

Data Storage ◽

Scale Up ◽

File Systems ◽

Performance Evaluations ◽

Distributed File Systems ◽

Data Intensive Computing ◽

Data Intensive ◽

Tremendous Amount ◽

Computing Environments ◽

And Performance

Data are important and ever growing in data-intensive scientific environments. Such research data growth requires data storage systems that play pivotal roles in data management and analysis for scientific discoveries. Redundant Array of Independent Disks (RAID), a well-known storage technology combining multiple disks into a single large logical volume, has been widely used for the purpose of data redundancy and performance improvement. However, this requires RAID-capable hardware or software to build up a RAID-enabled disk array. In addition, it is difficult to scale up the RAID-based storage. In order to mitigate such a problem, many distributed file systems have been developed and are being actively used in various environments, especially in data-intensive computing facilities, where a tremendous amount of data have to be handled. In this study, we investigated and benchmarked various distributed file systems, such as Ceph, GlusterFS, Lustre and EOS for data-intensive environments. In our experiment, we configured the distributed file systems under a Reliable Array of Independent Nodes (RAIN) structure and a Filesystem in Userspace (FUSE) environment. Our results identify the characteristics of each file system that affect the read and write performance depending on the features of data, which have to be considered in data-intensive computing environments.

Download Full-text

BigSift: automated debugging of big data analytics in data-intensive scalable computing

Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering - ESEC/FSE 2018 ◽

10.1145/3236024.3264586 ◽

2018 ◽

Cited By ~ 3

Author(s):

Muhammad Ali Gulzar ◽

Siman Wang ◽

Miryung Kim

Keyword(s):

Big Data ◽

Data Analytics ◽

Big Data Analytics ◽

Scalable Computing ◽

Data Intensive ◽

Automated Debugging

Download Full-text

Data Intensive Scalable Computing (DISC)

10.5753/eradsp.2020.16873 ◽

2020 ◽

Author(s):

Mario A. R. Dantas

Keyword(s):

Big Data ◽

Scalable Computing ◽

Data Intensive ◽

And Storage

This work presents an introduction to the Data Intensive Scalable Computing (DISC) approach. This paradigm represents a valuable effort to tackle the large amount of data produced by several ordinary applications. Therefore, subjects such as characterization of big data and storage approaches, in addition to brief comparison between HPC and DISC are differentiated highlight.

Download Full-text

On Task Assignment in Data Intensive Scalable Computing

Job Scheduling Strategies for Parallel Processing - Lecture Notes in Computer Science ◽

10.1007/978-3-662-43779-7_8 ◽

2014 ◽

pp. 136-155

Author(s):

Giovanni Agosta ◽

Gerardo Pelosi ◽

Ettore Speziale

Keyword(s):

Task Assignment ◽

Scalable Computing ◽

Data Intensive

Download Full-text

A Comprehensive Survey on Data-Intensive Computing and MapReduce Paradigm in Cloud Computing Environments

Informatics and Communication Technologies for Societal Development ◽

10.1007/978-81-322-1916-3_9 ◽

2014 ◽

pp. 85-93

Author(s):

Girish Neelakanta Iyer ◽

Salaja Silas

Keyword(s):

Cloud Computing ◽

Data Intensive Computing ◽

Data Intensive ◽

Comprehensive Survey ◽

Computing Environments ◽

Mapreduce Paradigm

Download Full-text

Proceedings of the 2nd Joint International Workshop on Parallel Data Storage & Data Intensive Scalable Computing Systems - PDSW-DISCS '17

10.1145/3149393 ◽

2017 ◽

Keyword(s):

Data Storage ◽

International Workshop ◽

Computing Systems ◽

Scalable Computing ◽

Data Intensive ◽

Parallel Data

Download Full-text

A New Data Classification Algorithm for Data-Intensive Computing Environments

Advanced Materials Research ◽

10.4028/www.scientific.net/amr.756-759.3318 ◽

2013 ◽

Vol 756-759 ◽

pp. 3318-3323

Author(s):

Qi Zhi Deng ◽

Long Bo Zhang ◽

Xin Qian ◽

Ya Li Chen ◽

Feng Ying Wang

Keyword(s):

Data Mining ◽

Large Datasets ◽

Data Availability ◽

Learning Method ◽

Data Intensive Computing ◽

Data Intensive ◽

Distributed Computations ◽

Split Point ◽

Computing Environments ◽

Mapreduce Model

In order to solve the problem of how to improve the scalability of data processing capabilities and the data availability which encountered by data mining techniques for Data-intensive computing, a new method of tree learning is presented in this paper. By introducing the MapReduce, the tree learning method based on SPRINT can obtain a well scalability when address large datasets. Moreover, we define the process of split point as a series of distributed computations, which is implemented with the MapReduce model respectively. And a new data structure called class distribution table is introduced to assist the calculation of histogram. Experiments and results analysis shows that the algorithm has strong processing capabilities of data mining for data-intensive computing environments.

Download Full-text