Data Management in Scientific Workflows

Author(s):  
Ewa Deelman ◽  
Ann Chervenak

Scientific applications in astronomy, earthquake science, gravitational-wave physics, and other fields have embraced workflow technologies to do large-scale science. Workflows enable researchers to collaboratively design, manage, and obtain results from analyses that involve hundreds of thousands of steps, access terabytes of data, and generate comparable amounts of intermediate and final data products. Although workflow systems facilitate the automated generation of data products, many issues remain to be addressed, and they take different forms across the workflow lifecycle. This chapter describes a workflow lifecycle consisting of a workflow generation phase, where the analysis is defined; a workflow planning phase, where the resources needed for execution are selected; a workflow execution phase, where the actual computations take place; and a final phase in which results, metadata, and provenance are stored. The authors discuss the data management issues that arise at each step of this cycle. They describe challenge problems and illustrate them in the context of real-life applications, and they discuss the challenges, possible solutions, and open issues faced when mapping and executing large-scale workflows on current cyberinfrastructure, with particular emphasis on the management of data throughout the workflow lifecycle.
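To make the four lifecycle phases concrete, the sketch below models them as plain Python functions over a small task graph. The names (Task, generate_workflow, plan, execute, store) and the round-robin site selection are illustrative assumptions, not the API of any workflow system discussed in the chapter.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Minimal sketch of the four lifecycle phases: generation, planning,
# execution, and storage of results, metadata, and provenance.
# All names here are hypothetical and for illustration only.

@dataclass
class Task:
    name: str
    inputs: List[str] = field(default_factory=list)  # upstream task names
    site: Optional[str] = None                       # chosen during planning

def generate_workflow() -> List[Task]:
    """Generation phase: define the analysis as a dependency-ordered task list."""
    return [Task("extract"),
            Task("transform", inputs=["extract"]),
            Task("analyze", inputs=["transform"])]

def plan(tasks: List[Task], sites: List[str]) -> List[Task]:
    """Planning phase: select an execution resource for each task (round robin)."""
    for i, task in enumerate(tasks):
        task.site = sites[i % len(sites)]
    return tasks

def execute(tasks: List[Task]):
    """Execution phase: run tasks in order and record provenance as we go."""
    results, provenance = {}, []
    for task in tasks:
        results[task.name] = f"data product of {task.name}"
        provenance.append({"task": task.name, "site": task.site,
                           "inputs": list(task.inputs)})
    return results, provenance

def store(results, provenance):
    """Final phase: persist results, metadata, and provenance records."""
    for record in provenance:
        print("provenance:", record)

if __name__ == "__main__":
    tasks = plan(generate_workflow(), sites=["cluster-a", "cluster-b"])
    store(*execute(tasks))
```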

2021 ◽  
Author(s):  
Swati Gehlot ◽  
Karsten Peters-von Gehlen ◽  
Andrea Lammert

Large-scale transient climate simulations and their intercomparison with paleo data within the German initiative PalMod (www.palmod.de, currently in phase II) provide a prime example of applying a Data Management Plan (DMP) to conceptualise data workflows within and beyond a large multidisciplinary project. PalMod-II data products include the output of three state-of-the-art climate models with various coupling complexities and spatial resolutions, simulating the climate of the past 130,000 years. In addition to these long time series of model data, a comprehensive compilation of paleo-observation data (including a model-observation-comparison toolbox, Baudouin et al., 2021, EGU-CL1.2) is envisaged for validation.

Owing to the enormous amount of data coming from models and observations, produced and handled by different groups of scientists spread across various institutions, a dedicated DMP maintained as a living document provides a data-workflow framework for exchanging and sharing data within and outside the PalMod community. The DMP covers the data life cycle within the project, from generation (data formats and standards), through analysis (intercomparison of models and observations), publication (usage, licences), and dissemination (standardised, via ESGF), to archiving after the project lifetime. As an actively and continually updated document, the DMP records the ownership of and responsibility for the data subsets of the various working groups, along with their data sharing and reuse rules, in order to sustain progress towards the project goals.

This contribution discusses the current status and challenges of the DMP for PalMod-II, which covers the data produced by the various working groups, the project-wide workflow strategy for sharing and exchanging data, and the definition of a PalMod-II variable list for standardised ESGF publication. The FAIR (Findability, Accessibility, Interoperability, and Reusability) data principles play a central role and are proposed for the entire life cycle of PalMod-II data products (model and proxy paleo data), covering sharing and reuse during and after the project lifetime.
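The abstract mentions a PalMod-II variable list for standardised ESGF publication; the snippet below sketches what a minimal entry and consistency check for such a list could look like. The field names, required attributes, and example variables are assumptions for illustration, not the actual PalMod-II or ESGF specification.

```python
# Hypothetical sketch of a variable-list entry for standardised publication.
# Field names and required attributes are illustrative assumptions, not the
# actual PalMod-II or ESGF specification.

REQUIRED_FIELDS = {"variable", "standard_name", "units", "frequency", "realm"}

variable_list = [
    {"variable": "tas", "standard_name": "air_temperature",
     "units": "K", "frequency": "mon", "realm": "atmos"},
    {"variable": "tos", "standard_name": "sea_surface_temperature",
     "units": "degC", "frequency": "mon", "realm": "ocean"},
]

def validate(entries):
    """Check that every entry declares the agreed metadata fields."""
    problems = []
    for entry in entries:
        missing = REQUIRED_FIELDS - entry.keys()
        if missing:
            problems.append((entry.get("variable", "?"), sorted(missing)))
    return problems

if __name__ == "__main__":
    for variable, missing in validate(variable_list):
        print(f"{variable}: missing fields {missing}")
```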


2021 ◽  
Vol 7 ◽  
pp. e527
Author(s):  
Renan Souza ◽  
Vitor Silva ◽  
Alexandre A. B. Lima ◽  
Daniel de Oliveira ◽  
Patrick Valduriez ◽  
...  

Complex scientific experiments from various domains are typically modeled as workflows and executed on large-scale machines using a Parallel Workflow Management System (WMS). Since such executions usually last for hours or days, some WMSs provide user steering support, i.e., they allow users to run data analyses and, depending on the results, adapt the workflows at runtime. A challenge in designing parallel execution control is to manage workflow data for efficient execution while enabling user steering. Data access for high scalability is typically transaction-oriented, whereas data analysis calls for online analytical processing, and managing such hybrid workloads makes the challenge even harder. In this work, we present SchalaDB, an architecture with a set of design principles and techniques based on distributed in-memory data management for efficient workflow execution control and user steering. We propose a distributed data design for scalable workflow task scheduling and high availability driven by a parallel and distributed in-memory DBMS. To evaluate our proposal, we develop d-Chiron, a WMS designed according to SchalaDB’s principles. We carry out an extensive experimental evaluation on an HPC cluster with up to 960 computing cores. Among other analyses, we show that even when running data analyses for user steering, SchalaDB’s overhead is negligible for workloads composed of hundreds of concurrent tasks on shared data. Our results encourage workflow engine developers to follow a parallel and distributed data-oriented approach not only for scheduling and monitoring but also for user steering.
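As a rough illustration of the hybrid workload described above, the sketch below uses an in-memory SQLite database as a stand-in for the distributed in-memory DBMS: task-state updates are short transactions, while a steering query aggregates over the same table at runtime. The schema and queries are hypothetical and are not taken from SchalaDB or d-Chiron.

```python
import sqlite3

# Stand-in for a distributed in-memory DBMS: an in-memory SQLite database
# holding workflow task state. Schema and queries are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE task (
    id INTEGER PRIMARY KEY, status TEXT, duration_s REAL, param_x REAL)""")

# Transactional side: the engine updates task state as execution proceeds.
tasks = [(i, "done" if i % 3 else "running", 1.5 * i, 0.1 * i) for i in range(100)]
conn.executemany("INSERT INTO task VALUES (?, ?, ?, ?)", tasks)
conn.commit()

# Analytical side: a user steering query over the same data at runtime,
# e.g. checking whether a parameter range is worth pruning.
row = conn.execute("""
    SELECT COUNT(*), AVG(duration_s)
    FROM task
    WHERE status = 'done' AND param_x > 5.0
""").fetchone()
print(f"finished tasks with param_x > 5.0: {row[0]}, mean duration {row[1]:.2f}s")
```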


The Les Houches Summer School 2015 covered the emerging fields of cavity optomechanics and quantum nanomechanics. Optomechanics is flourishing and its concepts and techniques are now applied to a wide range of topics. Modern quantum optomechanics was born in the late 1970s in the framework of gravitational-wave interferometry, initially focusing on the quantum limits of displacement measurements. Carlton Caves, Vladimir Braginsky, and others realized that the sensitivity of the anticipated large-scale gravitational-wave interferometers (GWIs) was fundamentally limited by the quantum fluctuations of the measurement laser beam. After tremendous experimental progress, the sensitivity of the next generation of GWIs will effectively be limited by quantum noise. In this way, quantum-optomechanical effects will directly affect the operation of what is arguably the world’s most impressive precision experiment. However, optomechanics has also gained a life of its own, with a focus on the quantum aspects of moving mirrors. Laser light can be used to cool mechanical resonators well below the temperature of their environment. After proof-of-principle demonstrations of this cooling in 2006, a number of systems were used as the field gradually merged with its condensed-matter cousin (nanomechanical systems) to try to reach the mechanical quantum ground state, eventually demonstrated in 2010 by purely cryogenic techniques and a year later by a combination of cryogenic and radiation-pressure cooling. The book covers all aspects—historical, theoretical, experimental—of the field, with its applications to quantum measurement, the foundations of quantum mechanics, and quantum information. It is essential reading for any researcher in the field.


2004 ◽  
Vol 21 (5) ◽  
pp. S1161-S1172 ◽  
Author(s):  
T Uchiyama ◽  
K Kuroda ◽  
M Ohashi ◽  
S Miyoki ◽  
H Ishitsuka ◽  
...  

2021 ◽  
Vol 55 (1) ◽  
pp. 1-2
Author(s):  
Bhaskar Mitra

Neural networks with deep architectures have demonstrated significant performance improvements in computer vision, speech recognition, and natural language processing. The challenges in information retrieval (IR), however, differ from those in these other application areas. A common form of IR involves ranking documents---or short passages---in response to keyword-based queries. Effective IR systems must deal with the query-document vocabulary mismatch problem by modeling relationships between different query and document terms and how they indicate relevance. Models should also consider lexical matches when the query contains rare terms---such as a person's name or a product model number---not seen during training, and should avoid retrieving semantically related but irrelevant results. In many real-life IR tasks, retrieval involves extremely large collections---such as the document index of a commercial Web search engine---containing billions of documents. Efficient IR methods should take advantage of specialized IR data structures, such as the inverted index, to retrieve efficiently from large collections. Given an information need, the IR system also mediates how much exposure an information artifact receives by deciding whether it should be displayed, and where it should be positioned, among other results. Exposure-aware IR systems may optimize for additional objectives, besides relevance, such as parity of exposure for retrieved items and content publishers. In this thesis, we present novel neural architectures and methods motivated by the specific needs and challenges of IR tasks. We ground our contributions with a detailed survey of the growing body of neural IR literature [Mitra and Craswell, 2018]. Our key contribution towards improving the effectiveness of deep ranking models is the Duet principle [Mitra et al., 2017], which emphasizes the importance of incorporating evidence based on both patterns of exact term matches and similarities between learned latent representations of query and document. To efficiently retrieve from large collections, we develop a framework for incorporating query term independence [Mitra et al., 2019] into arbitrary deep models, enabling large-scale precomputation and the use of the inverted index for fast retrieval. In the context of stochastic ranking, we further develop optimization strategies for exposure-based objectives [Diaz et al., 2020]. Finally, this dissertation summarizes our contributions towards benchmarking neural IR models in the presence of large training datasets [Craswell et al., 2019] and explores the application of neural methods to other IR tasks, such as query auto-completion.
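The Duet principle combines evidence from exact term matches with similarities between learned latent representations. The toy scoring function below sketches that combination using a simple term-overlap score and a cosine similarity over (here random) embeddings; the mixing weight, vocabulary, and embedding lookup are illustrative assumptions, not the published Duet architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy latent representations; a real model would learn these.
vocab = {"neural": 0, "ranking": 1, "models": 2, "for": 3, "search": 4}
embeddings = rng.normal(size=(len(vocab), 8))

def lexical_score(query, doc):
    """Exact-match evidence: fraction of query terms appearing in the document."""
    q, d = set(query.split()), set(doc.split())
    return len(q & d) / max(len(q), 1)

def semantic_score(query, doc):
    """Latent-match evidence: cosine similarity of mean term embeddings."""
    def encode(text):
        ids = [vocab[t] for t in text.split() if t in vocab]
        return embeddings[ids].mean(axis=0) if ids else np.zeros(8)
    q, d = encode(query), encode(doc)
    denom = np.linalg.norm(q) * np.linalg.norm(d) or 1.0
    return float(q @ d) / denom

def duet_style_score(query, doc, alpha=0.5):
    """Blend both evidence sources; alpha is an illustrative mixing weight."""
    return alpha * lexical_score(query, doc) + (1 - alpha) * semantic_score(query, doc)

print(duet_style_score("neural ranking", "neural models for search"))
```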


Author(s):  
Krzysztof Jurczuk ◽  
Marcin Czajkowski ◽  
Marek Kretowski

This paper concerns the evolutionary induction of decision trees (DT) for large-scale data. Such a global approach is one alternative to top-down inducers: it searches for the tree structure and the tests simultaneously and thus, in many situations, improves the prediction quality and reduces the size of the resulting classifiers. However, this population-based, iterative approach can be too computationally demanding to apply directly to big data mining. The paper demonstrates that this barrier can be overcome by smart distributed/parallel processing. Moreover, we ask whether the global approach can truly compete with greedy systems on large-scale data. For this purpose, we propose a novel multi-GPU approach. It combines knowledge of global DT induction and evolutionary algorithm parallelization with efficient utilization of GPU memory and compute resources. The search for the tree structure and tests is performed on a CPU, while the fitness calculations are delegated to the GPUs, using a data-parallel decomposition strategy and the CUDA framework. Experimental validation is performed on both artificial and real-life datasets, and in both cases the obtained acceleration is very satisfactory. The solution is able to process even billions of instances in a few hours on a single workstation equipped with 4 GPUs. The impact of data characteristics (size and dimension) on the convergence and speedup of the evolutionary search is also shown. When the number of GPUs grows, nearly linear scalability is observed, which suggests that the data-size boundaries for evolutionary DT mining are fading.
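The key idea is that the evolutionary search runs on the CPU while fitness is computed in a data-parallel fashion across devices and then reduced. The sketch below illustrates that pattern with ordinary Python processes standing in for the per-GPU workers and a decision stump standing in for a candidate tree; it is not the authors' CUDA implementation, and the shard count, fitness definition, and candidate encoding are assumptions.

```python
from multiprocessing import Pool
import numpy as np

# Illustrative data-parallel fitness evaluation: the dataset is split into
# shards (one per GPU in the paper; plain CPU processes here as a stand-in),
# each shard scores the candidate locally, and partial results are reduced
# on the host. The stump "tree" and accuracy-based fitness are assumptions.

def evaluate_shard(args):
    """Count correct predictions of a decision stump on one data shard."""
    X, y, (feature, threshold) = args
    predictions = (X[:, feature] > threshold).astype(int)
    return int((predictions == y).sum()), len(y)

def fitness(candidate, shards):
    """Delegate shard evaluation to workers, then reduce to global accuracy."""
    with Pool(processes=len(shards)) as pool:
        partials = pool.map(evaluate_shard, [(X, y, candidate) for X, y in shards])
    correct = sum(c for c, _ in partials)
    total = sum(n for _, n in partials)
    return correct / total

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=(40_000, 10))
    y = (X[:, 3] > 0.2).astype(int)
    shards = [(X[i::4], y[i::4]) for i in range(4)]   # 4 shards ~ 4 devices
    print("fitness of stump (feature 3, threshold 0.2):",
          fitness((3, 0.2), shards))
```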


Author(s):  
Gianluca Bardaro ◽  
Alessio Antonini ◽  
Enrico Motta

Over the last two decades, several deployments of robots for in-house assistance of older adults have been trialled. However, these solutions are mostly prototypes and remain unused in real-life scenarios. In this work, we review the historical and current landscape of the field to understand why robots have yet to succeed as personal assistants in daily life. Our analysis focuses on two complementary aspects: the capabilities of the physical platform and the logic of the deployment. The former shows regularities in hardware configurations and functionalities, leading to the definition of a set of six application-level capabilities (exploration, identification, remote control, communication, manipulation, and digital situatedness). The latter focuses on the impact of robots on the daily life of users and categorises the deployment of robots for healthcare interventions using three types of services: support, mitigation, and response. Our investigation reveals that the value of healthcare interventions is limited by a stagnation of functionalities and a disconnection between the robotic platform and the design of the intervention. To address this issue, we propose a novel co-design toolkit, which uses an ecological framework for robot interventions in the healthcare domain. Our approach connects robot capabilities with known geriatric factors to create a holistic view encompassing both the physical platform and the logic of the deployment. As a case-study-based validation, we discuss the use of the toolkit in the pre-design of the robotic platform for a pilot intervention, part of the large-scale pilot of the EU H2020 GATEKEEPER project.
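The toolkit connects the six application-level capabilities with geriatric factors across the three service types. The snippet below sketches one way such a mapping could be encoded and queried during pre-design; the specific factor names and capability pairings are illustrative assumptions, not the toolkit's actual content.

```python
# Hypothetical encoding of the co-design mapping: capabilities -> geriatric
# factors they can address, grouped by service type. Pairings are illustrative.

CAPABILITIES = ["exploration", "identification", "remote control",
                "communication", "manipulation", "digital situatedness"]

INTERVENTIONS = {
    # service type: {geriatric factor: capabilities assumed to address it}
    "support":    {"mobility limitation": ["manipulation", "remote control"]},
    "mitigation": {"social isolation": ["communication", "digital situatedness"]},
    "response":   {"fall risk": ["exploration", "identification", "communication"]},
}

def required_capabilities(service_type, factor):
    """Return the capabilities a platform would need for a given intervention."""
    return INTERVENTIONS.get(service_type, {}).get(factor, [])

if __name__ == "__main__":
    needed = required_capabilities("response", "fall risk")
    print("capabilities needed:", needed)
    print("all covered by known capabilities:", set(needed) <= set(CAPABILITIES))
```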


2021 ◽  
Vol 5 (1) ◽  
pp. 14
Author(s):  
Christos Makris ◽  
Georgios Pispirigos

Nowadays, due to the extensive use of information networks in a broad range of fields, e.g., bio-informatics, sociology, digital marketing, computer science, etc., graph theory applications have attracted significant scientific interest. Thanks to its inherent abstraction, community detection has become one of the most thoroughly studied graph partitioning problems. However, most existing algorithms are iterative solutions of high polynomial order that repeatedly require exhaustive analysis of the graph. Such methods are resource-demanding, poorly scalable, and inapplicable to big data graphs such as today’s social networks. In this article, a novel, near-linear, and highly scalable community prediction methodology is introduced. Specifically, using a distributed, stacking-based model built on plain network-topology characteristics of bootstrap-sampled subgraphs, the underlying community hierarchy of any given social network is efficiently extracted regardless of its size and density. The effectiveness of the proposed methodology has been thoroughly examined on numerous real-life social networks and shown to be superior to various similar approaches in terms of performance, stability, and accuracy.
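As a rough sketch of the described pipeline, the code below bootstrap-samples subgraphs, computes plain topology features for node pairs, and trains a stacking ensemble to predict whether two nodes belong to the same community. The feature set, base learners, and the use of a small labelled benchmark graph are illustrative assumptions, not the authors' exact design.

```python
import itertools
import networkx as nx
import numpy as np
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Labelled benchmark graph: nodes carry a known 'club' community.
G = nx.karate_club_graph()
label = {n: d["club"] for n, d in G.nodes(data=True)}

def pair_features(graph, u, v):
    """Plain topology characteristics for a node pair."""
    common = len(list(nx.common_neighbors(graph, u, v)))
    return [graph.degree(u), graph.degree(v), common,
            nx.clustering(graph, u), nx.clustering(graph, v)]

rng = np.random.default_rng(7)
X, y = [], []
for _ in range(5):                                   # bootstrap-sampled subgraphs
    nodes = rng.choice(list(G.nodes), size=25, replace=False)
    sub = G.subgraph(nodes)
    for u, v in itertools.combinations(sub.nodes, 2):
        X.append(pair_features(sub, u, v))
        y.append(int(label[u] == label[v]))          # same community?

# Stacking-based model over the topology features.
model = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=50, random_state=0))],
    final_estimator=LogisticRegression(max_iter=1000))
model.fit(np.array(X), np.array(y))
print("training accuracy:", model.score(np.array(X), np.array(y)))
```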

