Data Provenance in Scientific Workflows

Author(s):  
Khalid Belhajjame ◽  
Paolo Missier ◽  
Carole Goble

Data provenance is key to understanding and interpreting the results of scientific experiments. This chapter introduces and characterises data provenance in scientific workflows using illustrative examples taken from real-world workflows. The characterisation takes the form of a taxonomy that is used for comparing and analysing provenance capabilities supplied by existing scientific workflow systems.
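A taxonomy like the one described can be thought of as a set of dimensions along which workflow systems' provenance capabilities are compared. The sketch below is a minimal illustration of that idea, not the chapter's actual taxonomy: the system names, dimensions, and values are hypothetical placeholders.

```python
# Hypothetical provenance-capability taxonomy: each (made-up) workflow
# system is described by the value it takes on each taxonomy dimension.
TAXONOMY_DIMENSIONS = ("subject", "granularity", "storage", "queryability")

systems = {
    "SystemA": {"subject": "data", "granularity": "fine",
                "storage": "RDBMS", "queryability": "SQL"},
    "SystemB": {"subject": "process", "granularity": "coarse",
                "storage": "XML", "queryability": "none"},
}

def compare(systems, dimension):
    """Group systems by the value they take on one taxonomy dimension,
    which is the basic comparison a taxonomy enables."""
    groups = {}
    for name, capabilities in systems.items():
        groups.setdefault(capabilities[dimension], []).append(name)
    return groups
```

Comparing along one dimension then reduces to a simple grouping, e.g. `compare(systems, "granularity")` separates fine-grained from coarse-grained systems.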

2021 ◽  
Vol 12 (2) ◽  
Author(s):  
Leonardo Ramos ◽  
Fabio Porto ◽  
Daniel De Oliveira

Scientific research based on computer simulations is complex since it may involve managing enormous volumes of data and metadata produced during the life cycle of a scientific experiment, from the formulation of hypotheses to their final evaluation. This wealth of data needs to be structured and managed in a way that makes sense to scientists so that relevant knowledge can be extracted to contribute to the scientific research process. In addition, a scientific project as a whole may be associated with several different scientific experiments, which in turn may require executions of different scientific workflows, making the task rather arduous. All of this becomes even more difficult when we consider that project tasks must be associated with the execution of such simulations (which may take hours or even days), that the hypotheses about a phenomenon need validation and replication, and that the project team may be geographically dispersed. This article presents an approach called PhenoManager that aims at helping scientists manage their scientific projects and the cycle of the scientific method as a whole. PhenoManager can assist the scientist in structuring, validating, and reproducing hypotheses about a phenomenon through configurable computational models. For the evaluation, SciPhy, a scientific workflow from the field of bioinformatics, was used; the results indicate that the proposed approach brings gains without considerable performance losses.
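The core idea of linking a phenomenon to hypotheses that are validated by runs of computational models can be sketched minimally as below. This is not PhenoManager's actual data model; the class and field names are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Hypothesis:
    """A hypothesis about a phenomenon, initially unvalidated."""
    statement: str
    validated: bool = False

@dataclass
class Phenomenon:
    """A phenomenon under study, with the hypotheses formulated for it."""
    name: str
    hypotheses: list = field(default_factory=list)

    def validate(self, hypothesis, model_run):
        """Mark a hypothesis validated when a configurable computational
        model run (any callable returning True/False) supports it."""
        hypothesis.validated = bool(model_run())
        return hypothesis.validated

# Demo: a phylogeny phenomenon with one hypothesis, validated by a
# stand-in model run (a real run would execute a workflow such as SciPhy).
phenomenon = Phenomenon("phylogeny", [Hypothesis("tree T explains the data")])
ok = phenomenon.validate(phenomenon.hypotheses[0], lambda: True)
```

In this sketch the model run is just a callable, which mirrors the idea that hypotheses are tied to configurable, executable models rather than to prose alone.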


2014 ◽  
Vol 9 (2) ◽  
pp. 28-38 ◽  
Author(s):  
Víctor Cuevas-Vicenttín ◽  
Parisa Kianmajd ◽  
Bertram Ludäscher ◽  
Paolo Missier ◽  
Fernando Chirigati ◽  
...  

Scientific workflows and their supporting systems are becoming increasingly popular for compute-intensive and data-intensive scientific experiments. The advantages scientific workflows offer include rapid and easy workflow design, software and data reuse, scalable execution, sharing and collaboration, and other advantages that altogether facilitate “reproducible science”. In this context, provenance – information about the origin, context, derivation, ownership, or history of some artifact – plays a key role, since scientists are interested in examining and auditing the results of scientific experiments. However, in order to perform such analyses on scientific results as part of extended research collaborations, an adequate environment and tools are required. Concretely, the need arises for a repository that will facilitate the sharing of scientific workflows and their associated execution traces in an interoperable manner, also enabling querying and visualization. Furthermore, such functionality should be supported while taking performance and scalability into account. With this purpose in mind, we introduce PBase: a scientific workflow provenance repository implementing the proposed ProvONE standard, which extends the emerging W3C PROV standard for provenance data with workflow-specific concepts. PBase is built on the Neo4j graph database, thus offering capabilities such as declarative and efficient querying. Our experiences demonstrate the power gained by supporting various types of queries for provenance data. In addition, PBase is equipped with a user-friendly interface tailored for the visualization of scientific workflow provenance data, making the specification of queries and the interpretation of their results easier and more effective.
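A typical provenance query of the kind such a repository supports is transitive lineage: which entities was a result derived from? The sketch below illustrates this over PROV-style `wasDerivedFrom` edges using a plain in-memory dict; it is an assumption-laden stand-in, not PBase's Neo4j schema or a Cypher query, and the file names are invented.

```python
# PROV-style derivation edges: entity -> entities it was derived from.
# (Hypothetical example data; in PBase this graph lives in Neo4j.)
derived_from = {
    "result.csv": ["aligned.fasta"],
    "aligned.fasta": ["input1.fasta", "input2.fasta"],
}

def lineage(entity, graph):
    """Return every ancestor the entity was transitively derived from,
    i.e. the transitive closure of wasDerivedFrom edges."""
    seen, stack = set(), [entity]
    while stack:
        for parent in graph.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen
```

In a graph database the same traversal is expressed declaratively (e.g. a variable-length path pattern in Cypher), which is precisely the querying convenience the abstract attributes to Neo4j.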


Author(s):  
Sergio Manuel Serra da Cruz ◽  
Jose Antonio Pires do Nascimento

Reproducibility is a major feature of science. Even agronomic research of exemplary quality may have irreproducible empirical findings because of random or systematic error. The ability to reproduce agronomic experiments based on statistical data and legacy scripts is not easily achieved. We propose RFlow, a tool that aids researchers in managing, sharing, and enacting scientific experiments that encapsulate legacy R scripts. RFlow transparently captures the provenance of scripts and endows experiments with reproducibility. Unlike existing computational approaches, RFlow is non-intrusive and does not require users to change their way of working; instead, it wraps agronomic experiments in a scientific workflow system. Our computational experiments show that the tool can collect different types of provenance metadata from real experiments and enrich agronomic data with provenance metadata. This study shows the potential of RFlow to serve as a primary integration platform for legacy R scripts, with implications for other data- and compute-intensive agronomic projects.
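The non-intrusive capture idea can be sketched as a wrapper that records provenance metadata around the execution of an unmodified script. This is only an illustration of the pattern, not RFlow's implementation: the function name and record fields are assumptions, and a Python callable stands in for real R execution.

```python
import hashlib
import time

def run_with_provenance(script_source, runner):
    """Enact a legacy script (given as source text) through `runner`,
    transparently recording provenance metadata: a content hash of the
    script, timestamps, and the result. The script itself is unchanged."""
    record = {
        "sha256": hashlib.sha256(script_source.encode()).hexdigest(),
        "started": time.time(),
    }
    record["result"] = runner(script_source)
    record["finished"] = time.time()
    return record

# Demo with a stand-in runner; a tool like RFlow would invoke a real
# R interpreter here instead of this placeholder.
demo = run_with_provenance("x <- 1 + 1", lambda src: "ok")
```

Hashing the script source is what lets a later reader verify that a recorded result came from exactly this version of the legacy code.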


2012 ◽  
Vol 7 (2) ◽  
pp. 92-100 ◽  
Author(s):  
Richard Littauer ◽  
Karthik Ram ◽  
Bertram Ludäscher ◽  
William Michener ◽  
Rebecca Koskela

Scientific workflows are typically used to automate the processing, analysis and management of scientific data. Most scientific workflow programs provide a user-friendly graphical user interface that enables scientists to more easily create and visualize complex workflows that may comprise dozens of processing and analytical steps. Furthermore, many workflows provide mechanisms for tracing provenance and methodologies that foster reproducible science. Despite their potential for enabling science, few studies have examined how the process of creating, executing, and sharing workflows can be improved. In order to promote open discourse and access to scientific methods as well as data, we analyzed a wide variety of workflow systems and publicly available workflows on the public repository myExperiment. It is hoped that understanding the usage of workflows and developing a set of recommended best practices will lead to increased contribution of workflows to the public domain.


Author(s):  
Vasa Curcin ◽  
Moustafa Ghanem ◽  
Yike Guo

Motivated by the use of scientific workflows as a user-oriented mechanism for building executable scientific data integration and analysis applications, this article introduces a framework and a set of associated methods for analysing the execution properties of scientific workflows. Our framework uses a number of formal modelling techniques to characterize the process and data behaviour of workflows and workflow components and to reason about their functional and execution properties. We use the framework to design the architecture of a customizable tool that can be used to analyse the key execution properties of scientific workflows at the authoring stage. Our design is generic, can be applied to a wide variety of scientific workflow languages and systems, and is evaluated by building a prototype of the tool for the Discovery Net system. We demonstrate and discuss the utility of the framework and tool using workflows from a real-world medical informatics study.
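One execution property that can be checked at the authoring stage is whether the workflow's data dependencies admit any valid execution order at all. The sketch below shows this kind of static check on a toy dependency graph; it is a generic illustration with invented component names, not the article's formal models or the Discovery Net tool.

```python
def execution_order(deps):
    """Topologically order workflow components from their data
    dependencies, failing when a cycle makes the workflow unexecutable.
    `deps` maps each component to the components it consumes data from."""
    order, state = [], {}  # state: 1 = being visited, 2 = finished

    def visit(node):
        if state.get(node) == 1:
            raise ValueError("cycle detected: workflow is not executable")
        if state.get(node) != 2:
            state[node] = 1
            for dep in deps.get(node, []):
                visit(dep)
            state[node] = 2
            order.append(node)

    for node in deps:
        visit(node)
    return order

# Hypothetical three-step workflow: load -> clean -> analyse.
workflow = {"analyse": ["clean"], "clean": ["load"], "load": []}
```

Running the check while a workflow is being authored catches structural errors (such as cyclic dependencies) before any expensive execution is attempted.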


2009 ◽  
Vol 31 (5) ◽  
pp. 721-732 ◽  
Author(s):  
Li-Wei WANG ◽  
Ze-Qian HUANG ◽  
Min LUO ◽  
Zhi-Yong PENG

2005 ◽  
Vol 34 (3) ◽  
pp. 44-49 ◽  
Author(s):  
Jia Yu ◽  
Rajkumar Buyya

2008 ◽  
Vol 16 (2-3) ◽  
pp. 205-216 ◽  
Author(s):  
Bartosz Balis ◽  
Marian Bubak ◽  
Bartłomiej Łabno

Scientific workflows are a means of conducting in silico experiments in modern computing infrastructures for e-Science, often built on top of Grids. Monitoring of Grid scientific workflows is essential not only for performance analysis but also to collect provenance data and gather feedback useful in future decisions, e.g., related to optimization of resource usage. In this paper, basic problems related to monitoring of Grid scientific workflows are discussed. Being highly distributed, loosely coupled in space and time, heterogeneous, and heavily reliant on legacy codes, workflows are exceptionally challenging from the monitoring point of view. We propose a Grid monitoring architecture for scientific workflows. The monitoring data correlation problem is described, and an algorithm for on-line distributed collection of monitoring data is proposed. We demonstrate a prototype implementation of the proposed workflow monitoring architecture, the GEMINI monitoring system, and its use for monitoring a real-life scientific workflow.
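The correlation problem the paper names can be illustrated minimally: monitoring events arrive from distributed sources out of order, and must be grouped by the workflow activity they belong to before a per-activity timeline can be reconstructed. The sketch below shows that grouping step; the event fields and IDs are assumptions for illustration, not GEMINI's actual data model.

```python
from collections import defaultdict

def correlate(events):
    """Group monitoring events from distributed sources by the workflow
    activity they belong to, then sort each group by timestamp so that
    per-activity timelines can be reconstructed."""
    by_activity = defaultdict(list)
    for event in events:
        key = (event["workflow_id"], event["activity_id"])
        by_activity[key].append(event)
    for timeline in by_activity.values():
        timeline.sort(key=lambda e: e["timestamp"])
    return dict(by_activity)

# Hypothetical out-of-order events from two monitoring sources.
events = [
    {"workflow_id": "wf1", "activity_id": "a1", "timestamp": 2, "type": "end"},
    {"workflow_id": "wf1", "activity_id": "a1", "timestamp": 1, "type": "start"},
]
```

In a real on-line setting the grouping must happen incrementally as events stream in, which is what makes distributed collection harder than this batch sketch suggests.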


Author(s):  
Anton Michlmayr ◽  
Florian Rosenberg ◽  
Philipp Leitner ◽  
Schahram Dustdar

In general, provenance describes the origin and well-documented history of a given object. This notion has been applied in information systems, mainly to provide data provenance of scientific workflows. Similar to this, provenance in Service-oriented Computing has also focused on data provenance. However, the authors argue that in service-centric systems the origin and history of services is equally important. This paper presents an approach that addresses service provenance. The authors show how service provenance information can be collected and retrieved, and how security mechanisms guarantee integrity and access to this information, while also providing user-specific views on provenance. Finally, the paper gives a performance evaluation of the authors’ approach, which has been integrated into the VRESCo Web service runtime environment.
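The combination of collecting service provenance events and exposing user-specific views can be sketched as an append-only event log with a role-filtered read path. This is an illustrative pattern only, not the VRESCo implementation; the event fields, roles, and service name are invented.

```python
# Append-only log of service provenance events (hypothetical schema).
service_events = []

def record_event(service, event_type, detail, visibility="public"):
    """Record one step in a service's history, e.g. publication,
    rebinding, or a version change."""
    service_events.append({"service": service, "type": event_type,
                           "detail": detail, "visibility": visibility})

def provenance_view(service, role):
    """Return a user-specific view of a service's history: internal
    events are visible only to the (made-up) 'admin' role."""
    return [e for e in service_events
            if e["service"] == service
            and (e["visibility"] == "public" or role == "admin")]

# Demo: one public and one internal event for a hypothetical service.
record_event("GeoSvc", "published", "v1.0")
record_event("GeoSvc", "rebound", "endpoint moved", visibility="internal")
```

Keeping the log append-only gives the integrity property the abstract mentions, while the filtered read path provides the per-user views; a production system would enforce both with real security mechanisms rather than a role string.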
