Data Provenance in Scientific Workflows

Author(s):  
Khalid Belhajjame ◽  
Paolo Missier ◽  
Carole Goble

Data provenance is key to understanding and interpreting the results of scientific experiments. This chapter introduces and characterises data provenance in scientific workflows using illustrative examples taken from real-world workflows. The characterisation takes the form of a taxonomy that is used for comparing and analysing provenance capabilities supplied by existing scientific workflow systems.
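A taxonomy like the one described can be thought of as a set of dimensions along which workflow systems' provenance capabilities are compared. The sketch below is a minimal illustration of that idea, not the chapter's actual taxonomy: the system names, dimensions, and values are hypothetical placeholders.

```python
# Hypothetical provenance-capability taxonomy: each (made-up) workflow
# system is described by the value it takes on each taxonomy dimension.
TAXONOMY_DIMENSIONS = ("subject", "granularity", "storage", "queryability")

systems = {
    "SystemA": {"subject": "data", "granularity": "fine",
                "storage": "RDBMS", "queryability": "SQL"},
    "SystemB": {"subject": "process", "granularity": "coarse",
                "storage": "XML", "queryability": "none"},
}

def compare(systems, dimension):
    """Group systems by the value they take on one taxonomy dimension,
    which is the basic comparison a taxonomy enables."""
    groups = {}
    for name, capabilities in systems.items():
        groups.setdefault(capabilities[dimension], []).append(name)
    return groups
```

Comparing along one dimension then reduces to a simple grouping, e.g. `compare(systems, "granularity")` separates fine-grained from coarse-grained systems.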

2021 ◽  
Vol 12 (2) ◽  
Author(s):  
Leonardo Ramos ◽  
Fabio Porto ◽  
Daniel De Oliveira

Scientific research based on computer simulations is complex since it may involve managing enormous volumes of data and metadata produced during the life cycle of a scientific experiment, from the formulation of hypotheses to their final evaluation. This wealth of data needs to be structured and managed in a way that makes sense to scientists so that relevant knowledge can be extracted to contribute to the scientific research process. In addition, a scientific project as a whole may be associated with several different scientific experiments, which in turn may require executions of different scientific workflows, making the task rather arduous. All of this becomes even more difficult when we consider that project tasks must be associated with the execution of such simulations (which may take hours or even days), that the hypotheses about a phenomenon need validation and replication, and that the project team may be geographically dispersed. This article presents an approach called PhenoManager that aims at helping scientists manage their scientific projects and the cycle of the scientific method as a whole. PhenoManager can assist the scientist in structuring, validating, and reproducing hypotheses about a phenomenon through configurable computational models. For the evaluation, SciPhy, a scientific workflow from the field of bioinformatics, was used; the results indicate that the proposed approach brings gains without considerable performance losses.
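The core idea of linking a phenomenon to hypotheses that are validated by runs of computational models can be sketched minimally as below. This is not PhenoManager's actual data model; the class and field names are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Hypothesis:
    """A hypothesis about a phenomenon, initially unvalidated."""
    statement: str
    validated: bool = False

@dataclass
class Phenomenon:
    """A phenomenon under study, with the hypotheses formulated for it."""
    name: str
    hypotheses: list = field(default_factory=list)

    def validate(self, hypothesis, model_run):
        """Mark a hypothesis validated when a configurable computational
        model run (any callable returning True/False) supports it."""
        hypothesis.validated = bool(model_run())
        return hypothesis.validated

# Demo: a phylogeny phenomenon with one hypothesis, validated by a
# stand-in model run (a real run would execute a workflow such as SciPhy).
phenomenon = Phenomenon("phylogeny", [Hypothesis("tree T explains the data")])
ok = phenomenon.validate(phenomenon.hypotheses[0], lambda: True)
```

In this sketch the model run is just a callable, which mirrors the idea that hypotheses are tied to configurable, executable models rather than to prose alone.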


2014 ◽  
Vol 9 (2) ◽  
pp. 28-38 ◽  
Author(s):  
Víctor Cuevas-Vicenttín ◽  
Parisa Kianmajd ◽  
Bertram Ludäscher ◽  
Paolo Missier ◽  
Fernando Chirigati ◽  
...  

Scientific workflows and their supporting systems are becoming increasingly popular for compute-intensive and data-intensive scientific experiments. The advantages scientific workflows offer include rapid and easy workflow design, software and data reuse, scalable execution, sharing and collaboration, and other advantages that altogether facilitate “reproducible science”. In this context, provenance – information about the origin, context, derivation, ownership, or history of some artifact – plays a key role, since scientists are interested in examining and auditing the results of scientific experiments. However, in order to perform such analyses on scientific results as part of extended research collaborations, an adequate environment and tools are required. Concretely, the need arises for a repository that will facilitate the sharing of scientific workflows and their associated execution traces in an interoperable manner, also enabling querying and visualization. Furthermore, such functionality should be supported while taking performance and scalability into account. With this purpose in mind, we introduce PBase: a scientific workflow provenance repository implementing the proposed ProvONE standard, which extends the emerging W3C PROV standard for provenance data with workflow-specific concepts. PBase is built on the Neo4j graph database, thus offering capabilities such as declarative and efficient querying. Our experiences demonstrate the power gained by supporting various types of queries for provenance data. In addition, PBase is equipped with a user-friendly interface tailored for the visualization of scientific workflow provenance data, making the specification of queries and the interpretation of their results easier and more effective.
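A typical provenance query of the kind such a repository supports is transitive lineage: which entities was a result derived from? The sketch below illustrates this over PROV-style `wasDerivedFrom` edges using a plain in-memory dict; it is an assumption-laden stand-in, not PBase's Neo4j schema or a Cypher query, and the file names are invented.

```python
# PROV-style derivation edges: entity -> entities it was derived from.
# (Hypothetical example data; in PBase this graph lives in Neo4j.)
derived_from = {
    "result.csv": ["aligned.fasta"],
    "aligned.fasta": ["input1.fasta", "input2.fasta"],
}

def lineage(entity, graph):
    """Return every ancestor the entity was transitively derived from,
    i.e. the transitive closure of wasDerivedFrom edges."""
    seen, stack = set(), [entity]
    while stack:
        for parent in graph.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen
```

In a graph database the same traversal is expressed declaratively (e.g. a variable-length path pattern in Cypher), which is precisely the querying convenience the abstract attributes to Neo4j.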


Author(s):  
Sergio Manuel Serra da Cruz ◽  
Jose Antonio Pires do Nascimento

Reproducibility is a major feature of science. Even agronomic research of exemplary quality may have irreproducible empirical findings because of random or systematic error. The ability to reproduce agronomic experiments based on statistical data and legacy scripts is not easily achieved. We propose RFlow, a tool that aids researchers in managing, sharing, and enacting scientific experiments that encapsulate legacy R scripts. RFlow transparently captures the provenance of scripts and endows experiments with reproducibility. Unlike existing computational approaches, RFlow is non-intrusive and does not require users to change their way of working; instead, it wraps agronomic experiments in a scientific workflow system. Our computational experiments show that the tool can collect different types of provenance metadata from real experiments and enrich agronomic data with provenance metadata. This study shows the potential of RFlow to serve as a primary integration platform for legacy R scripts, with implications for other data- and compute-intensive agronomic projects.
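The non-intrusive capture idea can be sketched as a wrapper that records provenance metadata around the execution of an unmodified script. This is only an illustration of the pattern, not RFlow's implementation: the function name and record fields are assumptions, and a Python callable stands in for real R execution.

```python
import hashlib
import time

def run_with_provenance(script_source, runner):
    """Enact a legacy script (given as source text) through `runner`,
    transparently recording provenance metadata: a content hash of the
    script, timestamps, and the result. The script itself is unchanged."""
    record = {
        "sha256": hashlib.sha256(script_source.encode()).hexdigest(),
        "started": time.time(),
    }
    record["result"] = runner(script_source)
    record["finished"] = time.time()
    return record

# Demo with a stand-in runner; a tool like RFlow would invoke a real
# R interpreter here instead of this placeholder.
demo = run_with_provenance("x <- 1 + 1", lambda src: "ok")
```

Hashing the script source is what lets a later reader verify that a recorded result came from exactly this version of the legacy code.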


2012 ◽  
Vol 7 (2) ◽  
pp. 92-100 ◽  
Author(s):  
Richard Littauer ◽  
Karthik Ram ◽  
Bertram Ludäscher ◽  
William Michener ◽  
Rebecca Koskela

Scientific workflows are typically used to automate the processing, analysis and management of scientific data. Most scientific workflow programs provide a user-friendly graphical user interface that enables scientists to more easily create and visualize complex workflows that may comprise dozens of processing and analytical steps. Furthermore, many workflows provide mechanisms for tracing provenance and methodologies that foster reproducible science. Despite their potential for enabling science, few studies have examined how the process of creating, executing, and sharing workflows can be improved. In order to promote open discourse and access to scientific methods as well as data, we analyzed a wide variety of workflow systems and publicly available workflows on the public repository myExperiment. It is hoped that understanding the usage of workflows and developing a set of recommended best practices will lead to increased contribution of workflows to the public domain.


Author(s):  
Vasa Curcin ◽  
Moustafa Ghanem ◽  
Yike Guo

Motivated by the use of scientific workflows as a user-oriented mechanism for building executable scientific data integration and analysis applications, this article introduces a framework and a set of associated methods for analysing the execution properties of scientific workflows. Our framework uses a number of formal modelling techniques to characterize the process and data behaviour of workflows and workflow components and to reason about their functional and execution properties. We use the framework to design the architecture of a customizable tool that can be used to analyse the key execution properties of scientific workflows at the authoring stage. Our design is generic, can be applied to a wide variety of scientific workflow languages and systems, and is evaluated by building a prototype of the tool for the Discovery Net system. We demonstrate and discuss the utility of the framework and tool using workflows from a real-world medical informatics study.
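One execution property that can be checked at the authoring stage is whether the workflow's data dependencies admit any valid execution order at all. The sketch below shows this kind of static check on a toy dependency graph; it is a generic illustration with invented component names, not the article's formal models or the Discovery Net tool.

```python
def execution_order(deps):
    """Topologically order workflow components from their data
    dependencies, failing when a cycle makes the workflow unexecutable.
    `deps` maps each component to the components it consumes data from."""
    order, state = [], {}  # state: 1 = being visited, 2 = finished

    def visit(node):
        if state.get(node) == 1:
            raise ValueError("cycle detected: workflow is not executable")
        if state.get(node) != 2:
            state[node] = 1
            for dep in deps.get(node, []):
                visit(dep)
            state[node] = 2
            order.append(node)

    for node in deps:
        visit(node)
    return order

# Hypothetical three-step workflow: load -> clean -> analyse.
workflow = {"analyse": ["clean"], "clean": ["load"], "load": []}
```

Running the check while a workflow is being authored catches structural errors (such as cyclic dependencies) before any expensive execution is attempted.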


2009 ◽  
Vol 31 (5) ◽  
pp. 721-732 ◽  
Author(s):  
Li-Wei WANG ◽  
Ze-Qian HUANG ◽  
Min LUO ◽  
Zhi-Yong PENG

2005 ◽  
Vol 34 (3) ◽  
pp. 44-49 ◽  
Author(s):  
Jia Yu ◽  
Rajkumar Buyya

2008 ◽  
Vol 16 (2-3) ◽  
pp. 205-216 ◽  
Author(s):  
Bartosz Balis ◽  
Marian Bubak ◽  
Bartłomiej Łabno

Scientific workflows are a means of conducting in silico experiments in modern computing infrastructures for e-Science, often built on top of Grids. Monitoring of Grid scientific workflows is essential not only for performance analysis but also to collect provenance data and gather feedback useful in future decisions, e.g., related to optimization of resource usage. In this paper, basic problems related to monitoring of Grid scientific workflows are discussed. Being highly distributed, loosely coupled in space and time, heterogeneous, and heavily reliant on legacy codes, workflows are exceptionally challenging from the monitoring point of view. We propose a Grid monitoring architecture for scientific workflows. The monitoring data correlation problem is described, and an algorithm for on-line distributed collection of monitoring data is proposed. We demonstrate a prototype implementation of the proposed workflow monitoring architecture, the GEMINI monitoring system, and its use for monitoring a real-life scientific workflow.
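The correlation problem the paper names can be illustrated minimally: monitoring events arrive from distributed sources out of order, and must be grouped by the workflow activity they belong to before a per-activity timeline can be reconstructed. The sketch below shows that grouping step; the event fields and IDs are assumptions for illustration, not GEMINI's actual data model.

```python
from collections import defaultdict

def correlate(events):
    """Group monitoring events from distributed sources by the workflow
    activity they belong to, then sort each group by timestamp so that
    per-activity timelines can be reconstructed."""
    by_activity = defaultdict(list)
    for event in events:
        key = (event["workflow_id"], event["activity_id"])
        by_activity[key].append(event)
    for timeline in by_activity.values():
        timeline.sort(key=lambda e: e["timestamp"])
    return dict(by_activity)

# Hypothetical out-of-order events from two monitoring sources.
events = [
    {"workflow_id": "wf1", "activity_id": "a1", "timestamp": 2, "type": "end"},
    {"workflow_id": "wf1", "activity_id": "a1", "timestamp": 1, "type": "start"},
]
```

In a real on-line setting the grouping must happen incrementally as events stream in, which is what makes distributed collection harder than this batch sketch suggests.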


Author(s):  
Anton Michlmayr ◽  
Florian Rosenberg ◽  
Philipp Leitner ◽  
Schahram Dustdar

In general, provenance describes the origin and well-documented history of a given object. This notion has been applied in information systems, mainly to provide data provenance of scientific workflows. Similar to this, provenance in Service-oriented Computing has also focused on data provenance. However, the authors argue that in service-centric systems the origin and history of services is equally important. This paper presents an approach that addresses service provenance. The authors show how service provenance information can be collected and retrieved, and how security mechanisms guarantee integrity and access to this information, while also providing user-specific views on provenance. Finally, the paper gives a performance evaluation of the authors’ approach, which has been integrated into the VRESCo Web service runtime environment.
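The combination of collecting service provenance events and exposing user-specific views can be sketched as an append-only event log with a role-filtered read path. This is an illustrative pattern only, not the VRESCo implementation; the event fields, roles, and service name are invented.

```python
# Append-only log of service provenance events (hypothetical schema).
service_events = []

def record_event(service, event_type, detail, visibility="public"):
    """Record one step in a service's history, e.g. publication,
    rebinding, or a version change."""
    service_events.append({"service": service, "type": event_type,
                           "detail": detail, "visibility": visibility})

def provenance_view(service, role):
    """Return a user-specific view of a service's history: internal
    events are visible only to the (made-up) 'admin' role."""
    return [e for e in service_events
            if e["service"] == service
            and (e["visibility"] == "public" or role == "admin")]

# Demo: one public and one internal event for a hypothetical service.
record_event("GeoSvc", "published", "v1.0")
record_event("GeoSvc", "rebound", "endpoint moved", visibility="internal")
```

Keeping the log append-only gives the integrity property the abstract mentions, while the filtered read path provides the per-user views; a production system would enforce both with real security mechanisms rather than a role string.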
