uap: reproducible and robust HTS data analysis

2019 ◽  
Vol 20 (1) ◽  
Author(s):  
Christoph Kämpf ◽  
Michael Specht ◽  
Alexander Scholz ◽  
Sven-Holger Puppel ◽  
Gero Doose ◽  
...  

Abstract
Background: A lack of reproducibility has been repeatedly criticized in computational research. High-throughput sequencing (HTS) data analysis is a complex multi-step process. For most of the steps a range of bioinformatic tools is available, and for most tools manifold parameters need to be set. Due to this complexity, HTS data analysis is particularly prone to reproducibility and consistency issues. We have defined four criteria that, in our opinion, ensure a minimal degree of reproducible research for HTS data analysis. A number of workflow management systems are available for assisting complex multi-step data analyses. However, to the best of our knowledge, none of the currently available workflow management systems satisfies all four criteria for reproducible HTS analysis.
Results: Here we present uap, a workflow management system dedicated to robust, consistent, and reproducible HTS data analysis. uap is optimized for the application to omics data, but can be easily extended to other complex analyses. It is available under the GNU GPL v3 license at https://github.com/yigbt/uap.
Conclusions: uap is a freely available tool that enables researchers to easily adhere to reproducible research principles for HTS data analyses.
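The bookkeeping that makes such an analysis verifiable can be sketched in a few lines: before each step runs, record the tool name and version, the parameters, and checksums of the inputs, so a later re-run can be compared against the original record. The sketch below is purely illustrative and is not uap's actual implementation; the names `record_step` and `sha256_of` are hypothetical.

```python
import hashlib
import json

def sha256_of(data: bytes) -> str:
    """Return the SHA-256 hex digest of raw input data."""
    return hashlib.sha256(data).hexdigest()

def record_step(tool: str, version: str, params: dict, inputs: dict) -> str:
    """Serialize a step's tool version, parameters, and input checksums
    into a canonical JSON record that can be stored alongside results."""
    record = {
        "tool": tool,
        "version": version,
        "params": params,
        "input_sha256": {name: sha256_of(data) for name, data in inputs.items()},
    }
    # sort_keys makes the record byte-identical across runs.
    return json.dumps(record, sort_keys=True)

# Two runs with identical tool, parameters, and inputs yield identical
# records, so any later deviation (e.g. a changed parameter) is detectable.
r1 = record_step("mapper", "1.2.0", {"threads": 4}, {"reads": b"ACGT"})
r2 = record_step("mapper", "1.2.0", {"threads": 4}, {"reads": b"ACGT"})
assert r1 == r2
```

Comparing stored records like these is one simple way a workflow system can flag when a re-analysis no longer matches the published one.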


1999 ◽  
Vol 14 (2) ◽  
pp. 149-160 ◽  
Author(s):  
Neil F. Doherty ◽  
I. Perry

Workflow management systems (WFMSs) are an important new technology that is likely to have a significant impact on the way in which clerical and administrative operations are organized and executed. This paper investigates how WFMSs are being exploited and used commercially by UK-based organizations operating in the financial services sector. In-depth interviews were conducted with 14 project managers to explore the development, application and commercial implications of this powerful yet flexible technology. The results indicate that workflow technology has the potential to facilitate significant changes in the way in which an organization conducts its business, through the automation of a wide range of document-intensive operations. Furthermore, when applied in a well-focused manner, it has the potential to realize significant increases in an organization's flexibility and productivity, as well as to deliver major improvements to the quality, speed and consistency of customer service.


2018 ◽  
Author(s):  
Daniel Nüst ◽  
Carlos Granell ◽  
Barbara Hofer ◽  
Markus Konkol ◽  
Frank O Ostermann ◽  
...  

The demand for reproducibility of research is on the rise in disciplines concerned with data analysis and computational methods. In this work, existing recommendations for reproducible research are reviewed and translated into criteria for assessing the reproducibility of articles in the field of geographic information science (GIScience). Using a sample of GIScience research from the Association of Geographic Information Laboratories in Europe (AGILE) conference series, we assess the current state of reproducibility of publications in this field. Feedback on the assessment was collected by surveying the authors of the sample papers. The results show that reproducibility levels are low. Although authors support the ideals, the incentives to realize them are too small. We therefore propose concrete actions for individual researchers and the AGILE conference series to improve transparency and reproducibility, such as imparting data and software skills, an award, paper badges, author guidelines for computational research, and Open Access publications.


2020 ◽  
Author(s):  
Michael J. Jackson ◽  
Edward Wallace ◽  
Kostas Kavoussanakis

Abstract
Workflow management systems represent, manage, and execute multi-step computational analyses and offer many benefits to bioinformaticians. They provide a common language for describing analysis workflows, contributing to reproducibility and to building libraries of reusable components. They can support both incremental build and re-entrancy – the ability to selectively re-execute parts of a workflow in the presence of additional inputs or changes in configuration, and to resume execution from where a workflow previously stopped. Many workflow management systems enhance portability by supporting the use of containers, high-performance computing systems and clouds. Most importantly, workflow management systems allow bioinformaticians to delegate how their workflows are run to the workflow management system and its developers. This frees the bioinformaticians to focus on the content of these workflows, their data analyses, and their science.
RiboViz is a package to extract biological insight from ribosome profiling data to help advance understanding of protein synthesis. At the heart of RiboViz is an analysis workflow, implemented in a Python script. To conform to best practices for scientific computing, which recommend the use of build tools to automate workflows and the re-use of code instead of rewriting it, the authors reimplemented this workflow within a workflow management system. To select a workflow management system, a rapid survey of available systems was undertaken, and candidates were shortlisted: Snakemake, cwltool and Toil (implementations of the Common Workflow Language) and Nextflow. An evaluation of each candidate, via rapid prototyping of a subset of the RiboViz workflow, was performed and Nextflow was chosen. The selection process took 10 person-days, a small cost for the assurance that Nextflow best satisfied the authors' requirements.
This use of rapid prototyping can offer a low-cost way of making a more informed selection of software to use within projects, rather than relying solely upon reviews and recommendations by others.
Author summary
Data analysis involves many steps, as data are wrangled, processed, and analysed using a succession of unrelated software packages. Running all the right steps, in the right order, with the right outputs in the right places is a major source of frustration. Workflow management systems require that each data analysis step be "wrapped" in a structured way, describing its inputs, parameters, and outputs. By writing these wrappers the scientist can focus on the meaning of each step, which is the interesting part. The system uses these wrappers to decide what steps to run and how to run them, and takes charge of running the steps, including reporting on errors. This makes it much easier to run the analysis repeatedly and to run it transparently upon different computers. To select a workflow management system, we surveyed available tools and selected three for "rapid prototype" implementations to evaluate their suitability for our project. We advocate this rapid prototyping as a low-cost (in both time and effort) way of making an informed selection of a system for use within a project. We conclude that many similar multi-step data analysis workflows can be rewritten in a workflow management system.
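The "wrapping" the author summary describes, declaring each step's inputs, parameters, and outputs so the system can check and schedule them, can be sketched in plain Python. This is a conceptual illustration, not RiboViz or Nextflow code; the `Step` class, the `execute` driver, and the toy trim/count steps are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    """A wrapped analysis step: declared inputs, outputs, parameters,
    and the function that does the actual work."""
    name: str
    inputs: list
    outputs: list
    params: dict
    run: Callable[[dict, dict], dict]

def execute(steps, data):
    """Run steps in order, checking that declared inputs are present and
    collecting declared outputs; fail loudly on a missing input."""
    for step in steps:
        missing = [i for i in step.inputs if i not in data]
        if missing:
            raise KeyError(f"{step.name}: missing inputs {missing}")
        produced = step.run({i: data[i] for i in step.inputs}, step.params)
        data.update({o: produced[o] for o in step.outputs})
    return data

# A hypothetical two-step pipeline: trim the first n bases, then count reads.
trim = Step("trim", ["reads"], ["trimmed"], {"n": 2},
            lambda ins, p: {"trimmed": [r[p["n"]:] for r in ins["reads"]]})
count = Step("count", ["trimmed"], ["n_reads"], {},
             lambda ins, p: {"n_reads": len(ins["trimmed"])})
result = execute([trim, count], {"reads": ["AACGT", "TTGCA"]})
assert result["n_reads"] == 2
```

Because each step declares what it consumes and produces, the driver, rather than the scientist, can verify wiring and report errors, which is the division of labour the summary advocates.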


2019 ◽  
Vol 16 (3) ◽  
Author(s):  
Jens Allmer

Abstract
Big data and complex analysis workflows (pipelines) are common issues in data-driven science such as bioinformatics. A large number of computational tools are available for data analysis, and many workflow management systems have been developed to piece such tools together into data analysis pipelines. For example, more than 50 computational tools for read mapping are available, representing a large amount of duplicated effort. Furthermore, it is unclear whether these tools are correct, and only a few have a user base large enough to have encountered and reported most of the potential problems. Bringing together many largely untested tools in a computational pipeline must lead to unpredictable results. Yet, this is the current state. While data analysis is presently performed on personal computers, workstations, and clusters, the future will see development and analysis shift to the cloud. None of the existing workflow management systems is ready for this transition. This presents the opportunity to build a new system, which will overcome current duplications of effort, introduce proper testing, allow for development and analysis in public and private clouds, and include reporting features leading to interactive documents.


2018 ◽  
Vol 2 ◽  
pp. 31 ◽  
Author(s):  
Greg Finak ◽  
Bryan Mayer ◽  
William Fulp ◽  
Paul Obrecht ◽  
Alicia Sato ◽  
...  

A central tenet of reproducible research is that scientific results are published along with the underlying data and software code necessary to reproduce and verify the findings. A host of tools and software have been released that facilitate such work-flows, and scientific journals have increasingly demanded that code and primary data be made available with publications. There has been little practical advice on implementing reproducible research work-flows for large 'omics' or systems biology data sets used by teams of analysts working in collaboration. In such instances it is important to ensure all analysts use the same version of a data set for their analyses. Yet, instantiating relational databases and standard operating procedures can be unwieldy, with high "startup" costs and poor adherence to procedures when they deviate substantially from an analyst's usual work-flow. Ideally, a reproducible research work-flow should fit naturally into an individual's existing work-flow, with minimal disruption. Here, we provide an overview of how we have leveraged popular open source tools, including Bioconductor, Rmarkdown, git version control, R, and specifically R's package system combined with a new tool, DataPackageR, to implement a lightweight reproducible research work-flow for preprocessing large data sets, suitable for sharing among small-to-medium sized teams of computational scientists. Our primary contribution is the DataPackageR tool, which decouples time-consuming data processing from data analysis while leaving a traceable record of how raw data is processed into analysis-ready data sets. The software ensures packaged data objects are properly documented, performs checksum verification of these along with basic package version management, and, importantly, leaves a record of data processing code in the form of package vignettes.
Our group has implemented this work-flow to manage, analyze and report on pre-clinical immunological trial data from multi-center, multi-assay studies for the past three years.
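The checksum verification described above, storing a digest with each packaged data object and re-checking it on load, is simple to illustrate. The sketch below shows the general idea in Python; it is not DataPackageR's API (DataPackageR is an R package), and the names `package` and `load_verified` are hypothetical.

```python
import hashlib
import pickle

def package(objects: dict) -> dict:
    """'Package' processed data objects together with an MD5 digest of
    each serialized object, so later loads can be verified."""
    blobs = {k: pickle.dumps(v) for k, v in objects.items()}
    return {"data": blobs,
            "md5": {k: hashlib.md5(b).hexdigest() for k, b in blobs.items()}}

def load_verified(pkg: dict, name: str):
    """Reload an object only if its stored checksum still matches."""
    blob = pkg["data"][name]
    if hashlib.md5(blob).hexdigest() != pkg["md5"][name]:
        raise ValueError(f"checksum mismatch for {name!r}")
    return pickle.loads(blob)

pkg = package({"counts": [10, 20, 30]})
assert load_verified(pkg, "counts") == [10, 20, 30]

# Any tampering with the stored bytes is detected at load time.
pkg["data"]["counts"] += b"x"
try:
    load_verified(pkg, "counts")
except ValueError:
    pass  # mismatch correctly detected
```

This kind of digest bookkeeping is what lets a team of analysts confirm they are all working from the same version of a processed data set.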


PeerJ ◽  
2018 ◽  
Vol 6 ◽  
pp. e5072 ◽  
Author(s):  
Daniel Nüst ◽  
Carlos Granell ◽  
Barbara Hofer ◽  
Markus Konkol ◽  
Frank O. Ostermann ◽  
...  

The demand for reproducible research is on the rise in disciplines concerned with data analysis and computational methods. Therefore, we reviewed current recommendations for reproducible research and translated them into criteria for assessing the reproducibility of articles in the field of geographic information science (GIScience). Using these criteria, we assessed a sample of GIScience studies from the Association of Geographic Information Laboratories in Europe (AGILE) conference series, and we collected feedback about the assessment from the study authors. Results from the author feedback indicate that although authors support the concept of performing reproducible research, the incentives for doing this in practice are too small. Therefore, we propose concrete actions for individual researchers and the GIScience conference series to improve transparency and reproducibility. For example, to support researchers in producing reproducible work, the GIScience conference series could offer awards and paper badges, provide author guidelines for computational research, and publish articles in Open Access formats.


F1000Research ◽  
2021 ◽  
Vol 10 ◽  
pp. 33 ◽  
Author(s):  
Felix Mölder ◽  
Kim Philipp Jablonski ◽  
Brice Letcher ◽  
Michael B. Hall ◽  
Christopher H. Tomkins-Tinch ◽  
...  

Data analysis often entails a multitude of heterogeneous steps, from the application of various command line tools to the usage of scripting languages like R or Python for the generation of plots and tables. It is widely recognized that data analyses should ideally be conducted in a reproducible way. Reproducibility enables technical validation and regeneration of results on the original or even new data. However, reproducibility alone is by no means sufficient to deliver an analysis that is of lasting impact (i.e., sustainable) for the field, or even just one research group. We postulate that it is equally important to ensure adaptability and transparency. The former describes the ability to modify the analysis to answer extended or slightly different research questions. The latter describes the ability to understand the analysis in order to judge whether it is not only technically, but also methodologically, valid. Here, we analyze the properties needed for a data analysis to become reproducible, adaptable, and transparent. We show how the popular workflow management system Snakemake can be used to guarantee this, and how it enables an ergonomic, combined, unified representation of all steps involved in data analysis, ranging from raw data processing to quality control and fine-grained, interactive exploration and plotting of final results.
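The core mechanism that lets systems like Snakemake regenerate results economically is the make-style staleness rule: a step is re-executed only when its output is missing or older than one of its inputs. The sketch below shows that rule in plain Python; it is a conceptual illustration under stated assumptions, not Snakemake's implementation, and `outdated` and `run_step` are hypothetical names.

```python
import os
import tempfile

def outdated(output: str, inputs: list) -> bool:
    """A target is stale if it does not exist or is older than any input:
    the make-style rule behind incremental build and re-entrancy."""
    if not os.path.exists(output):
        return True
    out_mtime = os.path.getmtime(output)
    return any(os.path.getmtime(i) > out_mtime for i in inputs)

def run_step(output: str, inputs: list, action) -> bool:
    """Execute `action` only when `output` is stale; report whether it ran."""
    if outdated(output, inputs):
        action()
        return True
    return False

with tempfile.TemporaryDirectory() as d:
    src = os.path.join(d, "raw.txt")
    dst = os.path.join(d, "clean.txt")
    with open(src, "w") as f:
        f.write("acgt")

    def clean():
        # A stand-in processing step: upper-case the raw data.
        with open(src) as fin, open(dst, "w") as fout:
            fout.write(fin.read().upper())

    ran_first = run_step(dst, [src], clean)    # output missing: step runs
    ran_second = run_step(dst, [src], clean)   # output up to date: skipped
```

Applied transitively over a dependency graph, this rule yields exactly the selective re-execution and resume-from-failure behaviour that makes such systems practical for long-running analyses.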

