Literate Data Analysis with Stata and Markdown

In this article, I introduce markstat, a command for combining Stata code and output with comments and annotations written in Markdown into a beautiful webpage or PDF file, thus encouraging literate programming and reproducible research. The command tangles the input, separating Stata and Markdown code; runs the Stata code; relies on Pandoc to process the Markdown code; and then weaves the outputs into a single file. HTML documents may include inline and display math using MathJax. Generating PDF output requires access to LaTeX and a style file from Stata but works with the same input file.
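The tangle and weave steps described in this abstract can be illustrated with a minimal Python sketch. This is not markstat's implementation: the rule that a line counts as code when it is indented by a tab or four spaces, the function names `is_code`, `tangle` and `weave`, and the `run` callback that stands in for executing Stata are all assumptions made for illustration.

```python
from typing import Callable, List, Tuple


def is_code(line: str) -> bool:
    # Assumption for this sketch: code lines start with a tab or four spaces.
    return line.startswith("\t") or line.startswith("    ")


def tangle(lines: List[str]) -> Tuple[List[str], List[str]]:
    """Separate a mixed document into Markdown lines and code lines."""
    markdown: List[str] = []
    code: List[str] = []
    for line in lines:
        if is_code(line):
            code.append(line.strip())
        else:
            markdown.append(line)
    return markdown, code


def weave(lines: List[str], run: Callable[[str], str]) -> List[str]:
    """Rebuild one document: Markdown passes through unchanged, and each
    code line is echoed and followed by the output of the `run` callback."""
    woven: List[str] = []
    for line in lines:
        if is_code(line):
            command = line.strip()
            woven.append("    . " + command)     # echo the command
            woven.append("    " + run(command))  # hypothetical executor output
        else:
            woven.append(line)
    return woven
```

In this sketch, calling `weave(doc, run)` with a `run` function that returns each command's output as a string yields a single list of lines, which could then be handed to a Markdown processor.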

Facilitating Collaborative Analysis in SWAN

EPJ Web of Conferences ◽

10.1051/epjconf/201921407022 ◽

2019 ◽

Vol 214 ◽

pp. 07022

Author(s):

Enrico Bocchi ◽

Diogo Castro ◽

Hugo Gonzalez ◽

Massimo Lamanna ◽

Pere Mato ◽

...

Keyword(s):

Data Analysis ◽

Special Kind ◽

Service Model ◽

Web Browser ◽

Web Based ◽

Single File ◽

Cloud Storage Service ◽

Collaborative Analysis ◽

Interactive Data ◽

The Ideal

SWAN (Service for Web-based ANalysis) is a CERN service that allows users to perform interactive data analysis in the cloud, in a “software as a service” model. It is built upon the widely used Jupyter notebooks, allowing users to write and run their data analysis using only a web browser. By connecting to SWAN, users have immediate access to the storage, software and computing resources that CERN provides and that they need to do their analyses. Besides providing an easier way of producing scientific code and results, SWAN is also a great tool to create shareable content. From results that need to be reproducible, to tutorials and demonstrations for outreach and teaching, Jupyter notebooks are the ideal way of distributing this content. In one single file, users can include their code, the results of the calculations and all the relevant textual information. Sharing these files allows others to visualize, modify, personalize or even re-run all the code. In that sense, this paper describes the efforts made to facilitate sharing in SWAN. Given the importance of collaboration in our scientific community, we have brought the sharing functionality from CERNBox, CERN’s cloud storage service, directly inside SWAN. SWAN users have available a new and redesigned interface where they can share “Projects”: a special kind of folder containing notebooks and other files, such as input datasets and images. When a user shares a Project with some other users, the latter can immediately see and work with the contents of that project from SWAN.

ColiCoords: A Python package for the analysis of bacterial fluorescence microscopy data

10.1101/608109 ◽

2019 ◽

Author(s):

Jochem H. Smit ◽

Yichen Li ◽

Eliza M. Warszawik ◽

Andreas Herrmann ◽

Thorben Cordes

Keyword(s):

Data Analysis ◽

Fluorescence Microscopy ◽

Single Molecule ◽

Statistical Significance ◽

Reproducible Research ◽

Cellular Processes ◽

Analysis Package ◽

Microscopy Data ◽

The Cost ◽

Single Molecule Localization Microscopy

Abstract Single-molecule fluorescence microscopy studies of bacteria provide unique insights into the mechanisms of cellular processes and protein machineries in ways that are unrivalled by any other technique. With the cost of microscopes dropping and the availability of fully automated microscopes, the volume of microscopy data produced has increased tremendously. These developments have moved the bottleneck of throughput from image acquisition and sample preparation to data analysis. Furthermore, requirements for analysis procedures have become more stringent given the requirement of various journals to make data and analysis procedures available. To address this we have developed a new data analysis package for analysis of fluorescence microscopy data of rod-like cells. Our software ColiCoords structures microscopy data at the single-cell level and implements a coordinate system describing each cell. This allows for the transformation of Cartesian coordinates of both cellular images (e.g. from transmission light or fluorescence microscopy) and single-molecule localization microscopy (SMLM) data to cellular coordinates. Using this transformation, many cells can be combined to increase the statistical significance of fluorescence microscopy datasets of any kind. ColiCoords is open source, implemented in the programming language Python, and is extensively documented. This allows for modifications for specific needs or to inspect and publish data analysis procedures. By providing a format that allows for easy sharing of code and associated data, we intend to promote open and reproducible research. The source code and documentation can be found via the project’s GitHub page.

Developing and deploying an integrated workshop curriculum teaching computational skills for reproducible research

10.1101/2021.06.15.448091 ◽

2021 ◽

Author(s):

Zena Lapp ◽

Kelly L Sovacool ◽

Nicholas A Lesniak ◽

Dana King ◽

Catherine Barnier ◽

...

Keyword(s):

Data Analysis ◽

Learning Objectives ◽

Version Control ◽

Reproducible Research ◽

Entry Barrier ◽

Individual Level ◽

New Curriculum ◽

Programming Skills ◽

R Programming ◽

Reporting Data

Inspired by well-established material and pedagogy provided by The Carpentries, we developed a two-day workshop curriculum that teaches introductory R programming for managing, analyzing, plotting and reporting data using packages from the tidyverse, the Unix shell, version control with git, and GitHub. While the official Software Carpentry curriculum is comprehensive, we found that it contains too much content for a two-day workshop. We also felt that the independent nature of the lessons left learners confused about how to integrate the newly acquired programming skills in their own work. Thus, we developed a new curriculum (https://umcarpentries.org/intro-curriculum-r/) that aims to teach novices how to implement reproducible research principles in their own data analysis. The curriculum integrates live coding lessons with individual-level and group-based practice exercises, and also serves as a succinct resource that learners can reference both during and after the workshop. Moreover, it lowers the entry barrier for new instructors as they do not have to develop their own teaching materials or sift through extensive content. We developed this curriculum during a two-day sprint, successfully used it to host a two-day virtual workshop with almost 40 participants, and updated the material based on instructor and learner feedback. We hope that our new curriculum will prove useful to future instructors interested in teaching workshops with similar learning objectives.

Accessible and reproducible mass spectrometry imaging data analysis in Galaxy

10.1101/628719 ◽

2019 ◽

Author(s):

Melanie Christine Föll ◽

Lennart Moritz ◽

Thomas Wollmann ◽

Maren Nicole Stillger ◽

Niklas Vockert ◽

...

Keyword(s):

Mass Spectrometry ◽

Data Analysis ◽

Mass Spectrometry Imaging ◽

Ease Of Use ◽

Reproducible Research ◽

Data Sets ◽

Imaging Data ◽

Available N ◽

The Galaxy ◽

Analysis Platform

Abstract Background Mass spectrometry imaging is increasingly used in biological and translational research as it has the ability to determine the spatial distribution of hundreds of analytes in a sample. Being at the interface of proteomics/metabolomics and imaging, the acquired data sets are large and complex and often analyzed with proprietary software or in-house scripts, which hinders reproducibility. Open source software solutions that enable reproducible data analysis often require programming skills and are therefore not accessible to many MSI researchers. Findings We have integrated 18 dedicated mass spectrometry imaging tools into the Galaxy framework to allow accessible, reproducible, and transparent data analysis. Our tools are based on Cardinal, MALDIquant, and scikit-image and enable all major MSI analysis steps such as quality control, visualization, preprocessing, statistical analysis, and image co-registration. Further, we created hands-on training material for use cases in proteomics and metabolomics. To demonstrate the utility of our tools, we re-analyzed a publicly available N-linked glycan imaging dataset. By providing the entire analysis history online, we highlight how the Galaxy framework fosters transparent and reproducible research. Conclusion The Galaxy framework has emerged as a powerful analysis platform for the analysis of MSI data with ease of use and access together with high levels of reproducibility and transparency.

Best Practices in Data Analysis and Sharing in Neuroimaging using MEEG

10.31219/osf.io/a8dhx ◽

2018 ◽

Cited By ~ 12

Author(s):

Cyril R Pernet ◽

Marta Garrido ◽

Alexandre Gramfort ◽

Natasha Maurits ◽

Christoph Michel ◽

...

Keyword(s):

Data Analysis ◽

Brain Function ◽

Best Practice ◽

Scientific Practice ◽

Open Science ◽

Reproducible Research ◽

Manuscript Review ◽

Non Invasive ◽

Practice Recommendations ◽

Data Practices

Non-invasive neuroimaging methods, including magnetoencephalography and electroencephalography (MEEG), have been critical in advancing the understanding of brain function in healthy people and in individuals with neurological or psychiatric disorders. Currently, scientific practice is undergoing a tremendous change, aiming to improve both research reproducibility and transparency in data collection, documentation and analysis, and in manuscript review. To advance the practice of open science, the Organization for Human Brain Mapping created the Committee on Best Practice in Data Analysis and Sharing (COBIDAS), which produced a report for MRI-based data in 2016. This effort continues with the OHBM’s COBIDAS MEEG committee whose task was to create a similar document that describes best practice recommendations for MEEG data. The document was drafted by OHBM experts in MEEG, with input from the world-wide brain imaging community, including OHBM members who volunteered to help with this effort, as well as Executive Committee members of the International Federation for Clinical Neurophysiology. This document outlines the principles of performing open and reproducible research in MEEG. Not all MEEG data practices are described in this document. Instead, we propose principles that we believe are current best practice for most recordings and common analyses. Furthermore, we suggest reporting guidelines for Authors that will enable others in the field to fully understand and potentially replicate any study. This document should be helpful to Authors, Reviewers of manuscripts, as well as Editors of neuroscience journals.

uap: reproducible and robust HTS data analysis

BMC Bioinformatics ◽

10.1186/s12859-019-3219-1 ◽

2019 ◽

Vol 20 (1) ◽

Cited By ~ 1

Author(s):

Christoph Kämpf ◽

Michael Specht ◽

Alexander Scholz ◽

Sven-Holger Puppel ◽

Gero Doose ◽

...

Keyword(s):

Data Analysis ◽

High Throughput Sequencing ◽

Workflow Management ◽

Work Flow ◽

Management Systems ◽

Reproducible Research ◽

Bioinformatic Tools ◽

Data Analyses ◽

Computational Research ◽

Work Flow Management

Abstract Background A lack of reproducibility has been repeatedly criticized in computational research. High throughput sequencing (HTS) data analysis is a complex multi-step process. For most of the steps a range of bioinformatic tools is available and for most tools manifold parameters need to be set. Due to this complexity, HTS data analysis is particularly prone to reproducibility and consistency issues. We have defined four criteria that in our opinion ensure a minimal degree of reproducible research for HTS data analysis. A series of workflow management systems is available for assisting complex multi-step data analyses. However, to the best of our knowledge, none of the currently available work flow management systems satisfies all four criteria for reproducible HTS analysis. Results Here we present uap, a workflow management system dedicated to robust, consistent, and reproducible HTS data analysis. uap is optimized for the application to omics data, but can be easily extended to other complex analyses. It is available under the GNU GPL v3 license at https://github.com/yigbt/uap. Conclusions uap is a freely available tool that enables researchers to easily adhere to reproducible research principles for HTS data analyses.

Accessible and reproducible mass spectrometry imaging data analysis in Galaxy

GigaScience ◽

10.1093/gigascience/giz143 ◽

2019 ◽

Vol 8 (12) ◽

Cited By ~ 3

Author(s):

Melanie Christine Föll ◽

Lennart Moritz ◽

Thomas Wollmann ◽

Maren Nicole Stillger ◽

Niklas Vockert ◽

...

Keyword(s):

Mass Spectrometry ◽

Data Analysis ◽

Mass Spectrometry Imaging ◽

Ease Of Use ◽

Reproducible Research ◽

Imaging Data ◽

Available N ◽

The Galaxy ◽

Msi Analysis ◽

Analysis Platform

Abstract Background Mass spectrometry imaging is increasingly used in biological and translational research because it has the ability to determine the spatial distribution of hundreds of analytes in a sample. Being at the interface of proteomics/metabolomics and imaging, the acquired datasets are large and complex and often analyzed with proprietary software or in-house scripts, which hinders reproducibility. Open source software solutions that enable reproducible data analysis often require programming skills and are therefore not accessible to many mass spectrometry imaging (MSI) researchers. Findings We have integrated 18 dedicated mass spectrometry imaging tools into the Galaxy framework to allow accessible, reproducible, and transparent data analysis. Our tools are based on Cardinal, MALDIquant, and scikit-image and enable all major MSI analysis steps such as quality control, visualization, preprocessing, statistical analysis, and image co-registration. Furthermore, we created hands-on training material for use cases in proteomics and metabolomics. To demonstrate the utility of our tools, we re-analyzed a publicly available N-linked glycan imaging dataset. By providing the entire analysis history online, we highlight how the Galaxy framework fosters transparent and reproducible research. Conclusion The Galaxy framework has emerged as a powerful analysis platform for the analysis of MSI data with ease of use and access, together with high levels of reproducibility and transparency.

Reproducible research and GIScience: an evaluation using AGILE conference papers

10.7287/peerj.preprints.26561v1 ◽

2018 ◽

Author(s):

Daniel Nüst ◽

Carlos Granell ◽

Barbara Hofer ◽

Markus Konkol ◽

Frank O Ostermann ◽

...

Keyword(s):

Data Analysis ◽

Information Science ◽

Geographic Information Science ◽

Geographic Information ◽

Reproducible Research ◽

Conference Series ◽

Current State ◽

Software Skills ◽

Computational Research ◽

Conference Papers

The demand for reproducibility of research is on the rise in disciplines concerned with data analysis and computational methods. In this work existing recommendations for reproducible research are reviewed and translated into criteria for assessing reproducibility of articles in the field of geographic information science (GIScience). Using a sample of GIScience research from the Association of Geographic Information Laboratories in Europe (AGILE) conference series, we assess the current state of reproducibility of publications in this field. Feedback on the assessment was collected by surveying the authors of the sample papers. The results show the reproducibility levels are low. Although authors support the ideals, the incentives are too small. Therefore we propose concrete actions for individual researchers and the AGILE conference series to improve transparency and reproducibility, such as imparting data and software skills, an award, paper badges, author guidelines for computational research, and Open Access publications.

uap: Reproducible and Robust HTS Data Analysis

10.1101/690438 ◽

2019 ◽

Author(s):

Christoph Kämpf ◽

Michael Specht ◽

Alexander Scholz ◽

Sven-Holger Puppel ◽

Gero Doose ◽

...

Keyword(s):

Data Analysis ◽

High Throughput Sequencing ◽

Workflow Management ◽

Work Flow ◽

Management Systems ◽

Reproducible Research ◽

Bioinformatic Tools ◽

Data Analyses ◽

Computational Research ◽

Work Flow Management

Abstract Background A lack of reproducibility has been repeatedly criticized in computational research. High throughput sequencing (HTS) data analysis is a complex multi-step process. For most of the steps a range of bioinformatic tools is available and for most tools manifold parameters need to be set. Due to this complexity, HTS data analysis is particularly prone to reproducibility and consistency issues. We have defined four criteria that in our opinion ensure a minimal degree of reproducible research for HTS data analysis. A series of workflow management systems is available for assisting complex multi-step data analyses. However, to the best of our knowledge, none of the currently available work flow management systems satisfies all four criteria for reproducible HTS analysis. Results Here we present uap, a workflow management system dedicated to robust, consistent, and reproducible HTS data analysis. uap is optimized for the application to omics data, but can be easily extended to other complex analyses. It is available under the GNU GPL v3 license at https://github.com/yigbt/uap. Conclusions uap is a freely available tool that enables researchers to easily adhere to reproducible research principles for HTS data analyses.

Reproducible Research for Large-Scale Data Analysis

Implementing Reproducible Research ◽

10.1201/9781315373461-8 ◽

2018 ◽

pp. 219-239

Author(s):

Holger Hoefling ◽

Anthony Rossini

Keyword(s):

Data Analysis ◽

Large Scale ◽

Reproducible Research ◽

Large Scale Data ◽

Scale Data
