Streamlining data-intensive biology with workflow systems

GigaScience ◽  
2021 ◽  
Vol 10 (1) ◽  
Author(s):  
Taylor Reiter ◽  
Phillip T Brooks† ◽  
Luiz Irber† ◽  
Shannon E K Joslin† ◽  
Charles M Reid† ◽  
...  

Abstract: As the scale of biological data generation has increased, the bottleneck of research has shifted from data generation to analysis. Researchers commonly need to build computational workflows that include multiple analytic tools and require incremental development as experimental insights demand tool and parameter modifications. These workflows can produce hundreds to thousands of intermediate files and results that must be integrated for biological insight. Data-centric workflow systems that internally manage computational resources, software, and conditional execution of analysis steps are reshaping the landscape of biological data analysis and empowering researchers to conduct reproducible analyses at scale. Adoption of these tools can facilitate and expedite robust data analysis, but knowledge of these techniques is still lacking. Here, we provide a series of strategies for leveraging workflow systems with structured project, data, and resource management to streamline large-scale biological analysis. We present these practices in the context of high-throughput sequencing data analysis, but the principles are broadly applicable to biologists working beyond this field.

Author Summary: We present a guide for workflow-enabled biological sequence data analysis, developed through our own teaching, training, and analysis projects. We recognize that it is based on our own use cases and experiences, but we hope that it will contribute to a larger discussion within the open source and open science communities and lead to more comprehensive resources. Our main goal is to accelerate the research of scientists conducting sequence analyses by introducing them to organized workflow practices that not only benefit their own research but also facilitate open and reproducible science.
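The core mechanism behind such data-centric workflow systems is declaring each analysis step by its inputs and outputs, so the system can decide which steps must run and which results are already up to date. The sketch below illustrates that idea in plain Python with hypothetical file names and placeholder "analysis" functions; it is a conceptual illustration only, not any particular workflow engine discussed in the paper.

# Minimal sketch of the input/output-driven execution model used by
# data-centric workflow systems. All file names and steps are placeholders.
import os

def needs_update(inputs, outputs):
    """A step is stale if any output is missing or older than any input."""
    if not all(os.path.exists(o) for o in outputs):
        return True
    oldest_output = min(os.path.getmtime(o) for o in outputs)
    newest_input = max(os.path.getmtime(i) for i in inputs)
    return newest_input > oldest_output

def run_step(name, inputs, outputs, action):
    """Run an analysis step only if its outputs are missing or out of date."""
    if needs_update(inputs, outputs):
        print(f"[run ] {name}")
        action()
    else:
        print(f"[skip] {name} (up to date)")

# Placeholder steps: in a real workflow these would invoke trimming,
# alignment, or assembly tools rather than shuffling text between files.
def fake_trim():
    with open("sample.fastq") as fin, open("sample.trimmed.fastq", "w") as fout:
        fout.write(fin.read())

def fake_assemble():
    with open("sample.trimmed.fastq") as fin, open("sample.contigs.fasta", "w") as fout:
        seqs = [line.strip() for i, line in enumerate(fin) if i % 4 == 1]
        fout.write(">contig1\n" + "".join(seqs) + "\n")

# Create a tiny input file so the example runs end to end.
if not os.path.exists("sample.fastq"):
    with open("sample.fastq", "w") as fh:
        fh.write("@read1\nACGTACGT\n+\nIIIIIIII\n")

run_step("trim", ["sample.fastq"], ["sample.trimmed.fastq"], fake_trim)
run_step("assemble", ["sample.trimmed.fastq"], ["sample.contigs.fasta"], fake_assemble)
# Re-running the script skips both steps until sample.fastq changes.

Real workflow systems build on this skeleton by also managing software environments, cluster or cloud resources, and conditional branching, as the abstract describes.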


Genomics ◽  
2017 ◽  
Vol 109 (2) ◽  
pp. 83-90 ◽  
Author(s):  
Yan Guo ◽  
Yulin Dai ◽  
Hui Yu ◽  
Shilin Zhao ◽  
David C. Samuels ◽  
...  

2020 ◽  
Vol 36 (12) ◽  
pp. 3632-3636 ◽  
Author(s):  
Weibo Zheng ◽  
Jing Chen ◽  
Thomas G Doak ◽  
Weibo Song ◽  
Ying Yan

Abstract
Motivation: Programmed DNA elimination (PDE) plays a crucial role in the transitions between germline and somatic genomes in diverse organisms, ranging from unicellular ciliates to multicellular nematodes. However, software specifically designed to detect DNA splicing events is scarce. Here we describe Accurate Deletion Finder (ADFinder), an efficient detector of PDEs from high-throughput sequencing data. ADFinder can predict PDEs at relatively low sequencing coverage, detect multiple alternative splicing forms at the same genomic location, and calculate the frequency of each splicing event. This software will facilitate research on PDEs and all downstream analyses.
Results: By analyzing genome-wide DNA splicing events in the micronuclear genomes of Oxytricha trifallax and Tetrahymena thermophila, we show that ADFinder is effective in predicting large-scale PDEs.
Availability and implementation: The source code and manual of ADFinder are available at our GitHub repository: https://github.com/weibozheng/ADFinder.
Supplementary information: Supplementary data are available at Bioinformatics online.
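To make the kind of output described above concrete, the following sketch tallies deletion events and their frequencies from gapped read alignments. It is a simplified illustration with made-up alignment tuples; it is not ADFinder's actual algorithm or data model.

# Toy frequency calculation for deletion (DNA-elimination-like) events
# inferred from gapped alignments. All alignments below are hypothetical.
from collections import defaultdict

# Each alignment: (ref_start, ref_end, deletion), where deletion is None for a
# contiguous match or a (del_start, del_end) span skipped on the reference.
alignments = [
    (100, 260, (150, 200)),
    (120, 270, (150, 200)),
    (90, 240, None),
    (130, 300, (150, 210)),   # alternative splicing form at the same locus
]

deletion_support = defaultdict(int)   # (del_start, del_end) -> supporting reads
spanning_reads = defaultdict(int)     # (del_start, del_end) -> reads covering the locus

for ref_start, ref_end, deletion in alignments:
    if deletion is not None:
        deletion_support[deletion] += 1

for del_start, del_end in deletion_support:
    for ref_start, ref_end, _ in alignments:
        # A read "spans" the event if it covers both deletion boundaries.
        if ref_start <= del_start and ref_end >= del_end:
            spanning_reads[(del_start, del_end)] += 1

for span, support in deletion_support.items():
    freq = support / spanning_reads[span]
    print(f"deletion {span}: {support}/{spanning_reads[span]} reads, frequency {freq:.2f}")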


2020 ◽  
Author(s):  
Charles Hadley S. King ◽  
Jonathon Keeney ◽  
Nuria Guimera ◽  
Souvik Das ◽  
Brian Fochtman ◽  
...  

Abstract: For regulatory submissions of next-generation sequencing (NGS) data, it is vital for the analysis workflow to be robust, reproducible, and understandable. This project demonstrates that use of the IEEE 2791-2020 Standard (BioCompute Objects, or BCOs) enables complete and concise communication of NGS data analysis results. One arm of a clinical trial was replicated using synthetically generated data made to resemble real biological data. Two separate, independent analyses were then carried out using BCOs as the tool for communicating the analysis: one to simulate a pharmaceutical regulatory submission to the FDA, and another to simulate the FDA review. The two results were compared and tabulated for concordance analysis: of the 118 simulated patient samples generated, the final results of 117 (99.15%) were in agreement. This high concordance rate demonstrates the ability of a BCO, when a verification kit is included, to effectively capture and clearly communicate NGS analyses within regulatory submissions. BCOs promote transparency and reproducibility, thereby reinforcing trust in the regulatory submission process.
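For readers unfamiliar with the standard, a BioCompute Object is essentially a structured JSON record of an analysis. The sketch below assembles a minimal skeleton using the top-level domains defined by IEEE 2791-2020 as we understand them; every value is a placeholder, and none of it reflects the submission simulated in this study.

# Minimal, illustrative BioCompute Object skeleton; verify field details
# against the official IEEE 2791-2020 schema before real use.
import json

bco = {
    "object_id": "https://example.org/BCO_000001",   # placeholder identifier
    "spec_version": "https://w3id.org/ieee/ieee-2791-schema/2791object.json",  # per our reading of the spec
    "etag": "placeholder-checksum",
    "provenance_domain": {
        "name": "Example NGS variant-calling workflow",
        "version": "1.0",
        "created": "2021-01-01T00:00:00Z",
        "contributors": [{"name": "Jane Doe", "contribution": ["createdBy"]}],
    },
    "usability_domain": [
        "Describes the analysis pipeline so a reviewer can re-run and verify it."
    ],
    "description_domain": {
        "pipeline_steps": [
            {"step_number": 1, "name": "read alignment", "version": "0.7.17"},
            {"step_number": 2, "name": "variant calling", "version": "4.1"},
        ]
    },
    "execution_domain": {
        "script": ["run_pipeline.sh"],               # hypothetical entry point
        "script_driver": "shell",
        "software_prerequisites": [],
        "external_data_endpoints": [],
        "environment_variables": {},
    },
    "io_domain": {
        "input_subdomain": [{"uri": {"uri": "file://input/sample.fastq"}}],
        "output_subdomain": [{"uri": {"uri": "file://output/variants.vcf"},
                              "mediatype": "text/vcf"}],
    },
    "parametric_domain": [],
    "error_domain": {"empirical_error": {}, "algorithmic_error": {}},
}

print(json.dumps(bco, indent=2))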


PLoS ONE ◽  
2014 ◽  
Vol 9 (1) ◽  
pp. e85879 ◽  
Author(s):  
Fabrice P. A. David ◽  
Julien Delafontaine ◽  
Solenne Carat ◽  
Frederick J. Ross ◽  
Gregory Lefebvre ◽  
...  

PLoS ONE ◽  
2019 ◽  
Vol 14 (10) ◽  
pp. e0222512
Author(s):  
Edoardo Morandi ◽  
Matteo Cereda ◽  
Danny Incarnato ◽  
Caterina Parlato ◽  
Giulia Basile ◽  
...  

mBio ◽  
2021 ◽  
Author(s):  
Alexander S. F. Berry ◽  
Camila Farias Amorim ◽  
Corbett L. Berry ◽  
Camille M. Syrett ◽  
Elise D. English ◽  
...  

As access to high-throughput sequencing technology has increased, the bottleneck in biomedical research has shifted from data generation to data analysis. Here, we describe a modular and extensible framework for didactic instruction in bioinformatics using publicly available RNA sequencing data sets from infectious disease studies, with a focus on host-parasite interactions.


Author(s):  
Bahar Alipanahi ◽  
Alan Kuhnle ◽  
Simon J. Puglisi ◽  
Leena Salmela ◽  
Christina Boucher

Abstract
Motivation: The de Bruijn graph is one of the fundamental data structures for the analysis of high-throughput sequencing data. To be applicable to population-scale studies, it is essential to build and store the graph in a space- and time-efficient manner. In addition, because population studies are ever-changing, it has become essential to update the graph after construction, e.g., to add and remove nodes and edges. Although there has been substantial effort on making construction and storage of the graph efficient, there is limited work on building the graph in an efficient and mutable manner; hence, most space-efficient data structures require complete reconstruction of the graph in order to add or remove edges or nodes.
Results: In this paper we present DynamicBOSS, a succinct representation of the de Bruijn graph that allows for an unlimited number of additions and deletions of nodes and edges. We compare our method with other competing methods and demonstrate that DynamicBOSS is the only method that supports both addition and deletion and is applicable to very large samples (e.g., greater than 15 billion k-mers). Competing dynamic methods either cannot be constructed on large-scale datasets, e.g., FDBG (Crawford et al., 2018), or cannot support both addition and deletion, e.g., BiFrost (Holley and Melsted, 2019).
Availability: DynamicBOSS is publicly available at https://github.com/baharpan/
Contact: [email protected]
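As a conceptual reference point for what "dynamic" means here, the toy class below stores a de Bruijn graph as an ordinary hash set of k-mers that can be grown and pruned after construction. It illustrates the add/delete operations only; it is nothing like the succinct BOSS representation that makes DynamicBOSS space-efficient.

# Toy dynamic de Bruijn graph: nodes are k-mers, edges are implied by
# (k-1)-overlaps. Purely illustrative; not a succinct data structure.
class ToyDeBruijnGraph:
    def __init__(self, k=4):
        self.k = k
        self.kmers = set()

    def add_sequence(self, seq):
        """Insert every k-mer of a read; the graph grows incrementally."""
        for i in range(len(seq) - self.k + 1):
            self.kmers.add(seq[i:i + self.k])

    def delete_kmer(self, kmer):
        """Remove a node (e.g., a k-mer dropped from the sample) without rebuilding."""
        self.kmers.discard(kmer)

    def successors(self, kmer):
        """Neighbors: k-mers whose (k-1)-prefix equals this k-mer's (k-1)-suffix."""
        suffix = kmer[1:]
        return [suffix + base for base in "ACGT" if suffix + base in self.kmers]

g = ToyDeBruijnGraph()
g.add_sequence("ACGTACGA")
print(sorted(g.kmers))        # nodes after construction
print(g.successors("ACGT"))   # ['CGTA']
g.delete_kmer("CGTA")         # deletion after construction
print(g.successors("ACGT"))   # []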


2018 ◽  
Author(s):  
Levent Albayrak ◽  
Kamil Khanipov ◽  
George Golovko ◽  
Yuriy Fofanov

Abstract
Motivation: The data-generation capabilities of high-throughput sequencing (HTS) instruments have increased exponentially over the last few years, while the cost of sequencing has decreased dramatically, allowing the technology to become widely used in biomedical studies. For small labs and individual researchers, however, storage and transfer of large amounts of HTS data present a significant challenge. Recent trends toward higher sequencing quality and genome coverage make it worth reconsidering HTS data storage strategies.
Results: We present Broom, a stand-alone application designed to select and store only high-quality sequencing reads at extremely high compression rates. Written in C++, the application accepts single- and paired-end reads in FASTQ and FASTA formats and decompresses data to FASTA format.
Availability: C++ code is available at https://scsb.utmb.edu/labgroups/fofanov/
Contact: [email protected]
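The general strategy the abstract describes (keep only reads that pass a quality threshold, then store them compactly without quality strings) can be sketched as follows. The threshold, function names, and gzip output are illustrative assumptions and have no relation to Broom's C++ implementation or its compression format.

# Sketch of quality filtering plus FASTQ-to-compressed-FASTA conversion.
# Threshold and file names are assumptions for illustration only.
import gzip

MIN_MEAN_QUALITY = 30   # assumed mean-Phred cutoff

def mean_phred(quality_string):
    """Mean Phred score from a Sanger-encoded (ASCII offset 33) quality string."""
    return sum(ord(c) - 33 for c in quality_string) / len(quality_string)

def filter_fastq_to_fasta_gz(fastq_path, out_path, min_q=MIN_MEAN_QUALITY):
    """Keep reads whose mean quality passes min_q; write gzip-compressed FASTA."""
    kept = 0
    with open(fastq_path) as fin, gzip.open(out_path, "wt") as fout:
        while True:
            header = fin.readline().rstrip()
            if not header:
                break
            seq = fin.readline().rstrip()
            fin.readline()                      # '+' separator line
            qual = fin.readline().rstrip()
            if mean_phred(qual) >= min_q:
                fout.write(f">{header[1:]}\n{seq}\n")
                kept += 1
    return kept

# Hypothetical usage:
# n = filter_fastq_to_fasta_gz("sample.fastq", "sample.filtered.fasta.gz")
# print(f"kept {n} high-quality reads")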

