elPrep: A multithreaded framework for sequence analysis

2018 ◽  
Author(s):  
Charlotte Herzeel ◽  
Pascal Costanza ◽  
Dries Decap ◽  
Jan Fostier ◽  
Wilfried Verachtert

We present elPrep 4, a reimplementation from scratch of the elPrep framework for processing sequence alignment map files in the Go programming language. elPrep 4 includes multiple new features allowing us to process all of the preparation steps defined by the GATK Best Practice pipelines for variant calling. This includes new and improved functionality for sorting, (optical) duplicate marking, base quality score recalibration, BED and VCF parsing, and various filtering options. The implementations of these options in elPrep 4 faithfully reproduce the outcomes of their counterparts in GATK 4, SAMtools, and Picard, even though the underlying algorithms are redesigned to take advantage of elPrep's parallel execution framework to vastly improve the runtime and resource use compared to these tools. Our benchmarks show that elPrep executes the preparation steps of the GATK Best Practices up to 13x faster on WES data and up to 7.4x faster on WGS data compared to running the same pipeline with GATK 4, while utilizing fewer compute resources.
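
The parallel execution framework mentioned above composes per-read operations into filters that run concurrently over the alignment stream. The Go sketch below illustrates that general pattern under invented types; the Read struct and Filter signature are hypothetical stand-ins, not elPrep's actual API.

```go
// Minimal sketch of a parallel record-filtering pipeline, in the spirit
// of a multithreaded SAM-processing framework; all types are invented.
package main

import (
	"fmt"
	"sync"
)

// Read stands in for a SAM alignment record (simplified).
type Read struct {
	Name string
	MapQ int
	Flag int
}

// Filter transforms a read; returning false drops it from the stream.
type Filter func(*Read) bool

// run fans records out to workers, applies the filter chain to each
// record independently, and collects the survivors.
func run(in []Read, filters []Filter, workers int) []Read {
	jobs := make(chan Read)
	results := make(chan Read)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for r := range jobs {
				keep := true
				for _, f := range filters {
					if !f(&r) {
						keep = false
						break
					}
				}
				if keep {
					results <- r
				}
			}
		}()
	}
	go func() {
		for _, r := range in {
			jobs <- r
		}
		close(jobs)
		wg.Wait()
		close(results)
	}()
	var out []Read
	for r := range results {
		out = append(out, r)
	}
	return out
}

func main() {
	reads := []Read{{"r1", 60, 0}, {"r2", 0, 4}, {"r3", 30, 0}}
	// Drop unmapped reads (flag bit 0x4) and low-mapping-quality reads.
	filters := []Filter{
		func(r *Read) bool { return r.Flag&0x4 == 0 },
		func(r *Read) bool { return r.MapQ >= 20 },
	}
	for _, r := range run(reads, filters, 4) {
		fmt.Println(r.Name)
	}
}
```

Because each record is processed independently, throughput scales with the number of workers until input parsing or memory bandwidth becomes the bottleneck.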


PLoS ONE ◽  
2021 ◽  
Vol 16 (2) ◽  
pp. e0244471
Author(s):  
Charlotte Herzeel ◽  
Pascal Costanza ◽  
Dries Decap ◽  
Jan Fostier ◽  
Roel Wuyts ◽  
...  

We present elPrep 5, which updates the elPrep framework for processing sequencing alignment/map files with variant calling. elPrep 5 can now execute the full pipeline described by the GATK Best Practices for variant calling, which consists of PCR and optical duplicate marking, sorting by coordinate order, base quality score recalibration, and variant calling using the haplotype caller algorithm. elPrep 5 produces the same BAM and VCF output as GATK 4 while significantly reducing the runtime by parallelizing and merging the execution of the pipeline steps. Our benchmarks show that elPrep 5 speeds up the variant calling pipeline by a factor of 8-16x on both whole-exome and whole-genome data while using the same hardware resources as GATK 4. This makes elPrep 5 a suitable drop-in replacement for GATK 4 when faster execution times are needed.
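
One way to read "merging the execution of the pipeline steps" is that every per-read phase is driven by a single traversal of the data rather than one pass per tool. The sketch below illustrates that idea with invented Phase types; it is not elPrep's actual architecture, and the duplicate-marking criterion is deliberately toy.

```go
// Illustrative sketch of merging pipeline phases into a single pass over
// the data; the Phase interface and phases shown are hypothetical.
package main

import "fmt"

type Read struct {
	Name string
	Pos  int
	Dup  bool
}

// Phase sees every read exactly once; Finish runs after the single pass.
type Phase interface {
	Process(r *Read)
	Finish()
}

// markDuplicates flags reads that share a start position (toy criterion;
// real duplicate marking also compares library, strand, and more).
type markDuplicates struct{ seen map[int]bool }

func (m *markDuplicates) Process(r *Read) {
	if m.seen[r.Pos] {
		r.Dup = true
	}
	m.seen[r.Pos] = true
}
func (m *markDuplicates) Finish() {}

// countReads stands in for a phase that accumulates statistics,
// e.g. the tallies behind base quality score recalibration.
type countReads struct{ n int }

func (c *countReads) Process(r *Read) { c.n++ }
func (c *countReads) Finish()         { fmt.Println("reads seen:", c.n) }

func main() {
	reads := []Read{{"r1", 100, false}, {"r2", 100, false}, {"r3", 200, false}}
	phases := []Phase{&markDuplicates{seen: map[int]bool{}}, &countReads{}}
	// One traversal drives every phase, instead of one pass per tool.
	for i := range reads {
		for _, p := range phases {
			p.Process(&reads[i])
		}
	}
	for _, p := range phases {
		p.Finish()
	}
	for _, r := range reads {
		fmt.Println(r.Name, "duplicate:", r.Dup)
	}
}
```

Driving all phases from one loop is what lets a merged pipeline avoid re-reading and re-writing intermediate BAM files between steps.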


2020 ◽  
Vol 12 (1) ◽  
Author(s):  
Daniel C. Koboldt

Abstract Next-generation sequencing technologies have enabled a dramatic expansion of clinical genetic testing both for inherited conditions and diseases such as cancer. Accurate variant calling in NGS data is a critical step upon which virtually all downstream analysis and interpretation processes rely. Just as NGS technologies have evolved considerably over the past 10 years, so too have the software tools and approaches for detecting sequence variants in clinical samples. In this review, I discuss the current best practices for variant calling in clinical sequencing studies, with a particular emphasis on trio sequencing for inherited disorders and somatic mutation detection in cancer patients. I describe the relative strengths and weaknesses of panel, exome, and whole-genome sequencing for variant detection. Recommended tools and strategies for calling variants of different classes are also provided, along with guidance on variant review, validation, and benchmarking to ensure optimal performance. Although NGS technologies are continually evolving, and new capabilities (such as long-read single-molecule sequencing) are emerging, the “best practice” principles in this review should be relevant to clinical variant calling in the long term.
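
As a concrete illustration of the trio analysis mentioned above, a candidate de novo variant is a child allele observed in neither parent. The Go sketch below encodes that check under simplified assumptions (clean diploid genotype calls, no genotype likelihoods or depth filters); all types and names are hypothetical.

```go
// Toy check for candidate de novo variants in a sequenced trio: flag a
// child allele seen in neither parent. Real pipelines also weigh genotype
// likelihoods, depth, and strand evidence.
package main

import "fmt"

type Genotype [2]string // two alleles at one site, e.g. {"A", "G"}

func hasAllele(g Genotype, allele string) bool {
	return g[0] == allele || g[1] == allele
}

// candidateDeNovo reports child alleles absent from both parents.
func candidateDeNovo(child, mother, father Genotype) []string {
	var novel []string
	for _, a := range child {
		if !hasAllele(mother, a) && !hasAllele(father, a) {
			novel = append(novel, a)
		}
	}
	return novel
}

func main() {
	mother := Genotype{"A", "A"}
	father := Genotype{"A", "G"}
	child := Genotype{"A", "T"} // "T" occurs in neither parent
	fmt.Println("candidate de novo alleles:", candidateDeNovo(child, mother, father))
}
```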


2017 ◽  
Author(s):  
Jade C.S. Chung ◽  
Swaine L. Chen

Abstract Next-generation sequencing data is accompanied by quality scores that quantify sequencing error. Inaccuracies in these quality scores propagate through all subsequent analyses; thus base quality score recalibration is a standard step in many next-generation sequencing workflows, resulting in improved variant calls. Current base quality score recalibration algorithms rely on the assumption that sequencing errors are already known; for human resequencing data, relatively complete variant databases facilitate this. However, because existing databases are still incomplete, recalibration remains inaccurate, and most organisms have no variant database at all, exacerbating the inaccuracy for non-human data. To overcome these logical and practical problems, we introduce Lacer, which recalibrates base quality scores without assuming knowledge of correct and incorrect bases and without requiring knowledge of common variants. Lacer is the first logically sound, fully general, and truly accurate base recalibrator. Lacer enhances variant identification accuracy for resequencing data of humans as well as other organisms (which are not accessible to current recalibrators), simultaneously improving and extending the benefits of base quality score recalibration to nearly all ongoing sequencing projects. Lacer is available at: https://github.com/swainechen/lacer.
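
For context, a Phred quality score encodes an error probability as Q = -10 * log10(p). Recalibration compares each reported score against the error rate empirically observed among bases that reported it. The sketch below shows that core arithmetic with invented counts; real recalibrators stratify further (read group, machine cycle, sequence context) and, in database-driven approaches, exclude known variant sites from the error tally.

```go
// Sketch of the arithmetic at the heart of base quality score
// recalibration: compare each reported Phred quality with the error rate
// actually observed at bases reporting that quality. Counts are made up.
package main

import (
	"fmt"
	"math"
)

// empiricalPhred converts an observed error rate into a Phred score,
// with +1/+2 smoothing so zero observed errors stay finite.
func empiricalPhred(errors, observations int64) float64 {
	p := (float64(errors) + 1) / (float64(observations) + 2)
	return -10 * math.Log10(p)
}

func main() {
	// Observations and mismatches per reported quality bin (hypothetical).
	type bin struct {
		reportedQ int
		obs, errs int64
	}
	bins := []bin{
		{20, 1_000_000, 15_000}, // observed error ~1.5%, reported 1%
		{30, 2_000_000, 1_000},  // observed error ~0.05%, reported 0.1%
	}
	for _, b := range bins {
		fmt.Printf("reported Q%d -> empirical Q%.1f\n",
			b.reportedQ, empiricalPhred(b.errs, b.obs))
	}
}
```

With these invented counts, a base reported at Q20 turns out to behave like roughly Q18, so a recalibrator would adjust its score downward.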


2020 ◽  
Author(s):  
Olga Krasheninina ◽  
Yih-Chii Hwang ◽  
Xiaodong Bai ◽  
Aleksandra Zalcman ◽  
Evan Maxwell ◽  
...  

Abstract Standardized genome informatics protocols minimize reprocessing costs and facilitate harmonization across studies if implemented in a transparent, accessible and reproducible manner. Here we define the OQFE protocol, a lossless read-mapping protocol that retains key features of existing NGS standard methods. We demonstrate that variants can be called directly from NovaSeq OQFE data without the need for base quality score recalibration and describe a large-scale variant calling protocol for OQFE data. The OQFE protocol is open-source and a containerized implementation is provided.
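
"Lossless" here means the original reads are exactly recoverable from the aligned output. The toy Go sketch below shows the round-trip for the strand convention used in SAM/BAM, where reverse-strand reads are stored reverse-complemented; the record type is a simplification, not the OQFE format itself.

```go
// Toy illustration of what lossless read mapping promises: the original
// read is exactly recoverable from the stored aligned record.
package main

import "fmt"

type alignedRecord struct {
	Name          string
	Seq           string
	ReverseStrand bool
}

func revComp(s string) string {
	comp := map[byte]byte{'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C', 'N': 'N'}
	out := make([]byte, len(s))
	for i := 0; i < len(s); i++ {
		out[len(s)-1-i] = comp[s[i]]
	}
	return string(out)
}

// originalRead recovers the as-sequenced read from the stored record.
func originalRead(r alignedRecord) string {
	if r.ReverseStrand {
		return revComp(r.Seq)
	}
	return r.Seq
}

func main() {
	original := "ACGTTAGC"
	stored := alignedRecord{Name: "r1", Seq: revComp(original), ReverseStrand: true}
	recovered := originalRead(stored)
	fmt.Println("round-trip lossless:", recovered == original)
}
```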


2021 ◽  
Author(s):  
Jochen Bathke ◽  
Gesine Lühken

Background Next-generation sequencing technologies are opening new doors to researchers. One application is the direct discovery of sequence variants that are causative for a phenotypic trait or a disease. The detection of an organism's deviations from a reference genome is known as variant calling, a computational task involving a complex chain of software applications. One key player in the field is the Genome Analysis Toolkit (GATK). The GATK Best Practices are a commonly cited recipe for variant calling on human sequencing data. Yet the fact that the Best Practices are highly specialized for human sequencing data and are constantly evolving is often ignored. This hampers reproducibility and leads to the continual reinvention of purported GATK Best Practice workflows. Results Here we present an automated variant calling workflow for the detection of SNPs and indels that is broadly applicable to model as well as non-model diploid organisms. It is derived from the GATK Best Practice workflow for "Germline short variant discovery", without being restricted to human sequencing data. The workflow has been extensively optimized to parallelize data evaluation and to maximize the performance of individual applications, shortening the overall analysis time. Optimized Java garbage collection and heap size settings for the GATK applications SortSam, MarkDuplicates, HaplotypeCaller, and GatherVcfs were determined by thorough benchmarking. In doing so, the runtime of an example data evaluation was reduced from 67 h to less than 35 h. Conclusions The demand for standardized variant calling workflows is growing in proportion to the dropping costs of next-generation sequencing methods. Our workflow fits this niche, offering automation, reproducibility, and documentation of the variant calling process, while keeping resource usage to a minimum. Variant calling projects should thereby become more standardized, further lowering the barrier for smaller institutions or groups.
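
GATK tools run on the JVM, so heap size and garbage collector settings are passed through the gatk wrapper's --java-options flag. The Go sketch below launches a tool that way; the specific -Xmx and GC values are illustrative placeholders, not the benchmark-derived settings from the paper, and the file paths are invented.

```go
// Hedged sketch of launching a GATK tool with tuned JVM settings from Go.
// gatk's --java-options flag and the -Xmx/-XX JVM options are standard;
// the values and paths below are examples only.
package main

import (
	"log"
	"os"
	"os/exec"
)

func main() {
	cmd := exec.Command("gatk",
		// JVM settings: 8 GiB heap, parallel collector with 4 GC threads
		// (example values; tune per tool and per machine).
		"--java-options", "-Xmx8g -XX:+UseParallelGC -XX:ParallelGCThreads=4",
		"MarkDuplicates",
		"-I", "input.bam", // placeholder paths
		"-O", "dedup.bam",
		"-M", "dup_metrics.txt",
	)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Run(); err != nil {
		log.Fatalf("gatk failed: %v", err)
	}
}
```

Equivalent per-tool tuning applies to the other applications the authors benchmark (SortSam, HaplotypeCaller, GatherVcfs).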


Symmetry ◽  
2020 ◽  
Vol 13 (1) ◽  
pp. 9
Author(s):  
John H. Graham

Best practices in studies of developmental instability, as measured by fluctuating asymmetry, have developed over the past 60 years. Unfortunately, they are haphazardly applied in many of the papers submitted for review. Most often, research designs suffer from lack of randomization, inadequate replication, poor attention to size scaling, lack of attention to measurement error, and unrecognized mixtures of additive and multiplicative errors. Here, I summarize a set of best practices, especially in studies that examine the effects of environmental stress on fluctuating asymmetry.
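
For readers unfamiliar with the indices involved, fluctuating asymmetry is often summarized as the mean absolute difference between right and left trait values, with a log-scale variant when errors are multiplicative. The Go sketch below computes both on invented measurements.

```go
// Sketch of two common fluctuating-asymmetry summaries: mean |R - L| on
// the raw scale (additive error) and mean |ln R - ln L| (size-scaled,
// suited to multiplicative error). Measurements are invented.
package main

import (
	"fmt"
	"math"
)

func main() {
	right := []float64{10.2, 11.5, 9.8, 10.9}
	left := []float64{10.0, 11.9, 9.9, 10.4}
	var sumAbs, sumLog float64
	for i := range right {
		sumAbs += math.Abs(right[i] - left[i])
		sumLog += math.Abs(math.Log(right[i]) - math.Log(left[i]))
	}
	n := float64(len(right))
	fmt.Printf("FA (mean |R-L|):       %.4f\n", sumAbs/n)
	fmt.Printf("FA (mean |lnR - lnL|): %.4f\n", sumLog/n)
}
```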


Author(s):  
Shyam Prabhakaran ◽  
Renee M Sednew ◽  
Kathleen O’Neill

Background: There remain significant opportunities to reduce door-to-needle (DTN) times for stroke despite regional and national efforts. In Chicago, Quality Enhancement for the Speedy Thrombolysis for Stroke (QUESTS) was a one-year learning collaborative (LC) that aimed to reduce DTN times at 15 Chicago Primary Stroke Centers. Identification of barriers and sharing of best practices resulted in achieving DTN times < 60 minutes within the first quarter of the 2013 initiative, with progress sustained to date. Aligned with Target: Stroke goals, QUESTS 2.0, funded for the 2016 calendar year, invited 9 additional metropolitan Chicago area hospitals to collaborate and further reduce DTN times to a goal of < 45 minutes in 50% of eligible patients. Methods: All 24 hospitals participate in the Get With The Guidelines (GWTG) Stroke registry and benchmark group to track DTN performance improvement in 2016. Hospitals implement the American Heart Association's Target: Stroke program and share best practices uniquely implemented at sites to reduce DTN times. The LC included a quality and performance improvement leader, a stroke content expert, site visits, quarterly meetings and learning sessions, and reporting of experiences and data. Results: In 2015, the year prior to QUESTS 2.0, the proportion of patients treated with tPA within 45 minutes of hospital arrival increased from 21.6% in Q1 to 31.4% in Q2. During the 2016 funded year, this proportion changed from 31.6% in Q1 to 48.3% in Q2. Conclusions: Using a learning collaborative model to implement DTN-reduction strategies among 24 Chicago-area hospitals continues to improve treatment times. Regional collaboration, data sharing, and best-practice sharing should be a model for rapid and sustainable system-wide quality improvement.

