elPrep: A multithreaded framework for sequence analysis

2018 ◽  
Author(s):  
Charlotte Herzeel ◽  
Pascal Costanza ◽  
Dries Decap ◽  
Jan Fostier ◽  
Wilfried Verachtert

We present elPrep 4, a reimplementation from scratch of the elPrep framework for processing sequence alignment map files in the Go programming language. elPrep 4 includes multiple new features allowing us to process all of the preparation steps defined by the GATK Best Practice pipelines for variant calling. This includes new and improved functionality for sorting, (optical) duplicate marking, base quality score recalibration, BED and VCF parsing, and various filtering options. The implementations of these options in elPrep 4 faithfully reproduce the outcomes of their counterparts in GATK 4, SAMtools, and Picard, even though the underlying algorithms are redesigned to take advantage of elPrep's parallel execution framework to vastly improve the runtime and resource use compared to these tools. Our benchmarks show that elPrep executes the preparation steps of the GATK Best Practices up to 13x faster on WES data and up to 7.4x faster on WGS data compared to running the same pipeline with GATK 4, while utilizing fewer compute resources.
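
The parallel execution framework mentioned above composes per-read operations into filters that run concurrently over the alignment stream. The Go sketch below illustrates that general pattern under invented types; the Read struct and Filter signature are hypothetical stand-ins, not elPrep's actual API.

```go
// Minimal sketch of a parallel record-filtering pipeline, in the spirit
// of a multithreaded SAM-processing framework; all types are invented.
package main

import (
	"fmt"
	"sync"
)

// Read stands in for a SAM alignment record (simplified).
type Read struct {
	Name string
	MapQ int
	Flag int
}

// Filter transforms a read; returning false drops it from the stream.
type Filter func(*Read) bool

// run fans records out to workers, applies the filter chain to each
// record independently, and collects the survivors.
func run(in []Read, filters []Filter, workers int) []Read {
	jobs := make(chan Read)
	results := make(chan Read)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for r := range jobs {
				keep := true
				for _, f := range filters {
					if !f(&r) {
						keep = false
						break
					}
				}
				if keep {
					results <- r
				}
			}
		}()
	}
	go func() {
		for _, r := range in {
			jobs <- r
		}
		close(jobs)
		wg.Wait()
		close(results)
	}()
	var out []Read
	for r := range results {
		out = append(out, r)
	}
	return out
}

func main() {
	reads := []Read{{"r1", 60, 0}, {"r2", 0, 4}, {"r3", 30, 0}}
	// Drop unmapped reads (flag bit 0x4) and low-mapping-quality reads.
	filters := []Filter{
		func(r *Read) bool { return r.Flag&0x4 == 0 },
		func(r *Read) bool { return r.MapQ >= 20 },
	}
	for _, r := range run(reads, filters, 4) {
		fmt.Println(r.Name)
	}
}
```

Because each record is processed independently, throughput scales with the number of workers until input parsing or memory bandwidth becomes the bottleneck.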


PLoS ONE ◽  
2021 ◽  
Vol 16 (2) ◽  
pp. e0244471
Author(s):  
Charlotte Herzeel ◽  
Pascal Costanza ◽  
Dries Decap ◽  
Jan Fostier ◽  
Roel Wuyts ◽  
...  

We present elPrep 5, which updates the elPrep framework for processing sequencing alignment/map files with variant calling. elPrep 5 can now execute the full pipeline described by the GATK Best Practices for variant calling, which consists of PCR and optical duplicate marking, sorting by coordinate order, base quality score recalibration, and variant calling using the haplotype caller algorithm. elPrep 5 produces the same BAM and VCF output as GATK 4 while significantly reducing the runtime by parallelizing and merging the execution of the pipeline steps. Our benchmarks show that elPrep 5 speeds up the variant calling pipeline by a factor of 8-16x on both whole-exome and whole-genome data while using the same hardware resources as GATK 4. This makes elPrep 5 a suitable drop-in replacement for GATK 4 when faster execution times are needed.
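
One way to read "merging the execution of the pipeline steps" is that every per-read phase is driven by a single traversal of the data rather than one pass per tool. The sketch below illustrates that idea with invented Phase types; it is not elPrep's actual architecture, and the duplicate-marking criterion is deliberately toy.

```go
// Illustrative sketch of merging pipeline phases into a single pass over
// the data; the Phase interface and phases shown are hypothetical.
package main

import "fmt"

type Read struct {
	Name string
	Pos  int
	Dup  bool
}

// Phase sees every read exactly once; Finish runs after the single pass.
type Phase interface {
	Process(r *Read)
	Finish()
}

// markDuplicates flags reads that share a start position (toy criterion;
// real duplicate marking also compares library, strand, and more).
type markDuplicates struct{ seen map[int]bool }

func (m *markDuplicates) Process(r *Read) {
	if m.seen[r.Pos] {
		r.Dup = true
	}
	m.seen[r.Pos] = true
}
func (m *markDuplicates) Finish() {}

// countReads stands in for a phase that accumulates statistics,
// e.g. the tallies behind base quality score recalibration.
type countReads struct{ n int }

func (c *countReads) Process(r *Read) { c.n++ }
func (c *countReads) Finish()         { fmt.Println("reads seen:", c.n) }

func main() {
	reads := []Read{{"r1", 100, false}, {"r2", 100, false}, {"r3", 200, false}}
	phases := []Phase{&markDuplicates{seen: map[int]bool{}}, &countReads{}}
	// One traversal drives every phase, instead of one pass per tool.
	for i := range reads {
		for _, p := range phases {
			p.Process(&reads[i])
		}
	}
	for _, p := range phases {
		p.Finish()
	}
	for _, r := range reads {
		fmt.Println(r.Name, "duplicate:", r.Dup)
	}
}
```

Driving all phases from one loop is what lets a merged pipeline avoid re-reading and re-writing intermediate BAM files between steps.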


2020 ◽  
Vol 12 (1) ◽  
Author(s):  
Daniel C. Koboldt

Abstract Next-generation sequencing technologies have enabled a dramatic expansion of clinical genetic testing both for inherited conditions and diseases such as cancer. Accurate variant calling in NGS data is a critical step upon which virtually all downstream analysis and interpretation processes rely. Just as NGS technologies have evolved considerably over the past 10 years, so too have the software tools and approaches for detecting sequence variants in clinical samples. In this review, I discuss the current best practices for variant calling in clinical sequencing studies, with a particular emphasis on trio sequencing for inherited disorders and somatic mutation detection in cancer patients. I describe the relative strengths and weaknesses of panel, exome, and whole-genome sequencing for variant detection. Recommended tools and strategies for calling variants of different classes are also provided, along with guidance on variant review, validation, and benchmarking to ensure optimal performance. Although NGS technologies are continually evolving, and new capabilities (such as long-read single-molecule sequencing) are emerging, the “best practice” principles in this review should be relevant to clinical variant calling in the long term.
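
As a concrete illustration of the trio analysis mentioned above, a candidate de novo variant is a child allele observed in neither parent. The Go sketch below encodes that check under simplified assumptions (clean diploid genotype calls, no genotype likelihoods or depth filters); all types and names are hypothetical.

```go
// Toy check for candidate de novo variants in a sequenced trio: flag a
// child allele seen in neither parent. Real pipelines also weigh genotype
// likelihoods, depth, and strand evidence.
package main

import "fmt"

type Genotype [2]string // two alleles at one site, e.g. {"A", "G"}

func hasAllele(g Genotype, allele string) bool {
	return g[0] == allele || g[1] == allele
}

// candidateDeNovo reports child alleles absent from both parents.
func candidateDeNovo(child, mother, father Genotype) []string {
	var novel []string
	for _, a := range child {
		if !hasAllele(mother, a) && !hasAllele(father, a) {
			novel = append(novel, a)
		}
	}
	return novel
}

func main() {
	mother := Genotype{"A", "A"}
	father := Genotype{"A", "G"}
	child := Genotype{"A", "T"} // "T" occurs in neither parent
	fmt.Println("candidate de novo alleles:", candidateDeNovo(child, mother, father))
}
```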


2017 ◽  
Author(s):  
Jade C.S. Chung ◽  
Swaine L. Chen

Abstract Next-generation sequencing data is accompanied by quality scores that quantify sequencing error. Inaccuracies in these quality scores propagate through all subsequent analyses; thus base quality score recalibration is a standard step in many next-generation sequencing workflows, resulting in improved variant calls. Current base quality score recalibration algorithms rely on the assumption that sequencing errors are already known; for human resequencing data, relatively complete variant databases facilitate this. However, because existing databases are still incomplete, recalibration remains inaccurate, and most organisms have no variant database at all, exacerbating the inaccuracy for non-human data. To overcome these logical and practical problems, we introduce Lacer, which recalibrates base quality scores without assuming knowledge of correct and incorrect bases and without requiring knowledge of common variants. Lacer is the first logically sound, fully general, and truly accurate base recalibrator. Lacer enhances variant identification accuracy for resequencing data of humans as well as other organisms (which are not accessible to current recalibrators), simultaneously improving and extending the benefits of base quality score recalibration to nearly all ongoing sequencing projects. Lacer is available at: https://github.com/swainechen/lacer.
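
For context, a Phred quality score encodes an error probability as Q = -10 * log10(p). Recalibration compares each reported score against the error rate empirically observed among bases that reported it. The sketch below shows that core arithmetic with invented counts; real recalibrators stratify further (read group, machine cycle, sequence context) and, in database-driven approaches, exclude known variant sites from the error tally.

```go
// Sketch of the arithmetic at the heart of base quality score
// recalibration: compare each reported Phred quality with the error rate
// actually observed at bases reporting that quality. Counts are made up.
package main

import (
	"fmt"
	"math"
)

// empiricalPhred converts an observed error rate into a Phred score,
// with +1/+2 smoothing so zero observed errors stay finite.
func empiricalPhred(errors, observations int64) float64 {
	p := (float64(errors) + 1) / (float64(observations) + 2)
	return -10 * math.Log10(p)
}

func main() {
	// Observations and mismatches per reported quality bin (hypothetical).
	type bin struct {
		reportedQ int
		obs, errs int64
	}
	bins := []bin{
		{20, 1_000_000, 15_000}, // observed error ~1.5%, reported 1%
		{30, 2_000_000, 1_000},  // observed error ~0.05%, reported 0.1%
	}
	for _, b := range bins {
		fmt.Printf("reported Q%d -> empirical Q%.1f\n",
			b.reportedQ, empiricalPhred(b.errs, b.obs))
	}
}
```

With these invented counts, a base reported at Q20 turns out to behave like roughly Q18, so a recalibrator would adjust its score downward.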


2020 ◽  
Author(s):  
Olga Krasheninina ◽  
Yih-Chii Hwang ◽  
Xiaodong Bai ◽  
Aleksandra Zalcman ◽  
Evan Maxwell ◽  
...  

Abstract Standardized genome informatics protocols minimize reprocessing costs and facilitate harmonization across studies if implemented in a transparent, accessible and reproducible manner. Here we define the OQFE protocol, a lossless read-mapping protocol that retains key features of existing NGS standard methods. We demonstrate that variants can be called directly from NovaSeq OQFE data without the need for base quality score recalibration and describe a large-scale variant calling protocol for OQFE data. The OQFE protocol is open-source and a containerized implementation is provided.
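
"Lossless" here means the original reads are exactly recoverable from the aligned output. The toy Go sketch below shows the round-trip for the strand convention used in SAM/BAM, where reverse-strand reads are stored reverse-complemented; the record type is a simplification, not the OQFE format itself.

```go
// Toy illustration of what lossless read mapping promises: the original
// read is exactly recoverable from the stored aligned record.
package main

import "fmt"

type alignedRecord struct {
	Name          string
	Seq           string
	ReverseStrand bool
}

func revComp(s string) string {
	comp := map[byte]byte{'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C', 'N': 'N'}
	out := make([]byte, len(s))
	for i := 0; i < len(s); i++ {
		out[len(s)-1-i] = comp[s[i]]
	}
	return string(out)
}

// originalRead recovers the as-sequenced read from the stored record.
func originalRead(r alignedRecord) string {
	if r.ReverseStrand {
		return revComp(r.Seq)
	}
	return r.Seq
}

func main() {
	original := "ACGTTAGC"
	stored := alignedRecord{Name: "r1", Seq: revComp(original), ReverseStrand: true}
	recovered := originalRead(stored)
	fmt.Println("round-trip lossless:", recovered == original)
}
```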


2021 ◽  
Author(s):  
Jochen Bathke ◽  
Gesine Lühken

Background Next-generation sequencing technologies are opening new doors to researchers. One application is the direct discovery of sequence variants that are causative for a phenotypic trait or a disease. The detection of an organism's deviations from a reference genome is known as variant calling, a computational task involving a complex chain of software applications. One key player in the field is the Genome Analysis Toolkit (GATK). The GATK Best Practices are a commonly cited recipe for variant calling on human sequencing data. Yet the fact that the Best Practices are highly specialized for human sequencing data and are constantly evolving is often ignored. This hampers reproducibility and leads to the continual reinvention of purported GATK Best Practice workflows. Results Here we present an automated variant calling workflow for the detection of SNPs and indels that is broadly applicable to model as well as non-model diploid organisms. It is derived from the GATK Best Practice workflow for "Germline short variant discovery", without being restricted to human sequencing data. The workflow has been extensively optimized to parallelize data evaluation and to maximize the performance of individual applications, shortening the overall analysis time. Optimized Java garbage collection and heap size settings for the GATK applications SortSam, MarkDuplicates, HaplotypeCaller, and GatherVcfs were determined by thorough benchmarking. In doing so, the runtime of an example data evaluation was reduced from 67 h to less than 35 h. Conclusions The demand for standardized variant calling workflows is growing in proportion to the dropping costs of next-generation sequencing methods. Our workflow fits this niche, offering automation, reproducibility, and documentation of the variant calling process, while keeping resource usage to a minimum. Variant calling projects should thereby become more standardized, further lowering the barrier for smaller institutions or groups.
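
GATK tools run on the JVM, so heap size and garbage collector settings are passed through the gatk wrapper's --java-options flag. The Go sketch below launches a tool that way; the specific -Xmx and GC values are illustrative placeholders, not the benchmark-derived settings from the paper, and the file paths are invented.

```go
// Hedged sketch of launching a GATK tool with tuned JVM settings from Go.
// gatk's --java-options flag and the -Xmx/-XX JVM options are standard;
// the values and paths below are examples only.
package main

import (
	"log"
	"os"
	"os/exec"
)

func main() {
	cmd := exec.Command("gatk",
		// JVM settings: 8 GiB heap, parallel collector with 4 GC threads
		// (example values; tune per tool and per machine).
		"--java-options", "-Xmx8g -XX:+UseParallelGC -XX:ParallelGCThreads=4",
		"MarkDuplicates",
		"-I", "input.bam", // placeholder paths
		"-O", "dedup.bam",
		"-M", "dup_metrics.txt",
	)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Run(); err != nil {
		log.Fatalf("gatk failed: %v", err)
	}
}
```

Equivalent per-tool tuning applies to the other applications the authors benchmark (SortSam, HaplotypeCaller, GatherVcfs).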


Symmetry ◽  
2020 ◽  
Vol 13 (1) ◽  
pp. 9
Author(s):  
John H. Graham

Best practices in studies of developmental instability, as measured by fluctuating asymmetry, have developed over the past 60 years. Unfortunately, they are haphazardly applied in many of the papers submitted for review. Most often, research designs suffer from lack of randomization, inadequate replication, poor attention to size scaling, lack of attention to measurement error, and unrecognized mixtures of additive and multiplicative errors. Here, I summarize a set of best practices, especially in studies that examine the effects of environmental stress on fluctuating asymmetry.
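
For readers unfamiliar with the indices involved, fluctuating asymmetry is often summarized as the mean absolute difference between right and left trait values, with a log-scale variant when errors are multiplicative. The Go sketch below computes both on invented measurements.

```go
// Sketch of two common fluctuating-asymmetry summaries: mean |R - L| on
// the raw scale (additive error) and mean |ln R - ln L| (size-scaled,
// suited to multiplicative error). Measurements are invented.
package main

import (
	"fmt"
	"math"
)

func main() {
	right := []float64{10.2, 11.5, 9.8, 10.9}
	left := []float64{10.0, 11.9, 9.9, 10.4}
	var sumAbs, sumLog float64
	for i := range right {
		sumAbs += math.Abs(right[i] - left[i])
		sumLog += math.Abs(math.Log(right[i]) - math.Log(left[i]))
	}
	n := float64(len(right))
	fmt.Printf("FA (mean |R-L|):       %.4f\n", sumAbs/n)
	fmt.Printf("FA (mean |lnR - lnL|): %.4f\n", sumLog/n)
}
```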


Author(s):  
Shyam Prabhakaran ◽  
Renee M Sednew ◽  
Kathleen O’Neill

Background: There remain significant opportunities to reduce door-to-needle (DTN) times for stroke despite regional and national efforts. In Chicago, Quality Enhancement for the Speedy Thrombolysis for Stroke (QUESTS) was a one-year learning collaborative (LC) that aimed to reduce DTN times at 15 Chicago Primary Stroke Centers. Identification of barriers and sharing of best practices resulted in achieving DTN times < 60 minutes within the first quarter of the 2013 initiative, with progress sustained to date. Aligned with Target: Stroke goals, QUESTS 2.0, funded for the 2016 calendar year, invited 9 additional metropolitan Chicago area hospitals to collaborate and further reduce DTN times to a goal of < 45 minutes in 50% of eligible patients. Methods: All 24 hospitals participate in the Get With The Guidelines (GWTG) Stroke registry and benchmark group to track DTN performance improvement in 2016. Hospitals implement the American Heart Association's Target: Stroke program and share best practices uniquely implemented at sites to reduce DTN times. The LC included a quality and performance improvement leader, a stroke content expert, site visits, quarterly meetings and learning sessions, and reporting of experiences and data. Results: In 2015, the year prior to QUESTS 2.0, the proportion of patients treated with tPA within 45 minutes of hospital arrival increased from 21.6% in Q1 to 31.4% in Q2. During the 2016 funded year, this proportion changed from 31.6% in Q1 to 48.3% in Q2. Conclusions: Using a learning collaborative model to implement DTN-reduction strategies among 24 Chicago-area hospitals continues to improve treatment times. Regional collaboration, data sharing, and best-practice sharing should be a model for rapid and sustainable system-wide quality improvement.

