Multithreaded variant calling in elPrep 5

We present elPrep 5, which updates the elPrep framework for processing sequencing alignment/map files with variant calling. elPrep 5 can now execute the full pipeline described by the GATK Best Practices for variant calling, which consists of PCR and optical duplicate marking, sorting by coordinate order, base quality score recalibration, and variant calling using the haplotype caller algorithm. elPrep 5 produces BAM and VCF output identical to that of GATK 4 while significantly reducing the runtime by parallelizing and merging the execution of the pipeline steps. Our benchmarks show that elPrep 5 speeds up the variant calling pipeline by a factor of 8 to 16 on both whole-exome and whole-genome data while using the same hardware resources as GATK 4. This makes elPrep 5 a suitable drop-in replacement for GATK 4 when faster execution times are needed.

PLoS ONE ◽

10.1371/journal.pone.0244471 ◽

2021 ◽

Vol 16 (2) ◽

pp. e0244471

Author(s):

Charlotte Herzeel ◽

Pascal Costanza ◽

Dries Decap ◽

Jan Fostier ◽

Roel Wuyts ◽

...

Keyword(s):

Best Practices ◽

Variant Calling ◽

Quality Score ◽

Whole Genome ◽

Genome Data ◽

Whole Exome ◽

Base Quality Score ◽

Execution Times

elPrep: A multithreaded framework for sequence analysis

10.1101/492249 ◽

2018 ◽

Author(s):

Charlotte Herzeel ◽

Pascal Costanza ◽

Dries Decap ◽

Jan Fostier ◽

Wilfried Verachtert

Keyword(s):

Sequence Analysis ◽

Best Practices ◽

Programming Language ◽

Sequence Alignment ◽

Resource Use ◽

Best Practice ◽

Variant Calling ◽

Quality Score ◽

Parallel Execution ◽

Base Quality Score

We present elPrep 4, a reimplementation from scratch of the elPrep framework for processing sequence alignment map files in the Go programming language. elPrep 4 includes multiple new features allowing us to process all of the preparation steps defined by the GATK Best Practice pipelines for variant calling. This includes new and improved functionality for sorting, (optical) duplicate marking, base quality score recalibration, BED and VCF parsing, and various filtering options. The implementations of these options in elPrep 4 faithfully reproduce the outcomes of their counterparts in GATK 4, SAMtools, and Picard, even though the underlying algorithms are redesigned to take advantage of elPrep's parallel execution framework to vastly improve the runtime and resource use compared to these tools. Our benchmarks show that elPrep executes the preparation steps of the GATK Best Practices up to 13x faster on WES data, and up to 7.4x faster for WGS data compared to running the same pipeline with GATK 4, while utilizing fewer compute resources.

Estimating sequencing error rates using families

BioData Mining ◽

10.1186/s13040-021-00259-6 ◽

2021 ◽

Vol 14 (1) ◽

Author(s):

Kelley Paskov ◽

Jae-Yoon Jung ◽

Brianna Chrisman ◽

Nate T. Stockham ◽

Peter Washington ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Exome Sequencing ◽

Genome Sequencing ◽

Variant Calling ◽

Error Rates ◽

Sequencing Error ◽

Whole Genome ◽

Sequencing Data ◽

Sequencing Platform ◽

Whole Exome

Background: As next-generation sequencing technologies make their way into the clinic, knowledge of their error rates is essential if they are to be used to guide patient care. However, sequencing platforms and variant-calling pipelines are continuously evolving, making it difficult to accurately quantify error rates for the particular combination of assay and software parameters used on each sample. Family data provide a unique opportunity for estimating sequencing error rates, since they allow us to observe a fraction of sequencing errors as Mendelian errors in the family, which we can then use to produce genome-wide error estimates for each sample.
Results: We introduce a method that uses Mendelian errors in sequencing data to make highly granular per-sample estimates of precision and recall for any set of variant calls, regardless of sequencing platform or calling methodology. We validate the accuracy of our estimates using monozygotic twins, and we use a set of monozygotic quadruplets to show that our predictions closely match the consensus method. We demonstrate our method’s versatility by estimating sequencing error rates for whole genome sequencing, whole exome sequencing, and microarray datasets, and we highlight its sensitivity by quantifying performance increases between different versions of the GATK variant-calling pipeline. We then use our method to demonstrate that: 1) Sequencing error rates between samples in the same dataset can vary by over an order of magnitude. 2) Variant calling performance decreases substantially in low-complexity regions of the genome. 3) Variant calling performance in whole exome sequencing data decreases with distance from the nearest target region. 4) Variant calls from lymphoblastoid cell lines can be as accurate as those from whole blood. 5) Whole-genome sequencing can attain microarray-level precision and recall at disease-associated SNV sites.
Conclusion: Genotype datasets from families are powerful resources that can be used to make fine-grained estimates of sequencing error for any sequencing platform and variant-calling methodology.
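
To make the notion of a Mendelian error concrete, the following is a minimal Python sketch of a consistency check for a single parent-child trio. It is not the authors' estimation method, which models precision and recall statistically; it only illustrates which observations count as Mendelian errors. The function names, the tuple-of-allele-indices genotype encoding (0 for reference, 1 for alternate), and the restriction to diploid autosomal sites are assumptions made for this illustration.

```python
from itertools import product


def is_mendelian_consistent(child, mother, father):
    """Return True if the child genotype can be formed by taking
    one allele from the mother and one allele from the father.

    Genotypes are unordered pairs of allele indices, e.g. (0, 1).
    Assumes diploid autosomal sites (hypothetical simplification).
    """
    target = tuple(sorted(child))
    return any(
        tuple(sorted((m, f))) == target
        for m, f in product(mother, father)
    )


def mendelian_error_rate(trios):
    """Fraction of sites whose child genotype is inconsistent with
    the parental genotypes. `trios` is an iterable of
    (child, mother, father) genotype triples."""
    trios = list(trios)
    if not trios:
        return 0.0
    errors = sum(
        not is_mendelian_consistent(c, m, f) for c, m, f in trios
    )
    return errors / len(trios)
```

For example, a child called (1, 1) with parents (0, 0) and (0, 1) is flagged, because the mother cannot contribute an alternate allele. Only a fraction of sequencing errors show up this way, which is why the abstract describes extrapolating from them to genome-wide estimates.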

Best practices for the analytical validation of clinical whole-genome sequencing intended for the diagnosis of germline disease

npj Genomic Medicine ◽

10.1038/s41525-020-00154-9 ◽

2020 ◽

Vol 5 (1) ◽

Author(s):

Christian R. Marshall ◽

Shimul Chowdhury ◽

Ryan J. Taft ◽

Mathew S. Lebo ◽

...

Keyword(s):

Best Practices ◽

Whole Genome Sequencing ◽

Genome Sequencing ◽

Test Performance ◽

Genetic Disorders ◽

Chromosomal Microarray ◽

Chromosomal Microarray Analysis ◽

Whole Genome ◽

Analytical Validation ◽

Whole Exome

Whole-genome sequencing (WGS) has shown promise in becoming a first-tier diagnostic test for patients with rare genetic disorders; however, standards addressing the definition and deployment practice of a best-in-class test are lacking. To address these gaps, the Medical Genome Initiative, a consortium of leading healthcare and research organizations in the US and Canada, was formed to expand access to high-quality clinical WGS by publishing best practices. Here, we present consensus recommendations on clinical WGS analytical validation for the diagnosis of individuals with suspected germline disease with a focus on test development, upfront considerations for test design, test validation practices, and metrics to monitor test performance. This work also provides insight into the current state of WGS testing at each member institution, including the utilization of reference and other standards across sites. Importantly, members of this initiative strongly believe that clinical WGS is an appropriate first-tier test for patients with rare genetic disorders, and at minimum is ready to replace chromosomal microarray analysis and whole-exome sequencing. The recommendations presented here should reduce the burden on laboratories introducing WGS into clinical practice, and support safe and effective WGS testing for diagnosis of germline disease.

Lacer: accurate base quality score recalibration for improving variant calling from next-generation sequencing data in any organism

10.1101/130732 ◽

2017 ◽

Author(s):

Jade C.S. Chung ◽

Swaine L. Chen

Keyword(s):

Next Generation Sequencing ◽

Variant Calling ◽

Quality Score ◽

Identification Accuracy ◽

Next Generation Sequencing Data ◽

Sequencing Error ◽

Next Generation ◽

Sequencing Data ◽

Base Quality Score ◽

Generation Sequencing

Next-generation sequencing data is accompanied by quality scores that quantify sequencing error. Inaccuracies in these quality scores propagate through all subsequent analyses; thus base quality score recalibration is a standard step in many next-generation sequencing workflows, resulting in improved variant calls. Current base quality score recalibration algorithms rely on the assumption that sequencing errors are already known; for human resequencing data, relatively complete variant databases facilitate this. However, because existing databases are still incomplete, recalibration is still inaccurate; and most organisms do not have variant databases, exacerbating inaccuracy for non-human data. To overcome these logical and practical problems, we introduce Lacer, which recalibrates base quality scores without assuming knowledge of correct and incorrect bases and without requiring knowledge of common variants. Lacer is the first logically sound, fully general, and truly accurate base recalibrator. Lacer enhances variant identification accuracy for resequencing data of human as well as other organisms (which are not accessible to current recalibrators), simultaneously improving and extending the benefits of base quality score recalibration to nearly all ongoing sequencing projects. Lacer is available at: https://github.com/swainechen/lacer.
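
As background on what a base quality score encodes, here is a small Python sketch of the standard Phred relationship Q = -10 log10(p) between a quality score Q and an error probability p, together with decoding a FASTQ quality string. It does not reproduce Lacer's recalibration algorithm. The function names are hypothetical, and the default ASCII offset of 33 assumes the common Phred+33 encoding.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred quality score to the error probability it claims."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability (0 < p <= 1) to a Phred quality score."""
    return -10 * math.log10(p)


def decode_fastq_qualities(qual_string, offset=33):
    """Decode a FASTQ quality string into integer Phred scores.

    The offset of 33 assumes Phred+33 encoding; older data may use 64.
    """
    return [ord(ch) - offset for ch in qual_string]
```

Under this scale, a reported score of 20 claims a 1 in 100 chance that the base is wrong, and a score of 30 claims 1 in 1000. Recalibration adjusts reported scores so that they match the empirically observed error rates.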

Seave: a comprehensive web platform for storing and interrogating human genomic variation

10.1101/258061 ◽

2018 ◽

Cited By ~ 3

Author(s):

Velimir Gayevskiy ◽

Tony Roscioli ◽

Marcel E Dinger ◽

Mark J Cowley

Keyword(s):

Cloud Computing ◽

Large Scale ◽

Variant Calling ◽

Genomic Variation ◽

Whole Genome ◽

Genome Data ◽

Pathogenicity Prediction ◽

Data Scaling ◽

Human Genomic ◽

Web Platform

Capability for genome sequencing and variant calling has increased dramatically, enabling large scale genomic interrogation of human disease. However, discovery is hindered by the current limitations in genomic interpretation, which remains a complicated and disjointed process. We introduce Seave, a web platform that enables variants to be easily filtered and annotated with in silico pathogenicity prediction scores and annotations from popular disease databases. Seave stores genomic variation of all types and sizes, and allows filtering for specific inheritance patterns, quality values, allele frequencies and gene lists. Seave is open source and deployable locally, or on a cloud computing provider, and works readily with gene panel, exome and whole genome data, scaling from single labs to multi-institution scale.

P4-044: THE GCAD CLOUD-BASED WORKFLOW FOR PROCESSING WHOLE EXOME AND WHOLE GENOME DATA FROM THE ALZHEIMER'S DISEASE SEQUENCING PROJECT

Alzheimer's & Dementia ◽

10.1016/j.jalz.2018.06.2446 ◽

2018 ◽

Vol 14 (7S_Part_27) ◽

pp. P1450-P1450

Author(s):

Prabhakaran Gangadharan ◽

Yuk Yee Leung ◽

Otto Valladares ◽

Yi-Fan Chou ◽

Amanda B. Kuzma ◽

...

Keyword(s):

Alzheimer's Disease ◽

Whole Genome ◽

Sequencing Project ◽

Genome Data ◽

Whole Exome

BALSA: Integrated secondary analysis for whole-genome and whole-exome sequencing, accelerated by GPU

10.7287/peerj.preprints.373v1 ◽

2014 ◽

Author(s):

Ruibang Luo ◽

Yiu-Lun Wong ◽

Wai-Chun Law ◽

Lap-Kei Lee ◽

Chi-Man Liu ◽

...

Keyword(s):

Exome Sequencing ◽

Whole Exome Sequencing ◽

Secondary Analysis ◽

Variant Calling ◽

Statistical Testing ◽

Next Generation Sequencing Data ◽

Whole Genome ◽

Sequencing Data ◽

Accurate Analysis ◽

Whole Exome

This paper reports an integrated solution, called BALSA, for the secondary analysis of next generation sequencing data; it exploits the computational power of the GPU and intricate memory management to give a fast and accurate analysis. From raw reads to variants (including SNPs and Indels), BALSA, using just a single computing node with a commodity GPU board, takes 5.5 hours to process 50-fold whole genome sequencing (~750 million 100bp paired-end reads), or just 25 minutes for 210-fold whole exome sequencing. BALSA’s speed is rooted in its parallel algorithms, which effectively exploit a GPU to speed up processes like alignment, realignment and statistical testing. BALSA incorporates a 16-genotype model to support the calling of SNPs and Indels and achieves competitive variant calling accuracy and sensitivity when compared to the ensemble of six popular variant callers. BALSA also supports efficient identification of somatic SNVs and CNVs; experiments showed that BALSA recovers all the previously validated somatic SNVs and CNVs, and it is more sensitive for somatic Indel detection. BALSA outputs variants in VCF format. A pileup-like SNAPSHOT format, while maintaining the same fidelity as BAM in variant calling, enables efficient storage and indexing, and facilitates the App development of downstream analyses. BALSA is available at: http://sourceforge.net/p/balsa
