Open-source mapping and variant calling for large-scale NGS data from original base-quality scores

Standardized genome informatics protocols minimize reprocessing costs and facilitate harmonization across studies if implemented in a transparent, accessible and reproducible manner. Here we define the OQFE protocol, a lossless read-mapping protocol that retains key features of existing NGS standard methods. We demonstrate that variants can be called directly from NovaSeq OQFE data without the need for base quality score recalibration and describe a large-scale variant calling protocol for OQFE data. The OQFE protocol is open-source and a containerized implementation is provided.

Lacer: accurate base quality score recalibration for improving variant calling from next-generation sequencing data in any organism

10.1101/130732 ◽

2017 ◽

Author(s):

Jade C.S. Chung ◽

Swaine L. Chen

Keyword(s):

Next Generation Sequencing ◽

Variant Calling ◽

Quality Score ◽

Identification Accuracy ◽

Next Generation Sequencing Data ◽

Sequencing Error ◽

Next Generation ◽

Sequencing Data ◽

Base Quality Score ◽

Generation Sequencing

Next-generation sequencing data is accompanied by quality scores that quantify sequencing error. Inaccuracies in these quality scores propagate through all subsequent analyses; thus base quality score recalibration is a standard step in many next-generation sequencing workflows, resulting in improved variant calls. Current base quality score recalibration algorithms rely on the assumption that sequencing errors are already known; for human resequencing data, relatively complete variant databases facilitate this. However, because existing databases are still incomplete, recalibration is still inaccurate; and most organisms do not have variant databases, exacerbating inaccuracy for non-human data. To overcome these logical and practical problems, we introduce Lacer, which recalibrates base quality scores without assuming knowledge of correct and incorrect bases and without requiring knowledge of common variants. Lacer is the first logically sound, fully general, and truly accurate base recalibrator. Lacer enhances variant identification accuracy for resequencing data of human as well as other organisms (which are not accessible to current recalibrators), simultaneously improving and extending the benefits of base quality score recalibration to nearly all ongoing sequencing projects. Lacer is available at: https://github.com/swainechen/lacer.
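
As a rough illustration of what "quality scores that quantify sequencing error" means, the sketch below converts between Phred-scaled quality scores and error probabilities and decodes a FASTQ quality string. It is not Lacer's algorithm; the function names are hypothetical, and the Phred+33 ASCII offset is an assumption that holds for most current Illumina output.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred quality score Q to an error probability: p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Convert an error probability p (0 < p <= 1) to a Phred score: Q = -10*log10(p)."""
    if not 0 < p <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(p)


def decode_fastq_qualities(qual: str, offset: int = 33) -> list[int]:
    """Decode a FASTQ quality string into integer Phred scores.

    The default offset of 33 (Phred+33) is an assumption; older data
    may use a different encoding.
    """
    return [ord(c) - offset for c in qual]


if __name__ == "__main__":
    scores = decode_fastq_qualities("I5!")
    for q in scores:
        print(q, phred_to_error_prob(q))
```

A recalibrator, in general terms, compares reported scores like these with the error rate actually observed and adjusts the scores where the two disagree.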

Target variant detection in leukemia using unaligned RNA-Seq reads

10.1101/295808 ◽

2018 ◽

Cited By ~ 1

Author(s):

Eric Olivier Audemard ◽

Patrick Gendron ◽

Vincent-Philippe Lavallée ◽

Josée Hébert ◽

Guy Sauvageau ◽

...

Keyword(s):

Variant Calling ◽

The Cancer Genome Atlas ◽

Rna Seq ◽

Read Mapping ◽

Targeted Mutation ◽

Cancer Genome Atlas ◽

Computationally Intensive ◽

And Performance ◽

Next Generation Sequencing Ngs ◽

Ngs Data

Mutations identified in each Acute Myeloid Leukemia (AML) patient are useful for prognosis and for selecting targeted therapies. Detection of such mutations by the analysis of Next-Generation Sequencing (NGS) data requires a computationally intensive read mapping step and application of several variant calling methods. Targeted mutation identification drastically shifts the usual tradeoff between accuracy and performance by concentrating all computations over a small portion of sequence space. Here, we present km, an efficient approach leveraging k-mer decomposition of reads to identify targeted mutations. Our approach is versatile, as it can detect single-base mutations, several types of insertions and deletions, as well as fusions. We used two independent AML cohorts (The Cancer Genome Atlas and Leucegene) to show that mutation detection by km is fast, accurate and mainly limited by sequencing depth. Therefore, km allows fast diagnostics to be established from NGS data, and could be suitable for clinical applications.
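
The idea of k-mer decomposition of unaligned reads can be sketched as follows. This is a minimal illustration, not the km implementation: the function names, the toy reads, and the use of the minimum k-mer count as a support measure are all assumptions made for the example.

```python
from collections import Counter


def kmers(seq: str, k: int) -> list[str]:
    """Return all overlapping substrings of length k in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def count_kmers(reads: list[str], k: int) -> Counter:
    """Count every k-mer across a collection of unaligned reads."""
    counts: Counter = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def support(target: str, counts: Counter, k: int) -> int:
    """Minimum count over the k-mers of a target sequence.

    A target is only as well supported as its weakest k-mer, so a
    value of 0 means the reads cannot spell out the whole target.
    """
    target_kmers = kmers(target, k)
    if not target_kmers:
        raise ValueError("target must be at least k bases long")
    return min(counts[km] for km in target_kmers)


if __name__ == "__main__":
    # Toy reads carrying a hypothetical A>T substitution.
    reads = ["ACGTTCGGA", "CGTTCGGAT"]
    counts = count_kmers(reads, k=5)
    print("variant support:", support("CGTTCGG", counts, k=5))
    print("reference support:", support("CGTACGG", counts, k=5))
```

Because only the k-mers overlapping a target region are inspected, no genome-wide read mapping is needed in this kind of approach, which is the tradeoff the abstract describes.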

Comparison of Read Mapping and Variant Calling Tools for the Analysis of Plant NGS Data

Plants ◽

10.3390/plants9040439 ◽

2020 ◽

Vol 9 (4) ◽

pp. 439 ◽

Cited By ~ 3

Author(s):

Hanna Marie Schilbert ◽

Andreas Rempel ◽

Boas Pucker

Keyword(s):

High Throughput Sequencing ◽

Performance Metrics ◽

Model Organism ◽

Variant Calling ◽

Reference Sequence ◽

Read Mapping ◽

The Past ◽

Sequencing Technologies ◽

Plant Sciences ◽

Ngs Data

High-throughput sequencing technologies have rapidly developed during the past years and have become an essential tool in plant sciences. However, the analysis of genomic data remains challenging and relies mostly on the performance of automatic pipelines. Frequently applied pipelines involve the alignment of sequence reads against a reference sequence and the identification of sequence variants. Since most benchmarking studies of bioinformatics tools for this purpose have been conducted on human datasets, there is a lack of benchmarking studies in plant sciences. In this study, we evaluated the performance of 50 different variant calling pipelines, including five read mappers and ten variant callers, on six real plant datasets of the model organism Arabidopsis thaliana. Sets of variants were evaluated based on various parameters including sensitivity and specificity. We found that all investigated tools are suitable for analysis of NGS data in plant research. When looking at different performance metrics, BWA-MEM and Novoalign were the best mappers and GATK returned the best results in the variant calling step.
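
To make the evaluation metrics concrete, the following sketch computes sensitivity and specificity for a set of called variants against a truth set. It is not the authors' benchmarking code; the function name, the tuple representation of a variant, and the assumption of at most one variant per evaluated position are illustrative simplifications.

```python
Variant = tuple[str, int, str, str]  # (chromosome, position, ref, alt)


def sensitivity_specificity(
    called: set[Variant], truth: set[Variant], n_positions: int
) -> tuple[float, float]:
    """Return (sensitivity, specificity) for a variant call set.

    n_positions is the number of evaluated reference positions; true
    negatives are positions with neither a true nor a called variant
    (assuming at most one variant per position).
    """
    tp = len(called & truth)   # called and real
    fp = len(called - truth)   # called but not real
    fn = len(truth - called)   # real but missed
    tn = n_positions - tp - fp - fn
    if tn < 0:
        raise ValueError("n_positions is smaller than the number of variants")
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("metrics are undefined without positives or negatives")
    return tp / (tp + fn), tn / (tn + fp)


if __name__ == "__main__":
    truth = {("chr1", 100, "A", "T"), ("chr1", 200, "G", "C"), ("chr1", 300, "T", "G")}
    called = {("chr1", 100, "A", "T"), ("chr1", 200, "G", "C"), ("chr1", 400, "C", "A")}
    sens, spec = sensitivity_specificity(called, truth, n_positions=1000)
    print(f"sensitivity={sens:.3f} specificity={spec:.4f}")
```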

Sequence variation aware genome references and read mapping with the variation graph toolkit

10.1101/234856 ◽

2017 ◽

Cited By ~ 12

Author(s):

Erik Garrison ◽

Jouni Sirén ◽

Adam M. Novak ◽

Glenn Hickey ◽

Jordan M. Eizenga ◽

...

Keyword(s):

Dna Sequence ◽

Large Scale ◽

De Novo ◽

Sequence Data ◽

Variant Calling ◽

Read Mapping ◽

Dna Sequence Data ◽

Suffix Arrays ◽

Improved Accuracy ◽

Reference Genomes

Reference genomes guide our interpretation of DNA sequence data. However, conventional linear references are fundamentally limited in that they represent only one version of each locus, whereas the population may contain multiple variants. When the reference represents an individual’s genome poorly, it can impact read mapping and introduce bias. Variation graphs are bidirected DNA sequence graphs that compactly represent genetic variation, including large scale structural variation such as inversions and duplications [1]. Equivalent structures are produced by de novo genome assemblers [2,3]. Here we present vg, a toolkit of computational methods for creating, manipulating, and utilizing these structures as references at the scale of the human genome. vg provides an efficient approach to mapping reads onto arbitrary variation graphs using generalized compressed suffix arrays [4], with improved accuracy over alignment to a linear reference, creating data structures to support downstream variant calling and genotyping. These capabilities make using variation graphs as reference structures for DNA sequencing practical at the scale of vertebrate genomes, or at the topological complexity of new species assemblies.

Comparison of read mapping and variant calling tools for the analysis of plant NGS data

10.1101/2020.03.10.986059 ◽

2020 ◽

Author(s):

Hanna Marie Schilbert ◽

Andreas Rempel ◽

Boas Pucker

Keyword(s):

High Throughput Sequencing ◽

Model Organism ◽

Variant Calling ◽

Reference Sequence ◽

Read Mapping ◽

The Past ◽

Sequencing Technologies ◽

Plant Sciences ◽

Ngs Data ◽

Real Plant

High-throughput sequencing technologies have rapidly developed during the past years and have become an essential tool in plant sciences. However, the analysis of genomic data remains challenging and relies mostly on the performance of automatic pipelines. Frequently applied pipelines involve the alignment of sequence reads against a reference sequence and the identification of sequence variants. Since most benchmarking studies of bioinformatics tools for this purpose have been conducted on human datasets, there is a lack of benchmarking studies in plant sciences. In this study, we evaluated the performance of 50 different variant calling pipelines, including five read mappers and ten variant callers, on six real plant datasets of the model organism Arabidopsis thaliana. Sets of variants were evaluated based on various parameters including sensitivity and specificity. We found that all investigated tools are suitable for analysis of NGS data in plant research. When looking at different performance metrics, BWA-MEM and Novoalign were the best mappers and GATK returned the best results in the variant calling step.

Multithreaded variant calling in elPrep 5

10.1101/2020.12.11.421073 ◽

2020 ◽

Author(s):

Charlotte Herzeel ◽

Pascal Costanza ◽

Dries Decap ◽

Jan Fostier ◽

Roel Wuyts ◽

...

Keyword(s):

Best Practices ◽

Variant Calling ◽

Quality Score ◽

Whole Genome ◽

Genome Data ◽

Whole Exome ◽

Base Quality Score ◽

Execution Times

We present elPrep 5, which updates the elPrep framework for processing sequencing alignment/map files with variant calling. elPrep 5 can now execute the full pipeline described by the GATK Best Practices for variant calling, which consists of PCR and optical duplicate marking, sorting by coordinate order, base quality score recalibration, and variant calling using the haplotype caller algorithm. elPrep 5 produces the same BAM and VCF output as GATK 4 while significantly reducing the runtime by parallelizing and merging the execution of the pipeline steps. Our benchmarks show that elPrep 5 speeds up the runtime of the variant calling pipeline by a factor of 8-16x on both whole-exome and whole-genome data while using the same hardware resources as GATK 4. This makes elPrep 5 a suitable drop-in replacement for GATK 4 when faster execution times are needed.

elPrep: A multithreaded framework for sequence analysis

10.1101/492249 ◽

2018 ◽

Author(s):

Charlotte Herzeel ◽

Pascal Costanza ◽

Dries Decap ◽

Jan Fostier ◽

Wilfried Verachtert

Keyword(s):

Sequence Analysis ◽

Best Practices ◽

Programming Language ◽

Sequence Alignment ◽

Resource Use ◽

Best Practice ◽

Variant Calling ◽

Quality Score ◽

Parallel Execution ◽

Base Quality Score

We present elPrep 4, a reimplementation from scratch of the elPrep framework for processing sequence alignment map files in the Go programming language. elPrep 4 includes multiple new features allowing us to process all of the preparation steps defined by the GATK Best Practice pipelines for variant calling. This includes new and improved functionality for sorting, (optical) duplicate marking, base quality score recalibration, BED and VCF parsing, and various filtering options. The implementations of these options in elPrep 4 faithfully reproduce the outcomes of their counterparts in GATK 4, SAMtools, and Picard, even though the underlying algorithms are redesigned to take advantage of elPrep's parallel execution framework to vastly improve the runtime and resource use compared to these tools. Our benchmarks show that elPrep executes the preparation steps of the GATK Best Practices up to 13x faster on WES data, and up to 7.4x faster for WGS data compared to running the same pipeline with GATK 4, while utilizing fewer compute resources.

Multithreaded variant calling in elPrep 5

PLoS ONE ◽

10.1371/journal.pone.0244471 ◽

2021 ◽

Vol 16 (2) ◽

pp. e0244471

Author(s):

Charlotte Herzeel ◽

Pascal Costanza ◽

Dries Decap ◽

Jan Fostier ◽

Roel Wuyts ◽

...

Keyword(s):

Best Practices ◽

Variant Calling ◽

Quality Score ◽

Whole Genome ◽

Genome Data ◽

Whole Exome ◽

Base Quality Score ◽

Execution Times

We present elPrep 5, which updates the elPrep framework for processing sequencing alignment/map files with variant calling. elPrep 5 can now execute the full pipeline described by the GATK Best Practices for variant calling, which consists of PCR and optical duplicate marking, sorting by coordinate order, base quality score recalibration, and variant calling using the haplotype caller algorithm. elPrep 5 produces the same BAM and VCF output as GATK 4 while significantly reducing the runtime by parallelizing and merging the execution of the pipeline steps. Our benchmarks show that elPrep 5 speeds up the runtime of the variant calling pipeline by a factor of 8-16x on both whole-exome and whole-genome data while using the same hardware resources as GATK 4. This makes elPrep 5 a suitable drop-in replacement for GATK 4 when faster execution times are needed.

Faculty Opinions recommendation of A Scalable Open-Source Pipeline for Large-Scale Root Phenotyping of Arabidopsis.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.718447140.793507250 ◽

2015 ◽

Author(s):

José Dinneny

Keyword(s):

Open Source ◽

Large Scale ◽

Root Phenotyping

A standardized framework for testing the performance of sleep-tracking technology: step-by-step guidelines and open-source code

SLEEP ◽

10.1093/sleep/zsaa170 ◽

2020 ◽

Author(s):

Luca Menghini ◽

Nicola Cellini ◽

Aimee Goldstone ◽

Fiona C Baker ◽

Massimiliano de Zambotti

Keyword(s):

Open Source ◽

Validation Studies ◽

Large Scale ◽

Analytical Framework ◽

Clinical Settings ◽

Large Scale Data ◽

Fast Pace ◽

Epoch Analysis ◽

Tracking Devices

Sleep-tracking devices, particularly within the consumer sleep technology (CST) space, are increasingly used in both research and clinical settings, providing new opportunities for large-scale data collection in highly ecological conditions. Due to the fast pace of the CST industry combined with the lack of a standardized framework to evaluate the performance of sleep trackers, their accuracy and reliability in measuring sleep remain largely unknown. Here, we provide a step-by-step analytical framework for evaluating the performance of sleep trackers (including standard actigraphy), as compared with gold-standard polysomnography (PSG) or other reference methods. The analytical guidelines are based on recent recommendations for evaluating and using CST from our group and others (de Zambotti and colleagues; Depner and colleagues), and include raw data organization as well as critical analytical procedures, including discrepancy analysis, Bland–Altman plots, and epoch-by-epoch analysis. Analytical steps are accompanied by open-source R functions (depicted at https://sri-human-sleep.github.io/sleep-trackers-performance/AnalyticalPipeline_v1.0.0.html). In addition, an empirical sample dataset is used to describe and discuss the main outcomes of the proposed pipeline. The guidelines and the accompanying functions are aimed at standardizing the testing of CST performance, not only to increase the replicability of validation studies, but also to provide ready-to-use tools to researchers and clinicians. All in all, this work can help to increase the efficiency, interpretation, and quality of validation studies, and to improve the informed adoption of CST in research and clinical settings.
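
The discrepancy analysis behind a Bland–Altman plot can be sketched in a few lines. The authors provide R functions; the Python version below is only an illustrative stand-in, with a hypothetical function name, made-up total sleep time values, and the conventional 1.96 standard deviation limits of agreement assumed.

```python
from statistics import mean, stdev


def bland_altman(device: list[float], reference: list[float]) -> tuple[float, float, float]:
    """Return (bias, lower limit, upper limit) of agreement.

    bias is the mean device-minus-reference difference; the limits are
    bias +/- 1.96 sample standard deviations of those differences.
    """
    if len(device) != len(reference):
        raise ValueError("device and reference must have the same length")
    if len(device) < 2:
        raise ValueError("at least two paired measurements are required")
    diffs = [d - r for d, r in zip(device, reference)]
    bias = mean(diffs)
    spread = 1.96 * stdev(diffs)
    return bias, bias - spread, bias + spread


if __name__ == "__main__":
    # Hypothetical total sleep time in minutes for four nights.
    tracker = [400.0, 420.0, 390.0, 410.0]
    psg = [410.0, 415.0, 400.0, 405.0]
    bias, lower, upper = bland_altman(tracker, psg)
    print(f"bias={bias:.1f} min, limits of agreement=({lower:.1f}, {upper:.1f})")
```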
