Comparison of three variant callers for human whole genome sequencing

ABSTRACTTesting of patients with genetics-related disorders is in progress of shifting from single gene assays to gene panel sequencing, whole-exome sequencing (WES) and whole-genome sequencing (WGS). Since WGS is unquestionably becoming a new foundation for molecular analyses, we decided to compare three currently used tools for variant calling of human whole genome sequencing data. We tested DeepVariant, a new TensorFlow machine learning-based variant caller, and compared this tool to GATK 4.0 and SpeedSeq, using 30×, 15× and 10× WGS data of the well-known NA12878 DNA reference sample.According to our comparison, the performance on SNV calling was almost similar in 30× data, with all three variant callers reaching F-Scores (i.e. harmonic mean of recall and precision) equal to 0.98. In contrast, DeepVariant was more precise in indel calling than GATK and SpeedSeq, as demonstrated by F-Scores of 0.94, 0.90 and 0.84, respectively.We conclude that the DeepVariant tool has great potential and usefulness for analysis of WGS data in medical genetics.

Download Full-text

Estimating sequencing error rates using families

BioData Mining ◽

10.1186/s13040-021-00259-6 ◽

2021 ◽

Vol 14 (1) ◽

Author(s):

Kelley Paskov ◽

Jae-Yoon Jung ◽

Brianna Chrisman ◽

Nate T. Stockham ◽

Peter Washington ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Exome Sequencing ◽

Genome Sequencing ◽

Variant Calling ◽

Error Rates ◽

Sequencing Error ◽

Whole Genome ◽

Sequencing Data ◽

Sequencing Platform ◽

Whole Exome

Abstract Background As next-generation sequencing technologies make their way into the clinic, knowledge of their error rates is essential if they are to be used to guide patient care. However, sequencing platforms and variant-calling pipelines are continuously evolving, making it difficult to accurately quantify error rates for the particular combination of assay and software parameters used on each sample. Family data provide a unique opportunity for estimating sequencing error rates since it allows us to observe a fraction of sequencing errors as Mendelian errors in the family, which we can then use to produce genome-wide error estimates for each sample. Results We introduce a method that uses Mendelian errors in sequencing data to make highly granular per-sample estimates of precision and recall for any set of variant calls, regardless of sequencing platform or calling methodology. We validate the accuracy of our estimates using monozygotic twins, and we use a set of monozygotic quadruplets to show that our predictions closely match the consensus method. We demonstrate our method’s versatility by estimating sequencing error rates for whole genome sequencing, whole exome sequencing, and microarray datasets, and we highlight its sensitivity by quantifying performance increases between different versions of the GATK variant-calling pipeline. We then use our method to demonstrate that: 1) Sequencing error rates between samples in the same dataset can vary by over an order of magnitude. 2) Variant calling performance decreases substantially in low-complexity regions of the genome. 3) Variant calling performance in whole exome sequencing data decreases with distance from the nearest target region. 4) Variant calls from lymphoblastoid cell lines can be as accurate as those from whole blood. 5) Whole-genome sequencing can attain microarray-level precision and recall at disease-associated SNV sites. Conclusion Genotype datasets from families are powerful resources that can be used to make fine-grained estimates of sequencing error for any sequencing platform and variant-calling methodology.

Download Full-text

VPMBench: a test bench for variant prioritization methods

BMC Bioinformatics ◽

10.1186/s12859-021-04458-0 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Andreas Ruscheinski ◽

Anna Lena Reimler ◽

Roland Ewald ◽

Adelinde M. Uhrmacher

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Test Bench ◽

Clinical Diagnostics ◽

Tool Support ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Sequencing Data ◽

Variant Prioritization ◽

Whole Exome

Abstract Background Clinical diagnostics of whole-exome and whole-genome sequencing data requires geneticists to consider thousands of genetic variants for each patient. Various variant prioritization methods have been developed over the last years to aid clinicians in identifying variants that are likely disease-causing. Each time a new method is developed, its effectiveness must be evaluated and compared to other approaches based on the most recently available evaluation data. Doing so in an unbiased, systematic, and replicable manner requires significant effort. Results The open-source test bench “VPMBench” automates the evaluation of variant prioritization methods. VPMBench introduces a standardized interface for prioritization methods and provides a plugin system that makes it easy to evaluate new methods. It supports different input data formats and custom output data preparation. VPMBench exploits declaratively specified information about the methods, e.g., the variants supported by the methods. Plugins may also be provided in a technology-agnostic manner via containerization. Conclusions VPMBench significantly simplifies the evaluation of both custom and published variant prioritization methods. As we expect variant prioritization methods to become ever more critical with the advent of whole-genome sequencing in clinical diagnostics, such tool support is crucial to facilitate methodological research.

Download Full-text

HLA‐poll: An ensemble suite of human leukocyte antigen‐prediction tools for whole‐exome and whole‐genome sequencing data

International Journal of Laboratory Hematology ◽

10.1111/ijlh.13280 ◽

2020 ◽

Vol 42 (5) ◽

Author(s):

Xinming Zhuo ◽

Kevin Quann ◽

Kayla Parr ◽

Rachel Stewart ◽

Warren Shlomchik ◽

...

Keyword(s):

Human Leukocyte Antigen ◽

Whole Genome Sequencing ◽

Genome Sequencing ◽

Human Leukocyte ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Sequencing Data ◽

Leukocyte Antigen ◽

Prediction Tools ◽

Whole Exome

Download Full-text

A Comparison of Tools for Copy-Number Variation Detection in Germline Whole Exome and Whole Genome Sequencing Data

Cancers ◽

10.3390/cancers13246283 ◽

2021 ◽

Vol 13 (24) ◽

pp. 6283

Author(s):

Migle Gabrielaite ◽

Mathias Husted Torp ◽

Malthe Sebro Rasmussen ◽

Sergio Andreu-Sánchez ◽

Filipe Garrett Vieira ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Copy Number ◽

Reference Sample ◽

Snp Array ◽

Copy Number Variations ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Standard Reference Sample ◽

Whole Exome

Copy-number variations (CNVs) have important clinical implications for several diseases and cancers. Relevant CNVs are hard to detect because common structural variations define large parts of the human genome. CNV calling from short-read sequencing would allow single protocol full genomic profiling. We reviewed 50 popular CNV calling tools and included 11 tools for benchmarking in a reference cohort encompassing 39 whole genome sequencing (WGS) samples paired current clinical standard—SNP-array based CNV calling. Additionally, for nine samples we also performed whole exome sequencing (WES), to address the effect of sequencing protocol on CNV calling. Furthermore, we included Gold Standard reference sample NA12878, and tested 12 samples with CNVs confirmed by multiplex ligation-dependent probe amplification (MLPA). Tool performance varied greatly in the number of called CNVs and bias for CNV lengths. Some tools had near-perfect recall of CNVs from arrays for some samples, but poor precision. Several tools had better performance for NA12878, which could be a result of overfitting. We suggest combining the best tools also based on different methodologies: GATK gCNV, Lumpy, DELLY, and cn.MOPS. Reducing the total number of called variants could potentially be assisted by the use of background panels for filtering of frequently called variants.

Download Full-text

A comparison of tools for copy-number variation detection in germline whole exome and whole genome sequencing data

10.1101/2021.04.30.442110 ◽

2021 ◽

Author(s):

Migle Gabrielaite ◽

Mathias Husted Torp ◽

Sergio Andreu-Sánchez ◽

Filipe Garrett Vieira ◽

Christina Bligaard Pedersen ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Copy Number ◽

Reference Data ◽

Reference Sample ◽

Data Sets ◽

Whole Genome ◽

Sequencing Data ◽

Standard Reference Sample ◽

Whole Exome

Background: Copy-number variations (CNVs) have important clinical implications for several diseases and cancers. The clinically relevant CNVs are hard to detect because CNVs are common structural variations that define large parts of the normal human genome. CNV calling from short-read sequencing data has the potential to leverage available cohort studies and allow full genomic profiling in the clinic without the need for additional data modalities. Questions regarding performance of CNV calling tools for clinical use and suitable sequencing protocols remain poorly addressed, mainly because of the lack of good reference data sets. Methods: We reviewed 50 popular CNV calling tools and included 11 tools for benchmarking in a unique reference cohort encompassing 39 whole genome sequencing (WGS) samples paired with analysis by the current clinical standard—SNP-array based CNV calling. Additionally, for nine of these samples we performed whole exome sequencing (WES) performed, in order to address the effect of sequencing protocol on CNV calling. Furthermore, we included Gold Standard reference sample NA12878, and tested 12 samples with CNVs confirmed by multiplex ligation-dependent probe amplification (MLPA). Results: Tool performance varied greatly in the number of called CNVs and bias for CNV lengths. Some tools had near-perfect recall of CNVs from arrays for some samples, but poor precision. Filtering output by CNV ranks from tools did not salvage precision. Several tools had better performance patterns for NA12878, and we hypothesize that this is the result of overfitting during the tool development. Conclusions: We suggest combining tools with the best recall: GATK gCNV, Lumpy, DELLY, and cn.MOPS. These tools also capture different CNVs. Further improvements in precision requires additional development of tools, reference data sets, and annotation of CNVs, potentially assisted by the use of background panels for filtering of frequently called variants.

Download Full-text

Empirical evaluation of variant calling accuracy using ultra-deep whole-genome sequencing data

Scientific Reports ◽

10.1038/s41598-018-38346-0 ◽

2019 ◽

Vol 9 (1) ◽

Cited By ~ 12

Author(s):

Toshihiro Kishikawa ◽

Yukihide Momozawa ◽

Takeshi Ozeki ◽

Taisei Mushiroda ◽

Hidenori Inohara ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Empirical Evaluation ◽

Variant Calling ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Sequencing Data

Download Full-text

Retrospective evaluation of whole exome and genome mutation calls in 746 cancer samples

Nature Communications ◽

10.1038/s41467-020-18151-y ◽

2020 ◽

Vol 11 (1) ◽

Author(s):

Matthew H. Bailey ◽

◽

William U. Meyerson ◽

Lewis Jonathan Dursi ◽

Liang-Bo Wang ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Gc Content ◽

Cancer Genome ◽

The Cancer Genome Atlas ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Sequencing Data ◽

Whole Exome ◽

Cancer Genome Atlas

AbstractThe Cancer Genome Atlas (TCGA) and International Cancer Genome Consortium (ICGC) curated consensus somatic mutation calls using whole exome sequencing (WES) and whole genome sequencing (WGS), respectively. Here, as part of the ICGC/TCGA Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium, which aggregated whole genome sequencing data from 2,658 cancers across 38 tumour types, we compare WES and WGS side-by-side from 746 TCGA samples, finding that ~80% of mutations overlap in covered exonic regions. We estimate that low variant allele fraction (VAF < 15%) and clonal heterogeneity contribute up to 68% of private WGS mutations and 71% of private WES mutations. We observe that ~30% of private WGS mutations trace to mutations identified by a single variant caller in WES consensus efforts. WGS captures both ~50% more variation in exonic regions and un-observed mutations in loci with variable GC-content. Together, our analysis highlights technological divergences between two reproducible somatic variant detection efforts.

Download Full-text

Empirical design of a variant quality control pipeline for whole genome sequencing data using replicate discordance

Scientific Reports ◽

10.1038/s41598-019-52614-7 ◽

2019 ◽

Vol 9 (1) ◽

Cited By ~ 3

Author(s):

Robert P. Adelson ◽

Alan E. Renton ◽

Wentian Li ◽

Nir Barzilai ◽

Gil Atzmon ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Variant Calling ◽

Read Depth ◽

Concordance Rate ◽

Whole Genome Sequencing Data ◽

Analysis Tool ◽

Whole Genome ◽

Sequencing Data ◽

Genome Wide

Abstract The success of next-generation sequencing depends on the accuracy of variant calls. Few objective protocols exist for QC following variant calling from whole genome sequencing (WGS) data. After applying QC filtering based on Genome Analysis Tool Kit (GATK) best practices, we used genotype discordance of eight samples that were sequenced twice each to evaluate the proportion of potentially inaccurate variant calls. We designed a QC pipeline involving hard filters to improve replicate genotype concordance, which indicates improved accuracy of genotype calls. Our pipeline analyzes the efficacy of each filtering step. We initially applied this strategy to well-characterized variants from the ClinVar database, and subsequently to the full WGS dataset. The genome-wide biallelic pipeline removed 82.11% of discordant and 14.89% of concordant genotypes, and improved the concordance rate from 98.53% to 99.69%. The variant-level read depth filter most improved the genome-wide biallelic concordance rate. We also adapted this pipeline for triallelic sites, given the increasing proportion of multiallelic sites as sample sizes increase. For triallelic sites containing only SNVs, the concordance rate improved from 97.68% to 99.80%. Our QC pipeline removes many potentially false positive calls that pass in GATK, and may inform future WGS studies prior to variant effect analysis.

Download Full-text