Evaluating the accuracy of variant calling methods using the frequency of parent-offspring genotype mismatch

Author(s):  
Russ Jasper ◽  
Tegan Krista McDonald ◽  
Pooja Singh ◽  
Mengmeng Lu ◽  
Clément Rougeux ◽  
...  

The use of NGS datasets has increased dramatically over the last decade; however, there have been few systematic analyses quantifying the accuracy of the commonly used variant caller programs. Here we used a familial design consisting of diploid tissue from a single Pinus contorta parent and the maternally derived haploid tissue from 106 full-sibling offspring, where mismatches could only arise due to mutation or bioinformatic error. Given the rarity of mutation, we used the rate of mismatches between parent and offspring genotype calls to infer the SNP genotyping error rates of FreeBayes, HaplotypeCaller, SAMtools, UnifiedGenotyper, and VarScan. With baseline filtering, HaplotypeCaller and UnifiedGenotyper yielded one to two orders of magnitude larger numbers of SNPs and error rates, whereas FreeBayes, SAMtools, and VarScan yielded lower numbers of SNPs and more modest error rates. To facilitate comparison between variant callers, we standardized each SNP set to the same number of SNPs using additional filtering; under this standardization, UnifiedGenotyper consistently produced the smallest proportion of genotype errors, followed by HaplotypeCaller, VarScan, SAMtools, and FreeBayes. Additionally, we found that error rates were minimized for SNPs called by more than one variant caller. Finally, we evaluated the performance of various commonly used filtering metrics on SNP calling. Our analysis provides a quantitative assessment of the accuracy of five widely used variant calling programs and offers valuable insights into both the choice of variant caller program and the choice of filtering metrics, especially for researchers using non-model study systems.
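The mismatch-based error estimate described in this abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the data layout (dictionaries keyed by a site identifier), the function name `mismatch_rate`, and the toy genotypes are all assumptions made for illustration. A haploid offspring call counts as a mismatch when its allele is absent from the diploid parent genotype, and missing calls are skipped.

```python
from typing import Dict, List, Optional, Tuple

Site = str  # hypothetical site identifier, e.g. "contig1:1042"


def mismatch_rate(
    parent: Dict[Site, Tuple[str, str]],
    offspring: Dict[Site, List[Optional[str]]],
) -> float:
    """Fraction of haploid offspring calls whose allele is absent
    from the diploid parent genotype at the same site."""
    mismatches = 0
    comparisons = 0
    for site, parent_genotype in parent.items():
        for allele in offspring.get(site, []):
            if allele is None:  # missing call: not counted as a comparison
                continue
            comparisons += 1
            if allele not in parent_genotype:
                mismatches += 1
    if comparisons == 0:
        raise ValueError("no comparable genotype calls")
    return mismatches / comparisons


# Toy data (invented for illustration): one heterozygous and one homozygous site.
parent_calls = {"s1": ("A", "G"), "s2": ("C", "C")}
offspring_calls = {
    "s1": ["A", "G", "T", None],  # "T" is not a parental allele
    "s2": ["C", "C", "C", "T"],   # "T" is not a parental allele
}

rate = mismatch_rate(parent_calls, offspring_calls)
print(f"mismatch rate: {rate:.3f}")
```

Note that a count of this kind only detects errors that produce an allele the parent does not carry; an offspring call that is wrong but still matches one of the parental alleles goes unobserved.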


2021 ◽  
Vol 8 (Supplement_1) ◽  
pp. S497-S498
Author(s):  
Mohamad Sater ◽  
Remy Schwab ◽  
Ian Herriott ◽  
Tim Farrell ◽  
Miriam Huntley

Abstract

Background: Healthcare associated infections (HAIs) are a major contributor to patient morbidity and mortality worldwide. HAIs are increasingly important due to the rise of multidrug resistant pathogens which can lead to deadly nosocomial outbreaks. Current methods for investigating transmissions are slow, costly, or have poor detection resolution. A rapid, cost-effective and high-resolution method to identify transmission events is imperative to guide infection control. Whole genome sequencing of infecting pathogens paired with a single nucleotide polymorphism (SNP) analysis can provide high-resolution clonality determination, yet these methods typically have long turnaround times. Here we examined the utility of the Oxford Nanopore Technologies (ONT) platform, a rapid sequencing technology, for whole genome sequencing based transmission analysis.

Methods: We developed a SNP calling pipeline customized for ONT data, which exhibit higher sequencing error rates and can therefore be challenging for transmission analysis. The pipeline leverages the latest basecalling tools as well as a suite of custom variant calling and filtering algorithms to achieve highest accuracy in clonality calls compared to Illumina-based sequencing. We also capitalize on ONT long reads by assembling outbreak-specific genomes in order to overcome the need for an external reference genome.

Results: We examined 20 bacterial isolates from 5 HAI investigations previously performed at Day Zero Diagnostics as part of epiXact®, our commercialized Illumina-based HAI sequencing and analysis service. Using the ONT data and pipeline, we achieved greater than 90% SNP-calling sensitivity and precision, allowing 100% accuracy of clonality classification compared to Illumina-based results across common HAI species. We demonstrate the validity and increased resolution of our SNP analysis pipeline using assembled genomes from each outbreak. We also demonstrate that this ONT-based workflow can produce isolate to transmission determination (i.e. including WGS and analysis) in less than 24 hours.

[Figure: SNP calling performance. ONT-based SNP calling sensitivity and precision compared to Illumina-based pipeline.]

Conclusion: We demonstrate the utility of ONT for HAI investigation, establishing the potential to transform healthcare epidemiology with same-day high-resolution transmission determination.

Disclosures: Mohamad Sater, PhD, Day Zero Diagnostics (Employee, Shareholder); Remy Schwab, MSc, Day Zero Diagnostics (Employee, Shareholder); Ian Herriott, BS, Day Zero Diagnostics (Employee, Shareholder); Tim Farrell, MS, Day Zero Diagnostics, Inc. (Employee, Shareholder); Miriam Huntley, PhD, Day Zero Diagnostics (Employee, Shareholder)
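As a rough illustration of how SNP-calling sensitivity and precision against an Illumina-based call set could be computed, the sketch below treats the Illumina calls as the reference and compares sets of variants. The representation of a SNP as a `(position, ref, alt)` tuple, the function name `precision_sensitivity`, and the toy call sets are assumptions for illustration only; the abstract does not describe the pipeline's internals.

```python
from typing import Set, Tuple

Snp = Tuple[int, str, str]  # hypothetical representation: (position, ref allele, alt allele)


def precision_sensitivity(test_calls: Set[Snp], reference_calls: Set[Snp]) -> Tuple[float, float]:
    """Precision and sensitivity of test_calls, treating reference_calls as truth."""
    true_pos = len(test_calls & reference_calls)    # called in both sets
    false_pos = len(test_calls - reference_calls)   # called only in the test set
    false_neg = len(reference_calls - test_calls)   # present only in the reference set
    precision = true_pos / (true_pos + false_pos) if test_calls else float("nan")
    sensitivity = true_pos / (true_pos + false_neg) if reference_calls else float("nan")
    return precision, sensitivity


# Toy call sets (invented for illustration).
illumina_calls = {(100, "A", "G"), (250, "C", "T"), (400, "G", "A"), (900, "T", "C")}
ont_calls = {(100, "A", "G"), (250, "C", "T"), (400, "G", "A"), (777, "A", "C")}

precision, sensitivity = precision_sensitivity(ont_calls, illumina_calls)
print(f"precision={precision:.2f} sensitivity={sensitivity:.2f}")
```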


2021 ◽  
pp. gr.275579.121
Author(s):  
Daniel P Cooke ◽  
David C Wedge ◽  
Gerton Lunter

Genotyping from sequencing is the basis of emerging strategies in the molecular breeding of polyploid plants. However, compared with the situation for diploids, where genotyping accuracies are confidently determined with comprehensive benchmarks, polyploids have been neglected; there are no benchmarks measuring genotyping error rates for small variants using real sequencing reads. We previously introduced a variant calling method - Octopus - that accurately calls germline variants in diploids and somatic mutations in tumors. Here, we evaluate Octopus and other popular tools on whole-genome tetraploid and hexaploid datasets created using in silico mixtures of diploid Genome In a Bottle (GIAB) samples. We find that genotyping errors are abundant for typical sequencing depths, but that Octopus makes 25% fewer errors than other methods on average. We supplement our benchmarks with concordance analysis in real autotriploid banana datasets.


2021 ◽  
Author(s):  
Daniel P Cooke ◽  
David C Wedge ◽  
Gerton Lunter

Genotyping from sequencing is the basis of emerging strategies in the molecular breeding of polyploid plants. However, compared with the situation for diploids, where genotyping accuracies are confidently determined with comprehensive benchmarks, polyploids have been neglected; there are no benchmarks measuring genotyping error rates for small variants using real sequencing reads. We previously introduced a variant calling method – Octopus – that accurately calls germline variants in diploids and somatic mutations in tumors. Here, we evaluate Octopus and other popular tools on whole-genome tetraploid and hexaploid datasets created using in silico mixtures of diploid Genome In a Bottle samples. We find that genotyping errors are abundant for typical sequencing depths, but that Octopus makes 25% fewer errors than other methods on average. We supplement our benchmarks with concordance analysis in real autotriploid banana datasets.


2021 ◽  
Vol 14 (1) ◽  
Author(s):  
Kelley Paskov ◽  
Jae-Yoon Jung ◽  
Brianna Chrisman ◽  
Nate T. Stockham ◽  
Peter Washington ◽  
...  

Abstract

Background: As next-generation sequencing technologies make their way into the clinic, knowledge of their error rates is essential if they are to be used to guide patient care. However, sequencing platforms and variant-calling pipelines are continuously evolving, making it difficult to accurately quantify error rates for the particular combination of assay and software parameters used on each sample. Family data provide a unique opportunity for estimating sequencing error rates, since they allow us to observe a fraction of sequencing errors as Mendelian errors in the family, which we can then use to produce genome-wide error estimates for each sample.

Results: We introduce a method that uses Mendelian errors in sequencing data to make highly granular per-sample estimates of precision and recall for any set of variant calls, regardless of sequencing platform or calling methodology. We validate the accuracy of our estimates using monozygotic twins, and we use a set of monozygotic quadruplets to show that our predictions closely match the consensus method. We demonstrate our method’s versatility by estimating sequencing error rates for whole genome sequencing, whole exome sequencing, and microarray datasets, and we highlight its sensitivity by quantifying performance increases between different versions of the GATK variant-calling pipeline. We then use our method to demonstrate that:
1) Sequencing error rates between samples in the same dataset can vary by over an order of magnitude.
2) Variant calling performance decreases substantially in low-complexity regions of the genome.
3) Variant calling performance in whole exome sequencing data decreases with distance from the nearest target region.
4) Variant calls from lymphoblastoid cell lines can be as accurate as those from whole blood.
5) Whole-genome sequencing can attain microarray-level precision and recall at disease-associated SNV sites.

Conclusion: Genotype datasets from families are powerful resources that can be used to make fine-grained estimates of sequencing error for any sequencing platform and variant-calling methodology.
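To make the notion of a Mendelian error concrete, the following sketch checks diploid genotypes in a mother-father-child trio and reports the fraction of sites that violate Mendelian inheritance. It is only a simple counting example under assumed conditions (autosomal sites, unphased diploid genotypes, no missing calls); it is not the authors' estimator of per-sample precision and recall, and the names `is_mendelian_consistent` and `mendelian_error_rate` are hypothetical.

```python
from typing import Iterable, Tuple

Genotype = Tuple[str, str]  # unphased diploid genotype, e.g. ("A", "G")


def is_mendelian_consistent(child: Genotype, mother: Genotype, father: Genotype) -> bool:
    """True if the child could have inherited one allele from each parent."""
    a, b = child
    return (a in mother and b in father) or (b in mother and a in father)


def mendelian_error_rate(trios: Iterable[Tuple[Genotype, Genotype, Genotype]]) -> float:
    """Fraction of (child, mother, father) genotype triples that are Mendelian errors."""
    trio_list = list(trios)
    if not trio_list:
        raise ValueError("no sites to evaluate")
    errors = sum(not is_mendelian_consistent(c, m, f) for c, m, f in trio_list)
    return errors / len(trio_list)


# Toy sites (invented for illustration), each given as (child, mother, father).
sites = [
    (("A", "G"), ("A", "A"), ("A", "G")),  # consistent
    (("G", "G"), ("A", "A"), ("A", "G")),  # error: mother carries no "G"
    (("C", "T"), ("C", "C"), ("T", "T")),  # consistent
    (("C", "C"), ("C", "T"), ("T", "T")),  # error: father carries no "C"
]

print(f"Mendelian error rate: {mendelian_error_rate(sites):.2f}")
```

Only some genotyping errors surface this way, which matches the abstract's point that a fraction of sequencing errors is observable as Mendelian errors.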


2015 ◽  
Author(s):  
Ivan Sovic ◽  
Mile Sikic ◽  
Andreas Wilm ◽  
Shannon Nicole Fenlon ◽  
Swaine Chen ◽  
...  

Exploiting the power of nanopore sequencing requires the development of new bioinformatics approaches to deal with its specific error characteristics. We present the first nanopore read mapper (GraphMap) that uses a read-funneling paradigm to robustly handle variable error rates and fast graph traversal to align long reads with speed and very high precision (>95%). Evaluation on MinION sequencing datasets against short and long-read mappers indicates that GraphMap increases mapping sensitivity by at least 15-80%. GraphMap alignments are the first to demonstrate consensus calling with <1 error in 100,000 bases, variant calling on the human genome with 76% improvement in sensitivity over the next best mapper (BWA-MEM), precise detection of structural variants from 100bp to 4kbp in length and species and strain-specific identification of pathogens using MinION reads. GraphMap is available open source under the MIT license at https://github.com/isovic/graphmap.


Genomics ◽  
2007 ◽  
Vol 90 (3) ◽  
pp. 291-296 ◽  
Author(s):  
Ian W. Saunders ◽  
Jesper Brohede ◽  
Garry N. Hannan

2015 ◽  
Author(s):  
Thomas F Cooke ◽  
Muh-Ching Yee ◽  
Marina Muzzio ◽  
Alexandra Sockell ◽  
Ryan Bell ◽  
...  

Reduced representation sequencing methods such as genotyping-by-sequencing (GBS) enable low-cost measurement of genetic variation without the need for a reference genome assembly. These methods are widely used in genetic mapping and population genetics studies, especially with non-model organisms. Variant calling error rates, however, are higher in GBS than in standard sequencing, in particular due to restriction site polymorphisms, and few computational tools exist that specifically model and correct these errors. We developed a statistical method to remove errors caused by restriction site polymorphisms, implemented in the software package GBStools. We evaluated it in several simulated data sets, varying in number of samples, mean coverage and population mutation rate, and in two empirical human data sets (N = 8 and N = 63 samples). In our simulations, GBStools improved genotype accuracy more than commonly used filters such as Hardy-Weinberg equilibrium p-values. GBStools is most effective at removing genotype errors in data sets over 100 samples when coverage is 40X or higher, and the improvement is most pronounced in species with high genomic diversity. We also demonstrate the utility of GBS and GBStools for human population genetic inference in Argentine populations and reveal widely varying individual ancestry proportions and an excess of singletons, consistent with recent population growth.


2015 ◽  
Vol 795 ◽  
pp. 16-23
Author(s):  
Dagmar Caganova ◽  
Ivan Szilva ◽  
Manan Bawa

This paper aims to apply a modified knowledge management model as a knowledge management tool that decreases downtimes and error rates and enhances knowledge transfer within a company. The research involved a quantitative assessment of knowledge management success, using purpose-designed methodologies together with data collected from the quality control department. The result of the research was a measurement of the success of the knowledge management tool.

