Benchmarking small-variant genotyping in polyploids

2021 ◽  
pp. gr.275579.121
Author(s):  
Daniel P Cooke ◽  
David C Wedge ◽  
Gerton Lunter

Genotyping from sequencing is the basis of emerging strategies in the molecular breeding of polyploid plants. However, compared with the situation for diploids, where genotyping accuracies are confidently determined with comprehensive benchmarks, polyploids have been neglected; there are no benchmarks measuring genotyping error rates for small variants using real sequencing reads. We previously introduced a variant calling method, Octopus, that accurately calls germline variants in diploids and somatic mutations in tumors. Here, we evaluate Octopus and other popular tools on whole-genome tetraploid and hexaploid datasets created using in silico mixtures of diploid Genome In a Bottle (GIAB) samples. We find that genotyping errors are abundant for typical sequencing depths, but that Octopus makes 25% fewer errors than other methods on average. We supplement our benchmarks with concordance analysis in real autotriploid banana datasets.
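As a rough illustration of why an in silico mixture gives a known polyploid truth set, the sketch below merges two diploid genotypes at a single site into a tetraploid genotype and checks a call against it. The function names and the tuple-of-alleles representation are assumptions made for this example; they are not taken from the benchmark's own code.

```python
from collections import Counter


def mix_genotypes(*diploid_genotypes):
    """Merge per-sample diploid genotypes at one site into one polyploid genotype.

    Each genotype is a tuple of allele strings, e.g. ("A", "G").
    The result is the sorted multiset union of all alleles (hypothetical helper).
    """
    alleles = [allele for genotype in diploid_genotypes for allele in genotype]
    return tuple(sorted(alleles))


def genotype_matches(called, truth):
    """A polyploid call counts as correct only if every allele copy number agrees."""
    return Counter(called) == Counter(truth)


# Two diploid samples mixed into a tetraploid "truth" genotype.
truth = mix_genotypes(("A", "G"), ("A", "A"))
print(truth)

# Right alleles, wrong dosage: this is still a genotyping error.
print(genotype_matches(("A", "A", "G", "G"), truth))
```

Comparing allele copy numbers rather than allele sets matters in polyploids, because a caller can find the variant yet still report the wrong dosage.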


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Luciano Calderón ◽  
Nuria Mauri ◽  
Claudio Muñoz ◽  
Pablo Carbonell-Bejerano ◽  
Laura Bree ◽  
...  

Abstract Grapevine cultivars are clonally propagated to preserve their varietal attributes. However, genetic variations accumulate due to the occurrence of somatic mutations. This process is influenced by human activity through plant transportation, clonal propagation and selection. Malbec is a cultivar that is highly valued for red wine production. It originated in Southwestern France and was introduced in Argentina during the 1850s. In order to study the clonal genetic diversity of Malbec grapevines, we generated whole-genome resequencing data for four accessions with different clonal propagation records. A stringent variant calling procedure was established to identify reliable polymorphisms among the analyzed accessions. This procedure retrieved 941 single nucleotide variants (SNVs). A reduced set of the detected SNVs was corroborated through Sanger sequencing, and employed to custom-design a genotyping experiment. We successfully genotyped 214 Malbec accessions using 41 SNVs, and identified 14 genotypes that clustered in two genetically divergent clonal lineages. These lineages were associated with the time span of clonal propagation of the analyzed accessions in Argentina and Europe. Our results show the usefulness of this approach for the study of the scarce intra-cultivar genetic diversity in grapevines. We also provide evidence on how human actions might have driven the accumulation of different somatic mutations, ultimately shaping the Malbec genetic diversity pattern.


2021 ◽  
Vol 14 (1) ◽  
Author(s):  
Kelley Paskov ◽  
Jae-Yoon Jung ◽  
Brianna Chrisman ◽  
Nate T. Stockham ◽  
Peter Washington ◽  
...  

Abstract Background As next-generation sequencing technologies make their way into the clinic, knowledge of their error rates is essential if they are to be used to guide patient care. However, sequencing platforms and variant-calling pipelines are continuously evolving, making it difficult to accurately quantify error rates for the particular combination of assay and software parameters used on each sample. Family data provide a unique opportunity for estimating sequencing error rates since they allow us to observe a fraction of sequencing errors as Mendelian errors in the family, which we can then use to produce genome-wide error estimates for each sample. Results We introduce a method that uses Mendelian errors in sequencing data to make highly granular per-sample estimates of precision and recall for any set of variant calls, regardless of sequencing platform or calling methodology. We validate the accuracy of our estimates using monozygotic twins, and we use a set of monozygotic quadruplets to show that our predictions closely match the consensus method. We demonstrate our method’s versatility by estimating sequencing error rates for whole genome sequencing, whole exome sequencing, and microarray datasets, and we highlight its sensitivity by quantifying performance increases between different versions of the GATK variant-calling pipeline. We then use our method to demonstrate that: 1) Sequencing error rates between samples in the same dataset can vary by over an order of magnitude. 2) Variant calling performance decreases substantially in low-complexity regions of the genome. 3) Variant calling performance in whole exome sequencing data decreases with distance from the nearest target region. 4) Variant calls from lymphoblastoid cell lines can be as accurate as those from whole blood. 5) Whole-genome sequencing can attain microarray-level precision and recall at disease-associated SNV sites.
Conclusion Genotype datasets from families are powerful resources that can be used to make fine-grained estimates of sequencing error for any sequencing platform and variant-calling methodology.
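To make the notion of a Mendelian error concrete, here is a minimal sketch that flags trio genotypes that cannot arise from one allele per parent and reports the raw fraction of such sites. It only counts directly observable inconsistencies; it is not the authors' estimator, which extrapolates from these to genome-wide precision and recall. The function names and the integer allele encoding (0 = reference, 1 = alternate) are assumptions for illustration.

```python
from itertools import product


def is_mendelian_consistent(mother, father, child):
    """True if the child's unordered diploid genotype can be formed by
    taking one allele from the mother and one from the father."""
    child_sorted = tuple(sorted(child))
    return any(
        tuple(sorted((m, f))) == child_sorted
        for m, f in product(mother, father)
    )


def observed_mendelian_error_rate(trio_sites):
    """Fraction of sites whose (mother, father, child) genotypes are inconsistent."""
    trio_sites = list(trio_sites)
    if not trio_sites:
        return 0.0
    errors = sum(not is_mendelian_consistent(*site) for site in trio_sites)
    return errors / len(trio_sites)


# Hypothetical calls at four sites: (mother, father, child).
sites = [
    ((0, 0), (0, 0), (0, 1)),  # child carries an allele neither parent has
    ((0, 1), (0, 1), (1, 1)),  # consistent
    ((0, 0), (1, 1), (0, 1)),  # consistent
    ((0, 0), (1, 1), (0, 0)),  # father can only transmit allele 1
]
print(observed_mendelian_error_rate(sites))
```

Many genotyping errors remain Mendelian-consistent (for example, a miscalled heterozygote in a child of two heterozygous parents), which is why the observed rate is only a lower bound on the true error rate.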


Author(s):  
Russ Jasper ◽  
Tegan Krista McDonald ◽  
Pooja Singh ◽  
Mengmeng Lu ◽  
Clément Rougeux ◽  
...  

The use of NGS datasets has increased dramatically over the last decade; however, there have been few systematic analyses quantifying the accuracy of the commonly used variant caller programs. Here we used a familial design consisting of diploid tissue from a single Pinus contorta parent and the maternally derived haploid tissue from 106 full-sibling offspring, where mismatches could only arise due to mutation or bioinformatic error. Given the rarity of mutation, we used the rate of mismatches between parent and offspring genotype calls to infer the SNP genotyping error rates of FreeBayes, HaplotypeCaller, SAMtools, UnifiedGenotyper, and VarScan. With baseline filtering, HaplotypeCaller and UnifiedGenotyper yielded one to two orders of magnitude larger numbers of SNPs and error rates, whereas FreeBayes, SAMtools and VarScan yielded lower numbers of SNPs and more modest error rates. To facilitate comparison between variant callers we standardized each SNP set to the same number of SNPs using additional filtering, where UnifiedGenotyper consistently produced the smallest proportion of genotype errors, followed by HaplotypeCaller, VarScan, SAMtools, and FreeBayes. Additionally, we found that error rates were minimized for SNPs called by more than one variant caller. Finally, we evaluated the performance of various commonly used filtering metrics on SNP calling. Our analysis provides a quantitative assessment of the accuracy of five widely used variant calling programs and offers valuable insights into both the choice of variant caller program and the choice of filtering metrics, especially for researchers using non-model study systems.
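The parent-offspring design described above can be sketched as follows: a haploid offspring allele that is absent from the diploid parent's genotype at the same site counts as a mismatch. This is a simplified illustration under assumed data structures (dictionaries keyed by site); the function name is hypothetical and the study's actual filtering and counting rules are not reproduced here.

```python
import math


def mismatch_rate(parent_genotypes, offspring_alleles):
    """Share of compared sites where the haploid offspring allele is not
    one of the diploid parent's two alleles.

    parent_genotypes: dict mapping site -> tuple of two alleles
    offspring_alleles: dict mapping site -> single allele, or None if uncalled
    Sites missing in either sample are skipped.
    """
    compared = 0
    mismatches = 0
    for site, allele in offspring_alleles.items():
        parent = parent_genotypes.get(site)
        if allele is None or parent is None:
            continue
        compared += 1
        if allele not in parent:
            mismatches += 1
    return mismatches / compared if compared else math.nan


# Hypothetical calls for one parent and one haploid offspring sample.
parent = {"site1": ("A", "A"), "site2": ("C", "T"), "site3": ("G", "G")}
offspring = {"site1": "A", "site2": "T", "site3": "A", "site4": "C"}
print(mismatch_rate(parent, offspring))
```

Because true mutations are rare, a mismatch rate computed this way is dominated by genotyping error in either the parent or the offspring call.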


2021 ◽  
Vol 8 (Supplement_1) ◽  
pp. S497-S498
Author(s):  
Mohamad Sater ◽  
Remy Schwab ◽  
Ian Herriott ◽  
Tim Farrell ◽  
Miriam Huntley

Abstract Background Healthcare associated infections (HAIs) are a major contributor to patient morbidity and mortality worldwide. HAIs are increasingly important due to the rise of multidrug resistant pathogens which can lead to deadly nosocomial outbreaks. Current methods for investigating transmissions are slow, costly, or have poor detection resolution. A rapid, cost-effective and high-resolution method to identify transmission events is imperative to guide infection control. Whole genome sequencing of infecting pathogens paired with a single nucleotide polymorphism (SNP) analysis can provide high-resolution clonality determination, yet these methods typically have long turnaround times. Here we examined the utility of the Oxford Nanopore Technologies (ONT) platform, a rapid sequencing technology, for whole genome sequencing based transmission analysis. Methods We developed a SNP calling pipeline customized for ONT data, which exhibit higher sequencing error rates and can therefore be challenging for transmission analysis. The pipeline leverages the latest basecalling tools as well as a suite of custom variant calling and filtering algorithms to achieve highest accuracy in clonality calls compared to Illumina-based sequencing. We also capitalize on ONT long reads by assembling outbreak-specific genomes in order to overcome the need for an external reference genome. Results We examined 20 bacterial isolates from 5 HAI investigations previously performed at Day Zero Diagnostics as part of epiXact®, our commercialized Illumina-based HAI sequencing and analysis service. Using the ONT data and pipeline, we achieved greater than 90% SNP-calling sensitivity and precision, allowing 100% accuracy of clonality classification compared to Illumina-based results across common HAI species. We demonstrate the validity and increased resolution of our SNP analysis pipeline using assembled genomes from each outbreak. 
We also demonstrate that this ONT-based workflow can produce isolate-to-transmission determination (i.e. including WGS and analysis) in less than 24 hours. [Figure: SNP calling performance. ONT-based SNP calling sensitivity and precision compared to the Illumina-based pipeline.] Conclusion We demonstrate the utility of ONT for HAI investigation, establishing the potential to transform healthcare epidemiology with same-day high-resolution transmission determination. Disclosures Mohamad Sater, PhD, Day Zero Diagnostics (Employee, Shareholder) Remy Schwab, MSc, Day Zero Diagnostics (Employee, Shareholder) Ian Herriott, BS, Day Zero Diagnostics (Employee, Shareholder) Tim Farrell, MS, Day Zero Diagnostics, Inc. (Employee, Shareholder) Miriam Huntley, PhD, Day Zero Diagnostics (Employee, Shareholder)


2019 ◽  
Author(s):  
Luisa Bresadola ◽  
Vivian Link ◽  
C. Alex Buerkle ◽  
Christian Lexer ◽  
Daniel Wegmann

Abstract In non-model organisms, evolutionary questions are frequently addressed using reduced representation sequencing techniques due to their low cost, ease of use, and because they do not require genomic resources such as a reference genome. However, evidence is accumulating that such techniques may be affected by specific biases, calling into question the accuracy of obtained genotypes and, as a consequence, their usefulness in evolutionary studies. Here we introduce three strategies to estimate genotyping error rates from such data: through the comparison to high quality genotypes obtained with a different technique, from individual replicates, or from a population sample when assuming Hardy-Weinberg equilibrium. Applying these strategies to data obtained with Restriction site Associated DNA sequencing (RAD-seq), arguably the most popular reduced representation sequencing technique, revealed per-allele genotyping error rates that were much higher than sequencing error rates, particularly at heterozygous sites that were wrongly inferred as homozygous. As we exemplify through the inference of genome-wide and local ancestry of well characterized hybrids of two Eurasian poplar (Populus) species, such high error rates may lead to wrong biological conclusions. By properly accounting for these error rates in downstream analyses, either by incorporating genotyping errors directly or by recalibrating genotype likelihoods, we were nevertheless able to use the RAD-seq data to support biologically meaningful and robust inferences of ancestry among Populus hybrids. Based on these findings, we strongly recommend carefully assessing genotyping error rates in reduced representation sequencing experiments and properly accounting for these in downstream analyses, for instance using the tools presented here.


Blood ◽  
2012 ◽  
Vol 120 (21) ◽  
pp. 1383-1383
Author(s):  
Adam Burns ◽  
Ruth Clifford ◽  
Helene Dreau ◽  
Chris S.R Hatton ◽  
Shirley Henderson ◽  
...  

Abstract 1383 Explorative genome-wide next-generation sequencing of leukaemias and lymphomas has revealed a wide spectrum of acquired mutations and considerable tumour heterogeneity that might be responsible for disease initiation, resistance to treatments and relapse. There is, therefore, a clinical need to identify these genetic abnormalities in a diagnostic setting. Here, we present the development and validation of a targeted next generation mutation analysis tool. To compare the distribution pattern of genetic abnormalities in chronic lymphocytic leukemia (CLL), we performed targeted deep sequencing on CLL samples using a TruSeq custom designed targeted amplicon assay (TSCA, Illumina). We reveal differential mutation distribution patterns depending on clinical CLL subgroups. The TSCA panel was designed to amplify 21 genes (table 1) with known or suspected links to either the development of CLL or as response predictors, including TP53, SF3B1 (Puente, Nature, 2011; Quesada et al, 2012) and NOTCH1 (Rossi, Blood, 2012). Where genes have known mutational hotspots in CLL, only these regions were included in our panel, for example exons 5–8 of TP53. For genes such as MAP2K1, where mutations are distributed throughout the coding region, every exon was targeted. In total, we were able to design an amplicon panel able to cover 99% of our desired 36,035 bp target region.
Table 1. List of genes included in CLL custom amplicon panel: ASXL1, ATM, CHD2, DDX3X, FBXW7, HMCN1, IRF4, KLHL6, LRP1B, MAP2K1, MAPK1, MED12, NOTCH1, PCLO, POT1, SAMHD1, SF3B1, TP53, XPO1, ZFPM2, ZMYM3.
In order to validate our approach, we used samples previously subjected to whole genome sequencing as controls. Of the 13 individual mutations in the control cohort, we were successfully able to detect 10 (77%) with our custom assay to an average depth of 1380x.
A 19 bp deletion in TP53 failed to be picked up by the variant calling software, and 2 point mutations in ATM were not detected due to the targeted nature of the assay. There was a single false positive mutation across all samples in ZFPM2, caused by a sequencing error in a homopolymer region. The sample group consisted of 45 representative CLL cases, split into two cohorts. The first cohort consisted of 11 cases that have yet to receive any treatment, whilst the second cohort comprised 34 relapsed/refractory cases. Analysis of further samples is in progress. We performed library preparation according to the manufacturer's instructions. Each sample was dual indexed with two 8 bp “barcodes” prior to equimolar pooling, and the final pooled library was processed on an Illumina MiSeq instrument using the TruSeq 2×150 bp paired end sequencing protocol. The run produced 1.6 Gb of passed filter sequence data, with 92.8% of bases above the quality threshold of Q30. The average depth of coverage across all samples was 849x. Primary analysis of the sequencing data was performed using the cloud based data analysis package from Illumina, which carried out the alignment and variant calling. A conservative quality score threshold of >99 was set, with all variants above this carried forward for further analysis. Our custom amplicon panel detected mutations in 35 of the samples, comprising 8 indels and 45 point mutations. Of the 54 mutations, 40 were missense, 8 were frame-shifts, 1 was a nonsense mutation and 5 are predicted to have functional effects on splicing domains. The most frequently mutated gene was TP53, followed by SF3B1, PCLO and NOTCH1 (figure 1). Fig 1. Frequency of genes with somatic mutations in our CLL cohort.
Importantly, there was good correlation between mutation allele frequencies from whole genome sequencing, targeted deep sequencing and TSCA, demonstrating that the high sensitivity of large-scale genome sequencers can be reliably applied in a diagnostic setting. We describe mutation hotspots and mutation distribution patterns and link them to clinical behaviour. For example: SF3B1 mutations occurred in 15% of patients and were linked to reduced progression free survival. In conclusion, our technique allows for rapid mutation detection of the most frequently mutated genes in CLL. Further refinements in amplicon design and variant calling will lead to added precision. TSCA design and validation for other haematological diseases is in progress. Disclosures: No relevant conflicts of interest to declare.


2021 ◽  
Vol 9 (8) ◽  
pp. 1585
Author(s):  
Ana C. Reis ◽  
Liliana C. M. Salvador ◽  
Suelee Robbe-Austerman ◽  
Rogério Tenreiro ◽  
Ana Botelho ◽  
...  

Classical molecular analyses of Mycobacterium bovis based on spoligotyping and Variable Number Tandem Repeat (MIRU-VNTR) brought the first insights into the epidemiology of animal tuberculosis (TB) in Portugal, showing high genotypic diversity of circulating strains that mostly cluster within the European 2 clonal complex. Previous surveillance provided valuable information on the prevalence and spatial occurrence of TB and highlighted prevalent genotypes in areas where livestock and wild ungulates are sympatric. However, links at the wildlife–livestock interfaces were established mainly via classical genotype associations. Here, we apply whole genome sequencing (WGS) to cattle, red deer and wild boar isolates to reconstruct the M. bovis population structure in a multi-host, multi-region disease system and to explore links at a fine genomic scale between M. bovis from wildlife hosts and cattle. Whole genome sequences of 44 representative M. bovis isolates, obtained between 2003 and 2015 from three TB hotspots, were compared through single nucleotide polymorphism (SNP) variant calling analyses. Consistent with previous results combining classical genotyping with Bayesian population admixture modelling, SNP-based phylogenies support the branching of this M. bovis population into five genetic clades, three with apparent geographic specificities, as well as the establishment of an SNP catalogue specific to each clade, which may be explored in the future as phylogenetic markers. The core genome alignment of SNPs was integrated within a spatiotemporal metadata framework to further structure this M. bovis population by host species and TB hotspots, providing a baseline for network analyses in different epidemiological and disease control contexts. WGS of M. bovis isolates from Portugal is reported for the first time in this pilot study, refining the spatiotemporal context of TB at the wildlife–livestock interface and providing further support to the key role of red deer and wild boar on disease maintenance. The SNP diversity observed within this dataset supports the natural circulation of M. bovis for a long time period, as well as multiple introduction events of the pathogen in this Iberian multi-host system.

