A robust benchmark for evaluating and improving mosaic variant calling strategies

Abstract The rapid advances in sequencing and analysis technologies have enabled the accurate detection of diverse forms of genomic variants, including germline, somatic, and mosaic mutations. However, unlike for the former two mutations, the best practices for mosaic variant calling still remain chaotic due to the technical and conceptual difficulties faced in evaluation. Here, we present our benchmark of nine feasible strategies for mosaic variant detection based on a systematically designed reference standard that mimics mosaic samples, with 390,153 control positive and 35,208,888 negative single-nucleotide variants and insertion–deletion mutations. We identified the condition-dependent strengths and weaknesses of the current strategies, instead of a single winner, regarding variant allele frequencies, variant sharing, and the usage of control samples. Moreover, feature-level investigation directs the way for immediate to prolonged improvements in mosaic variant calling. Our results will guide researchers in selecting suitable calling algorithms and suggest future strategies for developers.
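As a minimal sketch of how a caller might be scored against a reference standard with positive and negative control variants, the snippet below computes precision, recall, and F1. The representation of variants as `(chrom, pos, ref, alt)` tuples and the function name `score_calls` are assumptions made for illustration; this is not the benchmark's own tooling.

```python
from typing import NamedTuple, Set, Tuple

# Hypothetical variant key: (chrom, 1-based pos, ref, alt)
Variant = Tuple[str, int, str, str]


class Scores(NamedTuple):
    precision: float
    recall: float
    f1: float


def score_calls(calls: Set[Variant],
                positives: Set[Variant],
                negatives: Set[Variant]) -> Scores:
    """Score a call set against positive and negative control variants.

    Calls that fall in neither control set are ignored, because their
    truth status is unknown.
    """
    tp = len(calls & positives)   # control positives that were called
    fp = len(calls & negatives)   # control negatives that were called
    fn = len(positives - calls)   # control positives that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return Scores(precision, recall, f1)
```

Stratifying the inputs by variant allele frequency bin before calling `score_calls` would give the kind of condition-dependent comparison the abstract describes.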

Unsuspected somatic mosaicism for FBN1 gene contributes to Marfan syndrome

Genetics in Medicine ◽

10.1038/s41436-020-01078-6 ◽

2021 ◽

Author(s):

Pauline Arnaud ◽

Hélène Morel ◽

Olivier Milleron ◽

Laurent Gouya ◽

Christine Francannet ◽

...

Keyword(s):

Marfan Syndrome ◽

Somatic Mosaicism ◽

Variant Calling ◽

Copy Number Variations ◽

Pathogenic Variant ◽

Single Nucleotide Variants ◽

Bioinformatics Analyses ◽

Single Nucleotide ◽

Fbn1 Gene ◽

Pathogenic Variants

Abstract Purpose Individuals with mosaic pathogenic variants in the FBN1 gene are mainly described in the course of familial screening. In the literature, almost all these mosaic individuals are asymptomatic. In this study, we report the experience of our team on more than 5,000 Marfan syndrome (MFS) probands. Methods Next-generation sequencing (NGS) capture technology allowed us to identify five cases of MFS probands who harbored a mosaic pathogenic variant in the FBN1 gene. Results These five sporadic mosaic probands displayed classical features usually seen in Marfan syndrome. Combined with the results of the literature, these rare findings concerned both single-nucleotide variants and copy-number variations. Conclusion This underestimated finding should not be overlooked in the molecular diagnosis of MFS patients and warrants an adaptation of the parameters used in bioinformatics analyses. The five present cases of symptomatic MFS probands harboring a mosaic FBN1 pathogenic variant reinforce the fact that apparently asymptomatic mosaic parents should have a complete clinical examination and a regular cardiovascular follow-up. We advise that individuals with a typical MFS for whom no single-nucleotide pathogenic variant or exon deletion/duplication was identified should be tested by NGS capture panel with an adapted variant calling analysis.

scSNV: accurate dscRNA-seq SNV co-expression analysis using duplicate tag collapsing

Genome Biology ◽

10.1186/s13059-021-02364-5 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Gavin W. Wilson ◽

Mathieu Derouet ◽

Gail E. Darling ◽

Jonathan C. Yeung

Keyword(s):

Genetic Variants ◽

False Positive ◽

Variant Calling ◽

Call Rate ◽

Rna Seq ◽

Single Nucleotide Variants ◽

Single Nucleotide ◽

Variant Call ◽

Two Samples ◽

Co Detection

Abstract Identifying single nucleotide variants has become common practice for droplet-based single-cell RNA-seq experiments; however, presently, a pipeline does not exist to maximize variant calling accuracy. Furthermore, molecular duplicates generated in these experiments have not been utilized to optimally detect variant co-expression. Herein, we introduce scSNV, designed from the ground up to “collapse” molecular duplicates and accurately identify variants and their co-expression. We demonstrate that scSNV is fast, with a reduced false-positive variant call rate, and enables the co-detection of genetic variants and A>G RNA edits across twenty-two samples.
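The idea of collapsing molecular duplicates can be illustrated with a small sketch: reads that share a cell barcode and UMI are treated as copies of one molecule, and a single consensus base is kept per molecule. The majority-vote rule, the tie handling, and the function name `collapse_duplicates` are assumptions for illustration only and do not reproduce scSNV's actual algorithm.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def collapse_duplicates(
    observations: Iterable[Tuple[str, str, str]]
) -> Dict[Tuple[str, str], str]:
    """Collapse reads at one genomic position to one base per molecule.

    observations: (cell_barcode, umi, base) for each read covering the site.
    Returns a consensus base per (cell_barcode, umi); ties become 'N'.
    """
    by_molecule = defaultdict(Counter)
    for cell, umi, base in observations:
        by_molecule[(cell, umi)][base] += 1

    consensus = {}
    for molecule, counts in by_molecule.items():
        ranked = counts.most_common(2)
        # A tie between the two most frequent bases is left ambiguous.
        if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
            consensus[molecule] = "N"
        else:
            consensus[molecule] = ranked[0][0]
    return consensus
```

Counting consensus bases per molecule rather than per read keeps a PCR-amplified error from being counted several times in the same cell.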

Clonal propagation history shapes the intra-cultivar genetic diversity in ‘Malbec’ grapevines

10.1101/2020.10.27.356790 ◽

2020 ◽

Author(s):

Luciano Calderón ◽

Nuria Mauri ◽

Claudio Muñoz ◽

Pablo Carbonell-Bejerano ◽

Laura Bree ◽

...

Keyword(s):

Genetic Diversity ◽

Clonal Propagation ◽

Variant Calling ◽

Red Wines ◽

Single Nucleotide Variants ◽

Single Nucleotide ◽

Diversity Pattern ◽

History Of ◽

Potential Impact ◽

Whole Genome Resequencing

Abstract Grapevine (Vitis vinifera L.) cultivars are clonally propagated to preserve their varietal attributes. However, novel genetic variation still accumulates due to somatic mutations. To study the potential impact of clonal propagation history on grapevine intra-cultivar genetic diversity, we focused on ‘Malbec’. This cultivar is valued for red wine production; it originated in southwestern France and was introduced into Argentina during the 1850s. Here, we generated whole-genome resequencing data for four ‘Malbec’ clones with different historical backgrounds. A stringent variant calling procedure was established to identify reliable clonal polymorphisms, which were additionally corroborated by Sanger sequencing. This analysis retrieved 941 single nucleotide variants (SNVs) among the analyzed clones. Based on a set of validated SNVs, a genotyping experiment was custom-designed to survey ‘Malbec’ genetic diversity. We successfully genotyped 214 samples and identified 14 different clonal genotypes that clustered into two genetically divergent groups. Group-Ar was driven by clones with a long history of clonal propagation in Argentina, while Group-Fr was driven by clones that have remained longer in Europe. These findings show the utility of such approaches for identifying clonal genotypes in grapevines. In particular, we provide evidence on how human actions may have shaped the extant genetic diversity pattern of ‘Malbec’.

Misannotation of multiple-nucleotide variants risks misdiagnosis

Wellcome Open Research ◽

10.12688/wellcomeopenres.15420.1 ◽

2019 ◽

Vol 4 ◽

pp. 145

Author(s):

Matthew N. Wakeling ◽

Thomas W. Laver ◽

Kevin Colclough ◽

Andrew Parish ◽

Sian Ellard ◽

...

Keyword(s):

Best Practices ◽

False Negative ◽

Simulated Data ◽

Sequencing Analysis ◽

Sequencing Data ◽

Single Nucleotide Variants ◽

Single Nucleotide ◽

Public Resources ◽

Next Generation Sequencing Analysis ◽

Optimal Approach

Multiple nucleotide variants (MNVs) are miscalled by the most widely utilised next-generation sequencing (NGS) analysis pipelines, presenting the potential for missing diagnoses that would previously have been made by standard Sanger (dideoxy) sequencing. These variants, which should be treated as a single insertion-deletion mutation event, are commonly called as separate single nucleotide variants. This can result in misannotation, incorrect amino acid predictions and potentially false positive and false negative diagnostic results. This risk will increase as confirmatory Sanger sequencing of single nucleotide variants (SNVs) ceases to be standard practice. Using simulated data and re-analysis of sequencing data from a diagnostic targeted gene panel, we demonstrate that the widely adopted pipeline, GATK best practices, results in miscalling of MNVs and that alternative tools can call these variants correctly. The adoption of calling methods that annotate MNVs correctly would present a solution for individual laboratories; however, GATK best practices are the basis for important public resources such as the gnomAD database. We suggest that integrating a solution into these guidelines would be the optimal approach.
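To make the miscalling problem concrete, the sketch below merges SNVs at directly adjacent positions into a single multi-base record, so that downstream annotation sees one event rather than several. It assumes the adjacent SNVs lie on the same haplotype, which a real tool would have to confirm from read-level or phasing evidence; the function name `merge_adjacent_snvs` and the tuple layout are hypothetical and are not taken from GATK or any other pipeline named in the abstract.

```python
from typing import List, Tuple

# Hypothetical record: (chrom, 1-based pos, ref, alt)
Snv = Tuple[str, int, str, str]


def merge_adjacent_snvs(snvs: List[Snv]) -> List[Snv]:
    """Merge SNVs at consecutive positions into single MNV records.

    Assumption: adjacent variants are in cis (same haplotype). This is
    NOT checked here and must be verified from the reads in practice.
    """
    merged: List[Snv] = []
    for chrom, pos, ref, alt in sorted(snvs):
        if merged:
            p_chrom, p_pos, p_ref, p_alt = merged[-1]
            # The next variant starts exactly where the previous one ends.
            if chrom == p_chrom and pos == p_pos + len(p_ref):
                merged[-1] = (p_chrom, p_pos, p_ref + ref, p_alt + alt)
                continue
        merged.append((chrom, pos, ref, alt))
    return merged
```

For example, two calls `("chr1", 100, "G", "A")` and `("chr1", 101, "C", "T")` become the single record `("chr1", 100, "GC", "AT")`, which can change the predicted amino acid when both bases fall in one codon.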

Haplotype-aware variant calling enables high accuracy in nanopore long-reads using deep neural networks

10.1101/2021.03.04.433952 ◽

2021 ◽

Author(s):

Kishwar Shafin ◽

Trevor Pesout ◽

Pi-Chuan Chang ◽

Maria Nattestad ◽

Alexey Kolesnikov ◽

...

Keyword(s):

De Novo ◽

Sequence Data ◽

Variant Calling ◽

High Accuracy ◽

Superior Performance ◽

Read Length ◽

Single Nucleotide Variants ◽

Single Nucleotide ◽

Short Read ◽

Long Read

Long-read sequencing has the potential to transform variant detection by reaching currently difficult-to-map regions and routinely linking together adjacent variations to enable read-based phasing. Third-generation nanopore sequence data has demonstrated a long read length, but current interpretation methods for its novel pore-based signal have unique error profiles, making accurate analysis challenging. Here, we introduce a haplotype-aware variant calling pipeline, PEPPER-Margin-DeepVariant, that produces state-of-the-art variant calling results with nanopore data. We show that our nanopore-based method outperforms the short-read-based single nucleotide variant identification method at the whole-genome scale and produces high-quality single nucleotide variants in segmental duplications and low-mappability regions where short-read-based genotyping fails. We show that our pipeline can provide highly contiguous phase blocks across the genome with nanopore reads, contiguously spanning between 85% and 92% of annotated genes across six samples. We also extend PEPPER-Margin-DeepVariant to PacBio HiFi data, providing an efficient solution with performance superior to the current WhatsHap-DeepVariant standard. Finally, we demonstrate de novo assembly polishing methods that use nanopore and PacBio HiFi reads to produce diploid assemblies with high accuracy (Q35+ nanopore-polished and Q40+ PacBio-HiFi-polished).

Single-nucleotide variants in human RNA: RNA editing and beyond

Briefings in Functional Genomics ◽

10.1093/bfgp/ely032 ◽

2018 ◽

Vol 18 (1) ◽

pp. 30-39 ◽

Cited By ~ 4

Author(s):

Yan Guo ◽

Hui Yu ◽

David C Samuels ◽

Wei Yue ◽

Scott Ness ◽

...

Keyword(s):

Rna Editing ◽

Rna Seq ◽

Rna Modifications ◽

Single Nucleotide Variants ◽

Single Nucleotide ◽

Genomic Variants ◽

High Prevalence ◽

History Of ◽

Gene Expression Quantification

Abstract Through analysis of paired high-throughput DNA-Seq and RNA-Seq data, researchers quickly recognized that RNA-Seq can be used for more than just gene expression quantification. The alternative applications of RNA-Seq data are abundant, and we are particularly interested in its usefulness for detecting single-nucleotide variants, which arise from RNA editing, genomic variants and other RNA modifications. A stunning discovery made from RNA-Seq analyses is the unexpectedly high prevalence of RNA-editing events, many of which cannot be explained by known RNA-editing mechanisms. Over the past 6–7 years, substantial efforts have been made to maximize the potential of RNA-Seq data. In this review we describe the controversial history of mining RNA-editing events from RNA-Seq data and the corresponding development of methodologies to identify, predict, assess the quality of and catalog RNA-editing events as well as genomic variants.

SomatoSim: precision simulation of somatic single nucleotide variants

BMC Bioinformatics ◽

10.1186/s12859-021-04024-8 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Marwan A. Hawari ◽

Celine S. Hong ◽

Leslie G. Biesecker

Keyword(s):

High Throughput Sequencing ◽

Variant Calling ◽

Simulated Data ◽

Sequencing Data ◽

Single Nucleotide Variants ◽

Single Nucleotide ◽

Somatic Variant ◽

Simulation Tools ◽

Gold Standard Dataset ◽

High Level

Abstract Background Somatic single nucleotide variants have gained increased attention because of their role in cancer development and the widespread use of high-throughput sequencing techniques. The necessity to accurately identify these variants in sequencing data has led to a proliferation of somatic variant calling tools. Additionally, the use of simulated data to assess the performance of these tools has become common practice, as there is no gold standard dataset for benchmarking performance. However, many existing somatic variant simulation tools are limited because they rely on generating entirely synthetic reads derived from a reference genome or because they do not allow for the precise customizability that would enable a more focused understanding of single nucleotide variant calling performance. Results SomatoSim is a tool that lets users simulate somatic single nucleotide variants in sequence alignment map (SAM/BAM) files with full control of the specific variant positions, number of variants, variant allele fractions, depth of coverage, read quality, and base quality, among other parameters. SomatoSim accomplishes this through a three-stage process: variant selection, where candidate positions are selected for simulation, variant simulation, where reads are selected and mutated, and variant evaluation, where SomatoSim summarizes the simulation results. Conclusions SomatoSim is a user-friendly tool that offers a high level of customizability for simulating somatic single nucleotide variants. SomatoSim is available at https://github.com/BieseckerLab/SomatoSim.
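The variant simulation stage described above can be sketched on a toy scale: given the base observed at one position in each covering read, a subset of reads is chosen and mutated so that the alternate allele reaches a target variant allele fraction. This is an illustrative simplification, not SomatoSim's implementation; the real tool works on SAM/BAM alignments, and the function name `spike_in_snv` and its parameters are hypothetical.

```python
import random
from typing import List


def spike_in_snv(bases: List[str], alt: str, target_vaf: float,
                 seed: int = 0) -> List[str]:
    """Mutate a random subset of reads at one position to the alt base.

    bases: the base observed at the position in each covering read.
    target_vaf: desired fraction of reads carrying the alt allele.
    Assumes no read already carries the alt base.
    """
    n_alt = round(target_vaf * len(bases))     # reads to mutate
    rng = random.Random(seed)                  # seeded for reproducibility
    chosen = set(rng.sample(range(len(bases)), n_alt))
    return [alt if i in chosen else base for i, base in enumerate(bases)]
```

An evaluation step would then recount the alt bases and compare the achieved fraction with the requested one, since rounding to a whole number of reads means the two can differ slightly at low depth.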

A short plus long-amplicon based sequencing approach improves genomic coverage and variant detection in the SARS-CoV-2 genome

PLoS ONE ◽

10.1371/journal.pone.0261014 ◽

2022 ◽

Vol 17 (1) ◽

pp. e0261014

Author(s):

Carlos Arana ◽

Chaoying Liang ◽

Matthew Brock ◽

Bo Zhang ◽

Jinchun Zhou ◽

...

Keyword(s):

Virus Genome ◽

Positive Control ◽

Nasopharyngeal Swab ◽

Single Nucleotide Variants ◽

Single Nucleotide ◽

Spike Gene ◽

Synonymous Mutations ◽

Variant Detection ◽

Variant Analysis ◽

New Mutations

High viral transmission in the COVID-19 pandemic has enabled SARS‐CoV‐2 to acquire new mutations that may impact genome sequencing methods. The ARTIC.v3 primer pool that amplifies short amplicons in a multiplex-PCR reaction is one of the most widely used methods for sequencing the SARS-CoV-2 genome. We observed that some genomic intervals are poorly captured with ARTIC primers. To improve the genomic coverage and variant detection across these intervals, we designed long amplicon primers and evaluated the performance of a short (ARTIC) plus long amplicon (MRL) sequencing approach. Sequencing assays were optimized on VR-1986D-ATCC RNA followed by sequencing of nasopharyngeal swab specimens from fifteen COVID-19 positive patients. ARTIC data covered 94.47% of the virus genome fraction in the positive control and patient samples. Variant analysis in the ARTIC data detected 217 mutations, including 209 single nucleotide variants (SNVs) and eight insertions & deletions. On the other hand, long-amplicon data detected 156 mutations, of which 80% were concordant with ARTIC data. Combined analysis of ARTIC + MRL data improved the genomic coverage to 97.03% and identified 214 high confidence mutations. The combined final set of 214 mutations included 203 SNVs, 8 deletions and 3 insertions. Analysis showed 26 SARS-CoV-2 lineage defining mutations including 4 known variants of concern K417N, E484K, N501Y, P618H in spike gene. Hybrid analysis identified 7 nonsynonymous and 5 synonymous mutations across the genome that were either ambiguous or not called in ARTIC data. For example, G172V mutation in the ORF3a protein and A2A mutation in Membrane protein were missed by the ARTIC assay. Thus, we show that while the short amplicon (ARTIC) assay provides good genomic coverage with high throughput, complementation of poorly captured intervals with long amplicon data can significantly improve SARS-CoV-2 genomic coverage and variant detection.

UMI-Gen: a UMI-based reads simulator for variant calling evaluation in paired-end sequencing NGS libraries

10.1101/2020.04.22.027532 ◽

2020 ◽

Author(s):

Vincent Sater ◽

Pierre-Julien Viailly ◽

Thierry Lecroq ◽

Philippe Ruminy ◽

Caroline Bérard ◽

...

Keyword(s):

Variant Calling ◽

Copy Number Variations ◽

Biological Data ◽

Single Nucleotide Variants ◽

Background Error ◽

Single Nucleotide ◽

Low Frequencies ◽

Paired End Sequencing ◽

Very High ◽

Generation Sequencing

Abstract Motivation: With Next Generation Sequencing (NGS) becoming more affordable every year, NGS technologies have asserted themselves as the fastest and most reliable way to detect Single Nucleotide Variants (SNV) and Copy Number Variations (CNV) in cancer patients. These technologies can be used to sequence DNA at very high depths, thus allowing the detection of abnormalities in tumor cells at very low frequencies. Many different variant callers are publicly available and usually perform well at calling variants. However, when frequencies begin to drop under 1%, the specificity of these tools suffers greatly, as true variants at very low frequencies can be easily confused with sequencing or PCR artifacts. The recent use of Unique Molecular Identifiers (UMI) in NGS experiments offered a way to accurately separate true variants from artifacts. UMI-based variant callers are slowly replacing raw-reads-based variant callers as the standard method for accurate detection of variants at very low frequencies. However, the benchmarking reported in the tools' publications is usually performed on real biological data in which the true variants are not known, making it difficult to assess their accuracy. Results: We present UMI-Gen, a UMI-based reads simulator for targeted sequencing paired-end data. UMI-Gen generates reference reads covering the targeted regions at a user-customizable depth. After that, using a number of control files, it estimates the background error rate at each position and then modifies the generated reads to mimic real biological data. Finally, it inserts real variants in the reads from a list provided by the user. Availability: The entire pipeline is available at https://gitlab.com/vincent-sater/umigen-master under the MIT licence.
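One simple way to estimate a per-position background error rate from control samples is to pool base counts across the controls and take the non-reference fraction. The sketch below does exactly that; the input layout (one dictionary of per-position base counts per control sample) and the function name `background_error_rates` are assumptions for illustration, and UMI-Gen's actual estimator may differ.

```python
from typing import Dict, List


def background_error_rates(
    control_counts: List[Dict[int, Dict[str, int]]],
    reference: Dict[int, str],
) -> Dict[int, float]:
    """Estimate the non-reference base fraction at each position.

    control_counts: one dict per control sample, position -> base counts.
    reference: position -> reference base.
    Positions with no coverage in any control get a rate of 0.0.
    """
    rates: Dict[int, float] = {}
    for pos, ref_base in reference.items():
        total = 0
        non_ref = 0
        for sample in control_counts:
            counts = sample.get(pos, {})
            depth = sum(counts.values())
            total += depth
            non_ref += depth - counts.get(ref_base, 0)
        rates[pos] = non_ref / total if total else 0.0
    return rates
```

This pooling assumes the controls carry no true variants at the targeted positions; a control with a real variant would inflate the estimate there.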

Whole-genome sequencing of a Brazilian naturalized horse breed resistant to arid climate for identifying single nucleotide variants and insertions/deletions

10.21203/rs.3.rs-21956/v1 ◽

2020 ◽

Author(s):

Danielle Cunha Cardoso ◽

Eduardo Geraldo Alves Coelho ◽

Brenda Neves Porto ◽

Glacy Silva ◽

Denea de Araújo Fernandes Pires ◽

...

Keyword(s):

Variant Calling ◽

Olfactory Receptors ◽

Enrichment Analysis ◽

Functional Enrichment Analysis ◽

Functional Enrichment ◽

Single Nucleotide Variants ◽

Horse Breed ◽

Single Nucleotide ◽

Arid Conditions ◽

Single Nucleotide Variations

Abstract Background: In this study, we perform a search for variants (SNVs and InDels) in the genome of a Brazilian Naturalized horse breed, using the FreeBayes and GATK variant calling tools. This breed presents exclusive adaptive traits of extreme importance to semi-arid conditions, such as those that allow survival under excessive sunlight, rainfall, low forage availability and stony ground. Moreover, these traits are expressed without any detriment to the performance and perpetuation of the breed. Results: A total of 305,588,364 reads were mapped to the horse reference genome. FreeBayes detected 1,598,210 single nucleotide variations and 138,139 insertions/deletions, whereas GATK detected 88,838 SNVs and 25,232 InDels. Both tools were used in order to increase confidence in the variant calls, identify the regions of the genome in which they are present, and check for variants in genes possibly associated with the peculiar traits exhibited by the breed. Conclusions: Variant annotation identified numerous non-synonymous SNVs and frameshift InDels, which could affect phenotypic variation. We found 28 and 392 Ensembl gene IDs containing high and moderate impact SNVs, respectively, including GTPase family members, olfactory receptors, mitochondrial complex and defense genes. Functional enrichment analysis was performed and revealed that variants in the olfactory transduction pathway were overrepresented.
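Using two callers to raise confidence usually comes down to comparing their call sets. A minimal sketch follows, assuming each call set has already been parsed into a Python set of `(chrom, pos, ref, alt)` tuples with consistently normalized alleles; the function name `compare_callers` is hypothetical, and the study's own comparison procedure is not described in the abstract.

```python
from typing import Dict, Set, Tuple

# Hypothetical variant key: (chrom, 1-based pos, ref, alt)
Variant = Tuple[str, int, str, str]


def compare_callers(a: Set[Variant], b: Set[Variant]) -> Dict[str, object]:
    """Summarize agreement between two variant call sets."""
    shared = a & b
    union = a | b
    return {
        "shared": shared,            # called by both tools
        "only_a": a - b,             # called by the first tool only
        "only_b": b - a,             # called by the second tool only
        "jaccard": len(shared) / len(union) if union else 0.0,
    }
```

The shared set is the conservative choice for downstream annotation, while the caller-specific sets show where the tools disagree.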
