SomatoSim: precision simulation of somatic single nucleotide variants

2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Marwan A. Hawari ◽  
Celine S. Hong ◽  
Leslie G. Biesecker

Abstract Background Somatic single nucleotide variants have gained increased attention because of their role in cancer development and the widespread use of high-throughput sequencing techniques. The necessity to accurately identify these variants in sequencing data has led to a proliferation of somatic variant calling tools. Additionally, the use of simulated data to assess the performance of these tools has become common practice, as there is no gold standard dataset for benchmarking performance. However, many existing somatic variant simulation tools are limited because they rely on generating entirely synthetic reads derived from a reference genome or because they do not allow for the precise customizability that would enable a more focused understanding of single nucleotide variant calling performance. Results SomatoSim is a tool that lets users simulate somatic single nucleotide variants in sequence alignment map (SAM/BAM) files with full control of the specific variant positions, number of variants, variant allele fractions, depth of coverage, read quality, and base quality, among other parameters. SomatoSim accomplishes this through a three-stage process: variant selection, where candidate positions are selected for simulation, variant simulation, where reads are selected and mutated, and variant evaluation, where SomatoSim summarizes the simulation results. Conclusions SomatoSim is a user-friendly tool that offers a high level of customizability for simulating somatic single nucleotide variants. SomatoSim is available at https://github.com/BieseckerLab/SomatoSim.
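As a rough illustration of the variant simulation stage described above, the sketch below shows how a target variant allele fraction (VAF) and a depth of coverage could translate into a number of reads to mutate at one position. It is not SomatoSim's implementation: the function names (`reads_to_mutate`, `pick_reads`, `achieved_vaf`), the rounding rule, and the seeded random selection are all assumptions made for this example, and reads are represented only by their names rather than by real SAM/BAM records.

```python
import random


def reads_to_mutate(depth: int, target_vaf: float) -> int:
    """Return how many reads must carry the alternate allele so that
    mutated / depth approximates target_vaf (hypothetical helper)."""
    if depth <= 0:
        raise ValueError("depth must be positive")
    if not 0.0 <= target_vaf <= 1.0:
        raise ValueError("target_vaf must be between 0 and 1")
    return round(depth * target_vaf)


def pick_reads(read_names, target_vaf: float, seed: int = 0):
    """Randomly choose which reads covering a position would be mutated.

    read_names stands in for the reads overlapping the position; a real
    tool would also filter on mapping quality and base quality first.
    """
    names = list(read_names)
    rng = random.Random(seed)  # seeded so the selection is reproducible
    n = reads_to_mutate(len(names), target_vaf)
    return sorted(rng.sample(names, n))


def achieved_vaf(n_mutated: int, depth: int) -> float:
    """VAF actually obtained after mutating n_mutated of depth reads."""
    return n_mutated / depth


if __name__ == "__main__":
    reads = [f"read_{i}" for i in range(200)]
    chosen = pick_reads(reads, target_vaf=0.05)
    print(len(chosen), achieved_vaf(len(chosen), len(reads)))
```

Because the number of mutated reads has to be a whole number, the achieved VAF can differ slightly from the target at low depth, which is one reason a final evaluation stage that reports the simulated result is useful.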

2020 ◽  
Vol 36 (9) ◽  
pp. 2725-2730
Author(s):  
Keisuke Shimmura ◽  
Yuki Kato ◽  
Yukio Kawahara

Abstract Motivation Genetic variant calling with high-throughput sequencing data has been recognized as a useful tool for better understanding of disease mechanism and detection of potential off-target sites in genome editing. Since most of the variant calling algorithms rely on initial mapping onto a reference genome and tend to predict many variant candidates, variant calling remains challenging in terms of predicting variants with low false positives. Results Here we present Bivartect, a simple yet versatile variant caller based on direct comparison of short sequence reads between normal and mutated samples. Bivartect can detect not only single nucleotide variants but also insertions/deletions, inversions and their complexes. Bivartect achieves high predictive performance with an elaborate memory-saving mechanism, which allows Bivartect to run on a computer with a single node for analyzing small omics data. Tests with simulated benchmark and real genome-editing data indicate that Bivartect was comparable to state-of-the-art variant callers in positive predictive value for detection of single nucleotide variants, even though it yielded a substantially small number of candidates. These results suggest that Bivartect, a reference-free approach, will contribute to the identification of germline mutations as well as off-target sites introduced during genome editing with high accuracy. Availability and implementation Bivartect is implemented in C++ and available along with in silico simulated data at https://github.com/ykat0/bivartect. Supplementary information Supplementary data are available at Bioinformatics online.


2019 ◽  
Vol 36 (3) ◽  
pp. 713-720 ◽  
Author(s):  
Mary A Wood ◽  
Austin Nguyen ◽  
Adam J Struck ◽  
Kyle Ellrott ◽  
Abhinav Nellore ◽  
...  

Abstract Motivation The vast majority of tools for neoepitope prediction from DNA sequencing of complementary tumor and normal patient samples do not consider germline context or the potential for the co-occurrence of two or more somatic variants on the same mRNA transcript. Without consideration of these phenomena, existing approaches are likely to produce both false-positive and false-negative results, resulting in an inaccurate and incomplete picture of the cancer neoepitope landscape. We developed neoepiscope chiefly to address this issue for single nucleotide variants (SNVs) and insertions/deletions (indels). Results Herein, we illustrate how germline and somatic variant phasing affects neoepitope prediction across multiple datasets. We estimate that up to ∼5% of neoepitopes arising from SNVs and indels may require variant phasing for their accurate assessment. neoepiscope is performant, flexible and supports several major histocompatibility complex binding affinity prediction tools. Availability and implementation neoepiscope is available on GitHub at https://github.com/pdxgx/neoepiscope under the MIT license. Scripts for reproducing results described in the text are available at https://github.com/pdxgx/neoepiscope-paper under the MIT license. Additional data from this study, including summaries of variant phasing incidence and benchmarking wallclock times, are available in Supplementary Files 1, 2 and 3. Supplementary File 1 contains Supplementary Table 1, Supplementary Figures 1 and 2, and descriptions of Supplementary Tables 2–8. Supplementary File 2 contains Supplementary Tables 2–6 and 8. Supplementary File 3 contains Supplementary Table 7. 
Raw sequencing data used for the analyses in this manuscript are available from the Sequence Read Archive under accessions PRJNA278450, PRJNA312948, PRJNA307199, PRJNA343789, PRJNA357321, PRJNA293912, PRJNA369259, PRJNA305077, PRJNA306070, PRJNA82745 and PRJNA324705; from the European Genome-phenome Archive under accessions EGAD00001004352 and EGAD00001002731; and by direct request to the authors. Supplementary information Supplementary data are available at Bioinformatics online.


2019 ◽  
Vol 4 ◽  
pp. 145
Author(s):  
Matthew N. Wakeling ◽  
Thomas W. Laver ◽  
Kevin Colclough ◽  
Andrew Parish ◽  
Sian Ellard ◽  
...  

Multiple nucleotide variants (MNVs) are miscalled by the most widely used next-generation sequencing (NGS) analysis pipelines, creating the potential to miss diagnoses that would previously have been made by standard Sanger (dideoxy) sequencing. These variants, which should be treated as a single insertion-deletion mutation event, are commonly called as separate single nucleotide variants. This can result in misannotation, incorrect amino acid predictions, and potentially false-positive and false-negative diagnostic results. This risk will increase as confirmatory Sanger sequencing of single nucleotide variants (SNVs) ceases to be standard practice. Using simulated data and re-analysis of sequencing data from a diagnostic targeted gene panel, we demonstrate that the widely adopted pipeline, GATK best practices, results in miscalling of MNVs and that alternative tools can call these variants correctly. The adoption of calling methods that annotate MNVs correctly would present a solution for individual laboratories; however, GATK best practices are the basis for important public resources such as the gnomAD database. We suggest that integrating a solution into these guidelines would be the optimal approach.


GigaScience ◽  
2019 ◽  
Vol 8 (8) ◽  
Author(s):  
David R Greig ◽  
Claire Jenkins ◽  
Saheer Gharbia ◽  
Timothy J Dallman

Abstract Background We aimed to compare Illumina and Oxford Nanopore Technology sequencing data from the 2 isolates of Shiga toxin–producing Escherichia coli (STEC) O157:H7 to determine whether concordant single-nucleotide variants were identified and whether inference of relatedness was consistent with the 2 technologies. Results For the Illumina workflow, the time from DNA extraction to availability of results was ∼40 hours, whereas with the ONT workflow serotyping and Shiga toxin subtyping variant identification were available within 7 hours. After optimization of the ONT variant filtering, on average 95% of the discrepant positions between the technologies were accounted for by methylated positions found in the described 5-methylcytosine motif sequences, CC(A/T)GG. Of the few discrepant variants (6 and 7 difference for the 2 isolates) identified by the 2 technologies, it is likely that both methodologies contain false calls. Conclusions Despite these discrepancies, Illumina and Oxford Nanopore Technology sequences from the same case were placed on the same phylogenetic location against a dense reference database of STEC O157:H7 genomes sequenced using the Illumina workflow. Robust single-nucleotide polymorphism typing using MinION-based variant calling is possible, and we provide evidence that the 2 technologies can be used interchangeably to type STEC O157:H7 in a public health setting.


2021 ◽  
Vol 12 (1) ◽  
Author(s):  
David Lähnemann ◽  
Johannes Köster ◽  
Ute Fischer ◽  
Arndt Borkhardt ◽  
Alice C. McHardy ◽  
...  

Abstract Accurate single cell mutational profiles can reveal genomic cell-to-cell heterogeneity. However, sequencing libraries suitable for genotyping require whole genome amplification, which introduces allelic bias and copy errors. The resulting data violates assumptions of variant callers developed for bulk sequencing. Thus, only dedicated models accounting for amplification bias and errors can provide accurate calls. We present ProSolo for calling single nucleotide variants from multiple displacement amplified (MDA) single cell DNA sequencing data. ProSolo probabilistically models a single cell jointly with a bulk sequencing sample and integrates all relevant MDA biases in a site-specific and scalable (because computationally efficient) manner. This achieves a higher accuracy in calling and genotyping single nucleotide variants in single cells in comparison to state-of-the-art tools and supports imputation of insufficiently covered genotypes, when downstream tools cannot handle missing data. Moreover, ProSolo implements the first approach to control the false discovery rate reliably and flexibly. ProSolo is implemented in an extendable framework, with code and usage at: https://github.com/prosolo/prosolo


2018 ◽  
Author(s):  
Bernt Popp ◽  
Mandy Krumbiegel ◽  
Janina Grosch ◽  
Annika Sommer ◽  
Steffen Uebe ◽  
...  

Abstract Genetic integrity of induced pluripotent stem cells (iPSCs) is essential for their validity as disease models and for potential therapeutic use. We describe the comprehensive analysis in the ForIPS consortium: an iPSC collection from donors with neurological diseases and healthy controls. Characterization included pluripotency confirmation, fingerprinting, conventional and molecular karyotyping in all lines. In the majority, somatic copy number variants (CNVs) were identified. A subset with available matched donor DNA was selected for comparative exome sequencing. We identified single nucleotide variants (SNVs) at different allelic frequencies in each clone with high variability in mutational load. Low frequencies of variants in parental fibroblasts highlight the importance of germline samples. Somatic variant number was independent from reprogramming, cell type and passage. Comparison with disease genes and prediction scores suggest biological relevance for some variants. We show that high-throughput sequencing has value beyond SNV detection and the requirement to individually evaluate each clone.


2021 ◽  
Vol 12 (1) ◽  
Author(s):  
Chuanyi Zhang ◽  
Mohammed El-Kebir ◽  
Idoia Ochoa

Abstract Intra-tumor heterogeneity renders the identification of somatic single-nucleotide variants (SNVs) a challenging problem. In particular, low-frequency SNVs are hard to distinguish from sequencing artifacts. While the increasing availability of multi-sample tumor DNA sequencing data holds the potential for more accurate variant calling, there is a lack of high-sensitivity multi-sample SNV callers that utilize these data. Here we report Moss, a method to identify low-frequency SNVs that recur in multiple sequencing samples from the same tumor. Moss provides any existing single-sample SNV caller the ability to support multiple samples with little additional time overhead. We demonstrate that Moss improves recall while maintaining high precision in a simulated dataset. On multi-sample hepatocellular carcinoma, acute myeloid leukemia and colorectal cancer datasets, Moss identifies new low-frequency variants that meet manual review criteria and are consistent with the tumor’s mutational signature profile. In addition, Moss detects the presence of variants in more samples of the same tumor than reported by the single-sample caller. Moss’ improved sensitivity in SNV calling will enable more detailed downstream analyses in cancer genomics.


2019 ◽  
Author(s):  
David R Greig ◽  
Claire Jenkins ◽  
Saheer Gharbia ◽  
Timothy J Dallman

Abstract Background We aimed to compare Illumina and Oxford Nanopore Technology (ONT) sequencing data from the two isolates of STEC O157:H7 to determine whether concordant single nucleotide variants were identified and whether inference of relatedness was consistent with the two technologies. Results For the Illumina workflow, the time from DNA extraction to availability of results was approximately 40 hours, in comparison to the ONT workflow where serotyping, Shiga toxin subtyping variant identification were available within seven hours. After optimisation of the ONT variant filtering, on average 95% of the discrepant positions between the technologies were accounted for by methylated positions found in the described 5-Methylcytosine motif sequences, CC(A/T)GG. Of the few discrepant variants (6 and 7 difference for the two isolates) identified by the two technologies, it is likely that both methodologies contain false calls. Conclusions Despite these discrepancies, Illumina and ONT sequences from the same case were placed on the same phylogenetic location against a dense reference database of STEC O157:H7 genomes sequenced using the Illumina workflow. Robust SNP typing using MinION-based variant calling is possible and we provide evidence that the two technologies can be used interchangeably to type STEC O157:H7 in a public health setting.


2019 ◽  
Author(s):  
Harald Detering ◽  
Laura Tomás ◽  
Tamara Prieto ◽  
David Posada

Abstract Multiregional bulk sequencing data is necessary to characterize intratumor genetic heterogeneity. Novel somatic variant calling approaches aim to address the particular characteristics of multiregional data, but it remains unclear to what extent they improve on single-sample strategies. Here we compared the performance of 16 single-nucleotide variant calling approaches on multiregional sequencing data under different scenarios with in-silico and real sequencing reads, including varying sequencing coverage and increasing levels of spatial clonal admixture. Under the conditions simulated, methods that use information across multiple samples do not necessarily perform better than some of the standard calling methods that work sample by sample. Nonetheless, our results indicate that under difficult conditions, Mutect2 in multisample mode, in combination with a correction step, seems to perform best. Our analysis provides data-driven guidance for users and developers of somatic variant calling tools.

