Bivartect: accurate and memory-saving breakpoint detection by direct read comparison

Keisuke Shimmura; Yuki Kato; Yukio Kawahara

doi:10.1093/bioinformatics/btaa059

Bioinformatics ◽

10.1093/bioinformatics/btaa059 ◽

2020 ◽

Vol 36 (9) ◽

pp. 2725-2730

Author(s):

Keisuke Shimmura ◽

Yuki Kato ◽

Yukio Kawahara

Keyword(s):

Genome Editing ◽

High Throughput Sequencing ◽

Variant Calling ◽

Simulated Data ◽

Supplementary Information ◽

Sequencing Data ◽

Single Nucleotide Variants ◽

Single Node ◽

Single Nucleotide ◽

Target Sites

Abstract. Motivation: Genetic variant calling with high-throughput sequencing data has been recognized as a useful tool for better understanding of disease mechanisms and for detecting potential off-target sites in genome editing. Because most variant calling algorithms rely on initial mapping onto a reference genome and tend to predict many variant candidates, predicting variants with few false positives remains challenging. Results: Here we present Bivartect, a simple yet versatile variant caller based on direct comparison of short sequence reads between normal and mutated samples. Bivartect can detect not only single nucleotide variants but also insertions/deletions, inversions and their complexes. Bivartect achieves high predictive performance with an elaborate memory-saving mechanism, which allows it to run on a single-node computer when analyzing small omics data. Tests with simulated benchmark and real genome-editing data indicate that Bivartect was comparable to state-of-the-art variant callers in positive predictive value for detection of single nucleotide variants, even though it yielded a substantially smaller number of candidates. These results suggest that Bivartect, a reference-free approach, will contribute to the identification of germline mutations as well as off-target sites introduced during genome editing with high accuracy. Availability and implementation: Bivartect is implemented in C++ and available along with in silico simulated data at https://github.com/ykat0/bivartect. Supplementary information: Supplementary data are available at Bioinformatics online.
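The abstract does not describe how the direct read comparison works internally. As a rough illustration of the reference-free idea only, the toy sketch below collects k-mers from each sample and reports those seen in the mutated sample but not in the normal one. The function names, the choice of k-mers, and the example reads are all assumptions made for illustration; this is not Bivartect's algorithm.

```python
def kmers(read, k):
    """Return the set of all length-k substrings of one read."""
    return {read[i:i + k] for i in range(len(read) - k + 1)}


def sample_kmers(reads, k):
    """Union of k-mers over every read in a sample."""
    found = set()
    for read in reads:
        found |= kmers(read, k)
    return found


def mutated_only_kmers(normal_reads, mutated_reads, k=5):
    """k-mers present in the mutated sample but absent from the normal one.

    No reference genome is consulted: the two samples are compared directly.
    """
    return sample_kmers(mutated_reads, k) - sample_kmers(normal_reads, k)


# Hypothetical reads: the first mutated read carries a G>C change at index 6.
normal = ["ACGTACGTAC", "GTACGTACGG"]
mutated = ["ACGTACCTAC", "GTACGTACGG"]
novel = mutated_only_kmers(normal, mutated, k=5)
print(sorted(novel))
```

Every k-mer reported here overlaps the changed base, which is why sample-specific k-mers can point toward a candidate breakpoint without any mapping step.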

SomatoSim: precision simulation of somatic single nucleotide variants

BMC Bioinformatics ◽

10.1186/s12859-021-04024-8 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Marwan A. Hawari ◽

Celine S. Hong ◽

Leslie G. Biesecker

Keyword(s):

High Throughput Sequencing ◽

Variant Calling ◽

Simulated Data ◽

Sequencing Data ◽

Single Nucleotide Variants ◽

Single Nucleotide ◽

Somatic Variant ◽

Simulation Tools ◽

Gold Standard Dataset ◽

High Level

Abstract. Background: Somatic single nucleotide variants have gained increased attention because of their role in cancer development and the widespread use of high-throughput sequencing techniques. The need to accurately identify these variants in sequencing data has led to a proliferation of somatic variant calling tools. Additionally, the use of simulated data to assess the performance of these tools has become common practice, as there is no gold standard dataset for benchmarking performance. However, many existing somatic variant simulation tools are limited because they rely on generating entirely synthetic reads derived from a reference genome or because they do not allow for the precise customizability that would enable a more focused understanding of single nucleotide variant calling performance. Results: SomatoSim is a tool that lets users simulate somatic single nucleotide variants in sequence alignment map (SAM/BAM) files with full control of the specific variant positions, number of variants, variant allele fractions, depth of coverage, read quality, and base quality, among other parameters. SomatoSim accomplishes this through a three-stage process: variant selection, where candidate positions are selected for simulation; variant simulation, where reads are selected and mutated; and variant evaluation, where SomatoSim summarizes the simulation results. Conclusions: SomatoSim is a user-friendly tool that offers a high level of customizability for simulating somatic single nucleotide variants. SomatoSim is available at https://github.com/BieseckerLab/SomatoSim.
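To make the link between variant allele fraction (VAF) and depth of coverage concrete, the sketch below mutates a chosen fraction of the reads covering one position. It is a simplified illustration under stated assumptions, not SomatoSim's code: the function name `mutate_reads`, the rounding rule, and the seeded random read selection are all hypothetical.

```python
import random


def mutate_reads(bases, alt, target_vaf, seed=0):
    """Simulate a variant at one position.

    bases: list of base calls at the position, one entry per covering read.
    alt: the alternate base to introduce.
    target_vaf: desired fraction of reads carrying the alternate base.
    """
    # Number of reads to mutate is the target VAF times the depth (assumed rounding).
    n_alt = round(target_vaf * len(bases))
    rng = random.Random(seed)
    chosen = set(rng.sample(range(len(bases)), n_alt))
    return [alt if i in chosen else base for i, base in enumerate(bases)]


# Hypothetical pileup: 200 reads, all showing the reference base A.
pileup = ["A"] * 200
mutated = mutate_reads(pileup, "G", target_vaf=0.05)
observed_vaf = mutated.count("G") / len(mutated)
print(mutated.count("G"), observed_vaf)
```

At low depth the achievable VAF is coarse: with 20 reads, for example, the smallest nonzero fraction is 1/20, so a requested VAF of 0.01 cannot be represented.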

neoepiscope improves neoepitope prediction with multivariant phasing

Bioinformatics ◽

10.1093/bioinformatics/btz653 ◽

2019 ◽

Vol 36 (3) ◽

pp. 713-720 ◽

Cited By ~ 5

Author(s):

Mary A Wood ◽

Austin Nguyen ◽

Adam J Struck ◽

Kyle Ellrott ◽

Abhinav Nellore ◽

...

Keyword(s):

False Negative ◽

Supplementary Information ◽

Supplementary File ◽

Sequencing Data ◽

Single Nucleotide Variants ◽

Single Nucleotide ◽

Somatic Variant ◽

Negative Results ◽

Multiple Datasets ◽

False Negative Results

Abstract. Motivation: The vast majority of tools for neoepitope prediction from DNA sequencing of complementary tumor and normal patient samples do not consider germline context or the potential for the co-occurrence of two or more somatic variants on the same mRNA transcript. Without consideration of these phenomena, existing approaches are likely to produce both false-positive and false-negative results, resulting in an inaccurate and incomplete picture of the cancer neoepitope landscape. We developed neoepiscope chiefly to address this issue for single nucleotide variants (SNVs) and insertions/deletions (indels). Results: Herein, we illustrate how germline and somatic variant phasing affects neoepitope prediction across multiple datasets. We estimate that up to ∼5% of neoepitopes arising from SNVs and indels may require variant phasing for their accurate assessment. neoepiscope is performant, flexible and supports several major histocompatibility complex binding affinity prediction tools. Availability and implementation: neoepiscope is available on GitHub at https://github.com/pdxgx/neoepiscope under the MIT license. Scripts for reproducing results described in the text are available at https://github.com/pdxgx/neoepiscope-paper under the MIT license. Additional data from this study, including summaries of variant phasing incidence and benchmarking wallclock times, are available in Supplementary Files 1, 2 and 3. Supplementary File 1 contains Supplementary Table 1, Supplementary Figures 1 and 2, and descriptions of Supplementary Tables 2–8. Supplementary File 2 contains Supplementary Tables 2–6 and 8. Supplementary File 3 contains Supplementary Table 7.
Raw sequencing data used for the analyses in this manuscript are available from the Sequence Read Archive under accessions PRJNA278450, PRJNA312948, PRJNA307199, PRJNA343789, PRJNA357321, PRJNA293912, PRJNA369259, PRJNA305077, PRJNA306070, PRJNA82745 and PRJNA324705; from the European Genome-phenome Archive under accessions EGAD00001004352 and EGAD00001002731; and by direct request to the authors. Supplementary information: Supplementary data are available at Bioinformatics online.

The discrepancy among single nucleotide variants detected by DNA and RNA high throughput sequencing data

BMC Genomics ◽

10.1186/s12864-017-4022-x ◽

2017 ◽

Vol 18 (S6) ◽

Cited By ~ 16

Author(s):

Yan Guo ◽

Shilin Zhao ◽

Quanhu Sheng ◽

David C Samuels ◽

Yu Shyr

Keyword(s):

High Throughput ◽

High Throughput Sequencing ◽

Sequencing Data ◽

Single Nucleotide Variants ◽

Single Nucleotide ◽

Dna And Rna ◽

High Throughput Sequencing Data

Misannotation of multiple-nucleotide variants risks misdiagnosis

Wellcome Open Research ◽

10.12688/wellcomeopenres.15420.1 ◽

2019 ◽

Vol 4 ◽

pp. 145

Author(s):

Matthew N. Wakeling ◽

Thomas W. Laver ◽

Kevin Colclough ◽

Andrew Parish ◽

Sian Ellard ◽

...

Keyword(s):

Best Practices ◽

False Negative ◽

Simulated Data ◽

Sequencing Analysis ◽

Sequencing Data ◽

Single Nucleotide Variants ◽

Single Nucleotide ◽

Public Resources ◽

Next Generation Sequencing Analysis ◽

Optimal Approach

Multiple nucleotide variants (MNVs) are miscalled by the most widely utilised next generation sequencing (NGS) analysis pipelines, presenting the potential for missing diagnoses that would previously have been made by standard Sanger (dideoxy) sequencing. These variants, which should be treated as a single insertion-deletion mutation event, are commonly called as separate single nucleotide variants (SNVs). This can result in misannotation, incorrect amino acid predictions and potentially false positive and false negative diagnostic results. This risk will increase as confirmatory Sanger sequencing of SNVs ceases to be standard practice. Using simulated data and re-analysis of sequencing data from a diagnostic targeted gene panel, we demonstrate that the widely adopted pipeline, GATK best practices, results in miscalling of MNVs and that alternative tools can call these variants correctly. The adoption of calling methods that annotate MNVs correctly would present a solution for individual laboratories; however, GATK best practices are the basis for important public resources such as the gnomAD database. We suggest that integrating a solution into these guidelines would be the optimal approach.
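A small worked example may help show why annotating an MNV as separate SNVs gives incorrect amino acid predictions. The sketch below uses the standard genetic code and a hypothetical reference codon (TCA, serine) with two adjacent substitutions; the codon and the variants are invented for illustration and are not taken from the paper's data.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def apply_variants(codon, variants):
    """Apply (offset, alt_base) substitutions to a codon and return the new codon."""
    bases = list(codon)
    for offset, alt in variants:
        bases[offset] = alt
    return "".join(bases)


ref_codon = "TCA"   # hypothetical reference codon
snv1 = (0, "A")     # T>A at codon position 1
snv2 = (1, "A")     # C>A at codon position 2

ref_aa = CODON_TABLE[ref_codon]
snv1_aa = CODON_TABLE[apply_variants(ref_codon, [snv1])]        # annotated alone
snv2_aa = CODON_TABLE[apply_variants(ref_codon, [snv2])]        # annotated alone
mnv_aa = CODON_TABLE[apply_variants(ref_codon, [snv1, snv2])]   # annotated together

print(ref_aa, snv1_aa, snv2_aa, mnv_aa)
```

Annotated separately, the two changes look like a Ser>Thr missense variant and a stop-gain ("*"). Annotated together as one event on the same haplotype, the codon becomes AAA, a Ser>Lys missense variant, so the separate calls yield both a wrong amino acid and a spurious stop codon.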

NGSEP3: accurate variant calling across species and sequencing protocols

Bioinformatics ◽

10.1093/bioinformatics/btz275 ◽

2019 ◽

Vol 35 (22) ◽

pp. 4716-4723 ◽

Cited By ~ 7

Author(s):

Daniel Tello ◽

Juanita Gil ◽

Cristian D Loaiza ◽

John J Riascos ◽

Nicolás Cardozo ◽

...

Keyword(s):

Short Tandem Repeats ◽

Tandem Repeats ◽

High Throughput Sequencing ◽

Variant Calling ◽

Real Data ◽

Supplementary Information ◽

Sequencing Data ◽

Comparative Accuracy ◽

Downstream Analysis ◽

Short Tandem

Abstract. Motivation: Accurate detection, genotyping and downstream analysis of genomic variants from high-throughput sequencing data are fundamental features in modern production pipelines for genetic-based diagnosis in medicine or genomic selection in plant and animal breeding. Our research group maintains the Next-Generation Sequencing Experience Platform (NGSEP) as a precise, efficient and easy-to-use software solution for these features. Results: Understanding that incorrect alignments around short tandem repeats are an important source of genotyping errors, we implemented in NGSEP new algorithms for realignment and haplotype clustering of reads spanning indels and short tandem repeats. We performed extensive benchmark experiments comparing NGSEP to state-of-the-art software using real data from three sequencing protocols and four species with different distributions of repetitive elements. NGSEP consistently shows comparable accuracy and better efficiency than the existing solutions. We expect that this work will contribute to the continuous improvement of quality in variant calling needed for modern applications in medicine and agriculture. Availability and implementation: NGSEP is available as open source software at http://ngsep.sf.net. Supplementary information: Supplementary data are available at Bioinformatics online.

xAtlas: Scalable small variant calling across heterogeneous next-generation sequencing experiments

10.1101/295071 ◽

2018 ◽

Cited By ~ 7

Author(s):

Jesse Farek ◽

Daniel Hughes ◽

Adam Mansfield ◽

Olga Krasheninina ◽

Waleed Nasser ◽

...

Keyword(s):

Next Generation Sequencing ◽

Rapid Development ◽

Variant Calling ◽

Supplementary Information ◽

Data Generation ◽

Next Generation ◽

Sequencing Data ◽

Single Nucleotide Variants ◽

Homogeneous Sample ◽

Generation Sequencing

Abstract. Motivation: The rapid development of next-generation sequencing (NGS) technologies has lowered the barriers to genomic data generation, resulting in millions of samples sequenced across diverse experimental designs. The growing volume and heterogeneity of these sequencing data complicate the further optimization of methods for identifying DNA variation, especially considering that curated high-confidence variant call sets commonly used to evaluate these methods are generally developed by reference to results from the analysis of comparatively small and homogeneous sample sets. Results: We have developed xAtlas, an application for the identification of single nucleotide variants (SNVs) and small insertions and deletions (indels) in NGS data. xAtlas is easily scalable and enables execution and retraining with rapid development cycles. Generation of variant calls in VCF or gVCF format from BAM or CRAM alignments is accomplished in less than one CPU-hour per 30× short-read human whole genome. The retraining capabilities of xAtlas allow its core variant evaluation models to be optimized on new sample data and user-defined truth sets. SNV and indel calls can be obtained from xAtlas more than 40 times faster than with established methods while retaining the same accuracy. Availability: Freely available under a BSD 3-clause license at https://github.com/jfarek/[email protected] Supplementary information: Supplementary data are available at Bioinformatics online.

Comparison of single-nucleotide variants identified by Illumina and Oxford Nanopore technologies in the context of a potential outbreak of Shiga toxin–producing Escherichia coli

GigaScience ◽

10.1093/gigascience/giz104 ◽

2019 ◽

Vol 8 (8) ◽

Cited By ~ 8

Author(s):

David R Greig ◽

Claire Jenkins ◽

Saheer Gharbia ◽

Timothy J Dallman

Keyword(s):

Escherichia Coli ◽

Shiga Toxin ◽

Variant Calling ◽

Reference Database ◽

Sequencing Data ◽

Single Nucleotide Variants ◽

Single Nucleotide ◽

Oxford Nanopore ◽

Variant Filtering ◽

Oxford Nanopore Technologies

Abstract. Background: We aimed to compare Illumina and Oxford Nanopore Technologies (ONT) sequencing data from 2 isolates of Shiga toxin–producing Escherichia coli (STEC) O157:H7 to determine whether concordant single-nucleotide variants were identified and whether inference of relatedness was consistent with the 2 technologies. Results: For the Illumina workflow, the time from DNA extraction to availability of results was ∼40 hours, whereas with the ONT workflow, serotyping and Shiga toxin subtyping variant identification were available within 7 hours. After optimization of the ONT variant filtering, on average 95% of the discrepant positions between the technologies were accounted for by methylated positions found in the described 5-methylcytosine motif sequences, CC(A/T)GG. Of the few discrepant variants (6 and 7 differences for the 2 isolates) identified by the 2 technologies, it is likely that both methodologies contain false calls. Conclusions: Despite these discrepancies, Illumina and ONT sequences from the same case were placed on the same phylogenetic location against a dense reference database of STEC O157:H7 genomes sequenced using the Illumina workflow. Robust single-nucleotide polymorphism typing using MinION-based variant calling is possible, and we provide evidence that the 2 technologies can be used interchangeably to type STEC O157:H7 in a public health setting.

dv-trio: a family-based variant calling pipeline using DeepVariant

Bioinformatics ◽

10.1093/bioinformatics/btaa116 ◽

2020 ◽

Vol 36 (11) ◽

pp. 3549-3551 ◽

Cited By ~ 1

Author(s):

Eddie K K Ip ◽

Clinton Hadinata ◽

Joshua W K Ho ◽

Eleni Giannoulatou

Keyword(s):

Genetic Model ◽

Variant Calling ◽

Supplementary Information ◽

Next Generation Sequencing Data ◽

Sequencing Data ◽

Single Nucleotide Variants ◽

Mutation Discovery ◽

Family Trio ◽

Sequencing Studies ◽

Family Based

Abstract. Motivation: In 2018, Google published an innovative variant caller, DeepVariant, which converts pileups of sequence reads into images and uses a deep neural network to identify single-nucleotide variants and small insertions/deletions from next-generation sequencing data. This approach outperforms existing state-of-the-art tools. However, DeepVariant was designed to call variants within a single sample. In disease sequencing studies, the ability to examine a family trio (father-mother-affected child) provides greater power for disease mutation discovery. Results: To further improve DeepVariant’s variant calling accuracy in family-based sequencing studies, we have developed a family-based variant calling pipeline, dv-trio, which incorporates the trio information from the Mendelian genetic model into variant calling based on DeepVariant. Availability and implementation: dv-trio is available via an open source BSD3 license at GitHub (https://github.com/VCCRI/dv-trio/). Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.
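The abstract does not say how dv-trio incorporates the trio information. As background only, the sketch below shows the basic Mendelian consistency check that trio-aware analyses build on: a child's diploid genotype is consistent if one allele can come from the father and the other from the mother. The function name and genotype encoding (0 for reference, 1 for alternate) are assumptions for illustration, not dv-trio's implementation.

```python
from itertools import product


def is_mendelian_consistent(child, father, mother):
    """Check whether a child's genotype can be inherited from the two parents.

    Each genotype is a tuple of two alleles, e.g. (0, 1) for a heterozygote.
    Allele order within a genotype is ignored.
    """
    child_sorted = tuple(sorted(child))
    # Try every combination of one paternal allele and one maternal allele.
    return any(
        tuple(sorted((f, m))) == child_sorted
        for f, m in product(father, mother)
    )


# Heterozygous child, one heterozygous parent: consistent.
consistent = is_mendelian_consistent((0, 1), (0, 0), (0, 1))
# Heterozygous child, both parents homozygous reference:
# either a calling error or a candidate de novo mutation.
violation = is_mendelian_consistent((0, 1), (0, 0), (0, 0))
print(consistent, violation)
```

A failed check does not by itself say which of the three genotype calls is wrong, which is one reason trio callers typically weigh the evidence jointly rather than filtering on this test alone.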

EdiTyper: a high-throughput tool for analysis of targeted sequencing data from genome editing experiments

10.1101/2020.07.30.229088 ◽

2020 ◽

Cited By ~ 2

Author(s):

Alexandre Yahi ◽

Paul Hoffman ◽

Margot Brandt ◽

Pejman Mohammadi ◽

Nicholas P. Tatonetti ◽

...

Keyword(s):

Genome Editing ◽

High Throughput ◽

Gene Editing ◽

Simulated Data ◽

Targeted Sequencing ◽

Sequencing Data ◽

Single Nucleotide ◽

Sequencing Errors ◽

Command Line Tool ◽

Clonal Cell Lines

Abstract. Genome editing experiments are generating an increasing amount of targeted sequencing data with specific mutational patterns indicating the success of the experiments and genotypes of clonal cell lines. We present EdiTyper, a high-throughput command line tool specifically designed for analysis of sequencing data from polyclonal and monoclonal cell populations from CRISPR gene editing. It requires simple inputs of sequencing data and reference sequences, and provides comprehensive outputs including summary statistics, plots, and SAM/BAM alignments. Analysis of simulated data showed that EdiTyper is highly accurate for detection of both single nucleotide mutations and indels, robust to sequencing errors, as well as fast and scalable to large experimental batches. EdiTyper is available on GitHub (https://github.com/LappalainenLab/edityper) under the MIT license.

Accurate and scalable variant calling from single cell DNA sequencing data with ProSolo

Nature Communications ◽

10.1038/s41467-021-26938-w ◽

2021 ◽

Vol 12 (1) ◽

Author(s):

David Lähnemann ◽

Johannes Köster ◽

Ute Fischer ◽

Arndt Borkhardt ◽

Alice C. McHardy ◽

...

Keyword(s):

Dna Sequencing ◽

Single Cell ◽

Single Cells ◽

Variant Calling ◽

Sequencing Data ◽

Computationally Efficient ◽

Single Nucleotide Variants ◽

Efficient Manner ◽

Single Nucleotide ◽

Amplification Bias

Abstract. Accurate single cell mutational profiles can reveal genomic cell-to-cell heterogeneity. However, sequencing libraries suitable for genotyping require whole genome amplification, which introduces allelic bias and copy errors. The resulting data violate assumptions of variant callers developed for bulk sequencing. Thus, only dedicated models accounting for amplification bias and errors can provide accurate calls. We present ProSolo for calling single nucleotide variants from multiple displacement amplified (MDA) single cell DNA sequencing data. ProSolo probabilistically models a single cell jointly with a bulk sequencing sample and integrates all relevant MDA biases in a site-specific and scalable (because computationally efficient) manner. This achieves a higher accuracy in calling and genotyping single nucleotide variants in single cells in comparison to state-of-the-art tools and supports imputation of insufficiently covered genotypes, when downstream tools cannot handle missing data. Moreover, ProSolo implements the first approach to control the false discovery rate reliably and flexibly. ProSolo is implemented in an extendable framework, with code and usage at: https://github.com/prosolo/prosolo
