UMI-VarCal: a new UMI-based variant caller that efficiently improves low-frequency variant detection in paired-end sequencing NGS libraries

Vincent Sater; Pierre-Julien Viailly; Thierry Lecroq; Élise Prieur-Gaston; Élodie Bohers; Mathieu Viennot; Philippe Ruminy; Hélène Dauchel; Pierre Vera; Fabrice Jardin

doi:10.1093/bioinformatics/btaa053

Bioinformatics ◽

10.1093/bioinformatics/btaa053 ◽

2020 ◽

Vol 36 (9) ◽

pp. 2718-2724 ◽

Cited By ~ 5

Author(s):

Vincent Sater ◽

Pierre-Julien Viailly ◽

Thierry Lecroq ◽

Élise Prieur-Gaston ◽

Élodie Bohers ◽

...

Keyword(s):

Tumor Cells ◽

Low Frequency ◽

Variant Calling ◽

Pcr Amplification ◽

Targeted Sequencing ◽

Supplementary Information ◽

Sequencing Data ◽

Single Nucleotide Variants ◽

Background Error ◽

Low Frequencies

Abstract Motivation Next-generation sequencing has become the go-to standard method for the detection of single-nucleotide variants in tumor cells. The use of such technologies requires a PCR amplification step and a sequencing step, both of which introduce artifacts at very low frequencies. These artifacts are often confused with true low-frequency variants that can be found in tumor cells and cell-free DNA. The recent use of unique molecular identifiers (UMI) in targeted sequencing protocols has offered a trustworthy approach to filter out artefactual variants and accurately call low-frequency variants. However, the integration of UMI analysis in the variant calling process has led to tools that are significantly slower and more memory-consuming than raw-reads-based variant callers. Results We present UMI-VarCal, a UMI-based variant caller for targeted sequencing data with better sensitivity compared to other variant callers. Developed with performance in mind, UMI-VarCal is one of the few variant callers that do not rely on SAMtools for their pileup. Instead, at its core runs an in-house pileup algorithm specifically designed to handle the UMI tags in the reads. After the pileup, a Poisson statistical test is applied at every position to determine whether the frequency of the variant is significantly higher than the background error noise. Finally, an analysis of UMI tags is performed, and strand bias and homopolymer length filters are applied to achieve better accuracy. We illustrate the results obtained using UMI-VarCal through the sequencing of tumor samples and we show how UMI-VarCal is both faster and more sensitive than other publicly available solutions. Availability and implementation The entire pipeline is available at https://gitlab.com/vincent-sater/umi-varcal-master under MIT license. Supplementary information Supplementary data are available at Bioinformatics online.
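
The Poisson test mentioned in this abstract can be illustrated with a minimal sketch. It is not UMI-VarCal's code: the function names, the single background error rate and the 0.05 threshold are assumptions made for illustration. It asks how likely it is to see at least the observed number of alternate-allele reads if they were all background errors.

```python
import math


def poisson_sf(k: int, lam: float) -> float:
    """Return P(X >= k) for X ~ Poisson(lam).

    Computed as 1 - P(X <= k - 1). Suitable for moderate lam only:
    math.exp(-lam) underflows to 0.0 for very large lam.
    """
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i  # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def is_candidate(alt_count: int, depth: int,
                 background_error_rate: float, alpha: float = 0.05):
    """Test whether alt_count exceeds what background noise would explain.

    background_error_rate and alpha are hypothetical parameters; a real
    caller would estimate the error rate from the data or from controls.
    """
    expected_errors = depth * background_error_rate
    p_value = poisson_sf(alt_count, expected_errors)
    return p_value < alpha, p_value


if __name__ == "__main__":
    # 30 alternate reads at depth 10,000 with an assumed 0.1% error rate
    print(is_candidate(30, 10_000, 0.001))
    # 10 alternate reads: about what noise alone would produce
    print(is_candidate(10, 10_000, 0.001))
```

In practice the p-values would also need correction for the number of positions tested, and the abstract notes that UMI, strand bias and homopolymer filters follow this step.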

smCounter2: an accurate low-frequency variant caller for targeted sequencing data with unique molecular identifiers

10.1101/281659 ◽

2018 ◽

Cited By ~ 4

Author(s):

Chang Xu ◽

Xiujing Gu ◽

Raghavendra Padmanabhan ◽

Zhong Wu ◽

Quan Peng ◽

...

Keyword(s):

Low Frequency ◽

Variant Calling ◽

Targeted Sequencing ◽

Superior Performance ◽

Sequencing Data ◽

Background Error ◽

Fundamental Limits ◽

Sequencing Errors ◽

Coding Regions ◽

Improved Accuracy

Abstract Motivation Low-frequency DNA mutations are often confounded with technical artifacts from sample preparation and sequencing. With unique molecular identifiers (UMIs), most of the sequencing errors can be corrected. However, errors before UMI tagging, such as DNA polymerase errors during end-repair and the first PCR cycle, cannot be corrected with single-strand UMIs and impose fundamental limits to UMI-based variant calling. Results We developed smCounter2, a UMI-based variant caller for targeted sequencing data and an upgrade from the current version of smCounter. Compared to smCounter, smCounter2 features a lower detection limit of 0.5%, better overall accuracy (particularly in non-coding regions), a consistent threshold that can be applied to both deep and shallow sequencing runs, and easier use via a Docker image and code for read pre-processing. We benchmarked smCounter2 against several state-of-the-art UMI-based variant calling methods using multiple datasets and demonstrated smCounter2’s superior performance in detecting somatic variants. At the core of smCounter2 is a statistical test to determine whether the allele frequency of the putative variant is significantly above the background error rate, which was carefully modeled using an independent dataset. The improved accuracy in non-coding regions was mainly achieved using novel repetitive region filters that were specifically designed for UMI data. Availability The entire pipeline is available at https://github.com/qiaseq/qiaseq-dna under MIT license.

xAtlas: Scalable small variant calling across heterogeneous next-generation sequencing experiments

10.1101/295071 ◽

2018 ◽

Cited By ~ 7

Author(s):

Jesse Farek ◽

Daniel Hughes ◽

Adam Mansfield ◽

Olga Krasheninina ◽

Waleed Nasser ◽

...

Keyword(s):

Next Generation Sequencing ◽

Rapid Development ◽

Variant Calling ◽

Supplementary Information ◽

Data Generation ◽

Next Generation ◽

Sequencing Data ◽

Single Nucleotide Variants ◽

Homogeneous Sample ◽

Generation Sequencing

Abstract Motivation The rapid development of next-generation sequencing (NGS) technologies has lowered the barriers to genomic data generation, resulting in millions of samples sequenced across diverse experimental designs. The growing volume and heterogeneity of these sequencing data complicate the further optimization of methods for identifying DNA variation, especially considering that curated high-confidence variant call sets commonly used to evaluate these methods are generally developed by reference to results from the analysis of comparatively small and homogeneous sample sets. Results We have developed xAtlas, an application for the identification of single nucleotide variants (SNV) and small insertions and deletions (indels) in NGS data. xAtlas is easily scalable and enables execution and retraining with rapid development cycles. Generation of variant calls in VCF or gVCF format from BAM or CRAM alignments is accomplished in less than one CPU-hour per 30× short-read human whole-genome. The retraining capabilities of xAtlas allow its core variant evaluation models to be optimized on new sample data and user-defined truth sets. Obtaining SNV and indel calls from xAtlas can be achieved more than 40 times faster than established methods while retaining the same accuracy. Availability Freely available under a BSD 3-clause license at https://github.com/jfarek/[email protected] Supplementary information Supplementary data are available at Bioinformatics online.

UMI-Gen: a UMI-based reads simulator for variant calling evaluation in paired-end sequencing NGS libraries

10.1101/2020.04.22.027532 ◽

2020 ◽

Author(s):

Vincent Sater ◽

Pierre-Julien Viailly ◽

Thierry Lecroq ◽

Philippe Ruminy ◽

Caroline Bérard ◽

...

Keyword(s):

Variant Calling ◽

Copy Number Variations ◽

Biological Data ◽

Single Nucleotide Variants ◽

Background Error ◽

Single Nucleotide ◽

Low Frequencies ◽

Paired End Sequencing ◽

Very High ◽

Generation Sequencing

Abstract Motivation With Next Generation Sequencing becoming more affordable every year, NGS technologies have asserted themselves as the fastest and most reliable way to detect Single Nucleotide Variants (SNV) and Copy Number Variations (CNV) in cancer patients. These technologies can be used to sequence DNA at very high depths, allowing the detection of abnormalities present in tumor cells at very low frequencies. Many different variant callers are publicly available and usually do a good job at calling variants. However, when frequencies drop below 1%, the specificity of these tools suffers greatly, as true variants at very low frequencies can easily be confused with sequencing or PCR artifacts. The recent use of Unique Molecular Identifiers (UMI) in NGS experiments has offered a way to accurately separate true variants from artifacts. UMI-based variant callers are slowly replacing raw-reads-based variant callers as the standard method for accurate detection of variants at very low frequencies. However, the benchmarking reported in the tools' publications is usually performed on real biological data in which the true variants are not known, making it difficult to assess their accuracy. Results We present UMI-Gen, a UMI-based reads simulator for targeted sequencing paired-end data. UMI-Gen generates reference reads covering the targeted regions at a user-customizable depth. After that, using a number of control files, it estimates the background error rate at each position and then modifies the generated reads to mimic real biological data. Finally, it inserts real variants in the reads from a list provided by the user. Availability The entire pipeline is available at https://gitlab.com/vincent-sater/umigen-master under MIT license. Contact [email protected]
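
The three simulation steps described in this abstract (reference reads at a chosen depth, background noise, inserted variants) can be sketched as a toy example. This is not UMI-Gen's implementation: it uses one global error rate instead of per-position rates estimated from control files, it ignores UMIs and read pairing, and all names and parameters are hypothetical.

```python
import random

BASES = "ACGT"


def simulate_reads(reference: str, depth: int, error_rate: float,
                   variants: dict, seed: int = 0) -> list:
    """Simulate `depth` full-length reads over `reference`.

    variants maps a 0-based position to (alt_base, allele_frequency).
    error_rate is a single assumed background error rate per base.
    """
    rng = random.Random(seed)  # seeded for reproducible simulations
    reads = []
    for _ in range(depth):
        read = list(reference)
        # Step 2: sprinkle background errors over the reference read
        for pos, base in enumerate(read):
            if rng.random() < error_rate:
                read[pos] = rng.choice([b for b in BASES if b != base])
        # Step 3: insert each known variant at its requested frequency
        for pos, (alt, freq) in variants.items():
            if rng.random() < freq:
                read[pos] = alt
        reads.append("".join(read))
    return reads


if __name__ == "__main__":
    reads = simulate_reads("ACGTACGTAC", depth=1000, error_rate=0.001,
                           variants={4: ("G", 0.01)})
    observed = sum(r[4] == "G" for r in reads) / len(reads)
    print(f"observed frequency of the inserted variant: {observed:.3f}")
```

Because the inserted variants are known in advance, the output of a variant caller on such reads can be scored against a ground truth, which is the evaluation gap the abstract describes.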

Bivartect: accurate and memory-saving breakpoint detection by direct read comparison

Bioinformatics ◽

10.1093/bioinformatics/btaa059 ◽

2020 ◽

Vol 36 (9) ◽

pp. 2725-2730

Author(s):

Keisuke Shimmura ◽

Yuki Kato ◽

Yukio Kawahara

Keyword(s):

Genome Editing ◽

High Throughput Sequencing ◽

Variant Calling ◽

Simulated Data ◽

Supplementary Information ◽

Sequencing Data ◽

Single Nucleotide Variants ◽

Single Node ◽

Single Nucleotide ◽

Target Sites

Abstract Motivation Genetic variant calling with high-throughput sequencing data has been recognized as a useful tool for better understanding of disease mechanism and detection of potential off-target sites in genome editing. Since most of the variant calling algorithms rely on initial mapping onto a reference genome and tend to predict many variant candidates, variant calling remains challenging in terms of predicting variants with low false positives. Results Here we present Bivartect, a simple yet versatile variant caller based on direct comparison of short sequence reads between normal and mutated samples. Bivartect can detect not only single nucleotide variants but also insertions/deletions, inversions and their complexes. Bivartect achieves high predictive performance with an elaborate memory-saving mechanism, which allows Bivartect to run on a computer with a single node for analyzing small omics data. Tests with simulated benchmark and real genome-editing data indicate that Bivartect was comparable to state-of-the-art variant callers in positive predictive value for detection of single nucleotide variants, even though it yielded a substantially small number of candidates. These results suggest that Bivartect, a reference-free approach, will contribute to the identification of germline mutations as well as off-target sites introduced during genome editing with high accuracy. Availability and implementation Bivartect is implemented in C++ and available along with in silico simulated data at https://github.com/ykat0/bivartect. Supplementary information Supplementary data are available at Bioinformatics online.

dv-trio: a family-based variant calling pipeline using DeepVariant

Bioinformatics ◽

10.1093/bioinformatics/btaa116 ◽

2020 ◽

Vol 36 (11) ◽

pp. 3549-3551 ◽

Cited By ~ 1

Author(s):

Eddie K K Ip ◽

Clinton Hadinata ◽

Joshua W K Ho ◽

Eleni Giannoulatou

Keyword(s):

Genetic Model ◽

Variant Calling ◽

Supplementary Information ◽

Next Generation Sequencing Data ◽

Sequencing Data ◽

Single Nucleotide Variants ◽

Mutation Discovery ◽

Family Trio ◽

Sequencing Studies ◽

Family Based

Abstract Motivation In 2018, Google published an innovative variant caller, DeepVariant, which converts pileups of sequence reads into images and uses a deep neural network to identify single-nucleotide variants and small insertion/deletions from next-generation sequencing data. This approach outperforms existing state-of-the-art tools. However, DeepVariant was designed to call variants within a single sample. In disease sequencing studies, the ability to examine a family trio (father-mother-affected child) provides greater power for disease mutation discovery. Results To further improve DeepVariant’s variant calling accuracy in family-based sequencing studies, we have developed a family-based variant calling pipeline, dv-trio, which incorporates the trio information from the Mendelian genetic model into variant calling based on DeepVariant. Availability and implementation dv-trio is available via an open source BSD3 license at GitHub (https://github.com/VCCRI/dv-trio/). Contact [email protected] Supplementary information Supplementary data are available at Bioinformatics online.
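
The trio information from the Mendelian genetic model can be illustrated with a minimal consistency check. This sketch does not reproduce how dv-trio incorporates trio information. It only shows the underlying rule that a child inherits one allele from each parent, and the function name and genotype encoding are assumptions.

```python
from itertools import product


def is_mendelian_consistent(father: tuple, mother: tuple, child: tuple) -> bool:
    """Check a diploid genotype against Mendelian inheritance.

    Genotypes are pairs of allele indices, e.g. (0, 1) for a heterozygous
    call. De novo mutations and sex chromosomes are ignored here.
    """
    # Every child genotype obtainable from one paternal and one maternal allele
    possible = {tuple(sorted(pair)) for pair in product(father, mother)}
    return tuple(sorted(child)) in possible


if __name__ == "__main__":
    print(is_mendelian_consistent((0, 0), (0, 1), (0, 1)))  # inherited from mother
    print(is_mendelian_consistent((0, 0), (0, 1), (1, 1)))  # needs two alt alleles
```

A call that fails this check in a trio is either a genotyping error in one of the three samples or a candidate de novo mutation, which is why joint consideration of the family can improve accuracy.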

Moss enables high sensitivity single-nucleotide variant calling from multiple bulk DNA tumor samples

Nature Communications ◽

10.1038/s41467-021-22466-9 ◽

2021 ◽

Vol 12 (1) ◽

Author(s):

Chuanyi Zhang ◽

Mohammed El-Kebir ◽

Idoia Ochoa

Keyword(s):

Cancer Genomics ◽

Low Frequency ◽

Variant Calling ◽

High Sensitivity ◽

Single Sample ◽

Sequencing Data ◽

Single Nucleotide Variants ◽

Additional Time ◽

Single Nucleotide ◽

Multiple Samples

Abstract Intra-tumor heterogeneity renders the identification of somatic single-nucleotide variants (SNVs) a challenging problem. In particular, low-frequency SNVs are hard to distinguish from sequencing artifacts. While the increasing availability of multi-sample tumor DNA sequencing data holds the potential for more accurate variant calling, there is a lack of high-sensitivity multi-sample SNV callers that utilize these data. Here we report Moss, a method to identify low-frequency SNVs that recur in multiple sequencing samples from the same tumor. Moss provides any existing single-sample SNV caller the ability to support multiple samples with little additional time overhead. We demonstrate that Moss improves recall while maintaining high precision in a simulated dataset. On multi-sample hepatocellular carcinoma, acute myeloid leukemia and colorectal cancer datasets, Moss identifies new low-frequency variants that meet manual review criteria and are consistent with the tumor’s mutational signature profile. In addition, Moss detects the presence of variants in more samples of the same tumor than reported by the single-sample caller. Moss’ improved sensitivity in SNV calling will enable more detailed downstream analyses in cancer genomics.

Increased yields of duplex sequencing data by a series of quality control tools

NAR Genomics and Bioinformatics ◽

10.1093/nargab/lqab002 ◽

2021 ◽

Vol 3 (1) ◽

Author(s):

Gundula Povysil ◽

Monika Heinzl ◽

Renato Salazar ◽

Nicholas Stoler ◽

Anton Nekrutenko ◽

...

Keyword(s):

Low Frequency ◽

Variant Calling ◽

Data Loss ◽

Sequencing Data ◽

Bioinformatics Pipeline ◽

Consensus Sequences ◽

Sequencing Errors ◽

Data Output ◽

Reverse Strand ◽

Duplex Sequencing

Abstract Duplex sequencing is currently the most reliable method to identify ultra-low frequency DNA variants by grouping sequence reads derived from the same DNA molecule into families with information on the forward and reverse strand. However, only a small proportion of reads are assembled into duplex consensus sequences (DCS), and reads with potentially valuable information are discarded at different steps of the bioinformatics pipeline, especially reads without a family. We developed a bioinformatics toolset that analyses the tag and family composition in order to understand data loss and implement modifications that maximize the data output for variant calling. Specifically, our tools show that tags contain polymerase chain reaction and sequencing errors that contribute to data loss and lower DCS yields. Our tools also identified chimeras, which likely reflect barcode collisions. Finally, we developed a tool that re-examines variant calls from raw reads and provides summary data that categorize the confidence level of a variant call using a tier-based system. With this tool, we can include reads without a family and check the reliability of the call, which substantially increases the sequencing depth for variant calling, a particularly important advantage for low-input samples or low-coverage regions.
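
The idea of grouping reads that share a tag into families and collapsing each family into a consensus can be sketched as follows. This is a simplified single-strand illustration, not the authors' toolset: it assumes reads in a family are already aligned and of equal length, it does not pair forward and reverse strand families into duplex consensus sequences, and the names and the majority threshold are hypothetical.

```python
from collections import Counter, defaultdict


def build_families(tagged_reads):
    """Group (tag, sequence) pairs into families keyed by tag."""
    families = defaultdict(list)
    for tag, seq in tagged_reads:
        families[tag].append(seq)
    return dict(families)


def consensus(seqs, min_fraction: float = 0.5) -> str:
    """Per-position majority vote over equal-length sequences.

    A position becomes 'N' when no base is supported by more than
    min_fraction of the reads in the family.
    """
    out = []
    for column in zip(*seqs):
        base, count = Counter(column).most_common(1)[0]
        out.append(base if count / len(column) > min_fraction else "N")
    return "".join(out)


if __name__ == "__main__":
    reads = [("AAT", "ACGT"), ("AAT", "ACGT"), ("AAT", "ACTT"), ("GGC", "TTTT")]
    for tag, members in build_families(reads).items():
        print(tag, len(members), consensus(members))
```

A sequencing error in the tag itself would split one family into two smaller ones under this exact-match grouping, which is one way tag errors lead to the data loss the abstract describes.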

Numt identification and removal with RtN!

Bioinformatics ◽

10.1093/bioinformatics/btaa642 ◽

2020 ◽

Vol 36 (20) ◽

pp. 5115-5116 ◽

Cited By ~ 2

Author(s):

August E Woerner ◽

Jennifer Churchill Cihlar ◽

Utpal Smart ◽

Bruce Budowle

Keyword(s):

Mitochondrial Genome ◽

Massively Parallel Sequencing ◽

Sequence Similarity ◽

Variant Calling ◽

Supplementary Information ◽

Mitochondrial Genomes ◽

Sequencing Data ◽

Read Mapping ◽

Genome Data ◽

Mitochondrial Sequences

Abstract Motivation Assays in mitochondrial genomics rely on accurate read mapping and variant calling. However, there are known and unknown nuclear paralogs that have fundamentally different genetic properties than that of the mitochondrial genome. Such paralogs complicate the interpretation of mitochondrial genome data and confound variant calling. Results Remove the Numts! (RtN!) was developed to categorize reads from massively parallel sequencing data not based on the expected properties and sequence identities of paralogous nuclear encoded mitochondrial sequences, but instead using sequence similarity to a large database of publicly available mitochondrial genomes. RtN! removes low-level sequencing noise and mitochondrial paralogs while not impacting variant calling, while competing methods were shown to remove true variants from mitochondrial mixtures. Availability and implementation https://github.com/Ahhgust/RtN Supplementary information Supplementary data are available at Bioinformatics online.

neoepiscope improves neoepitope prediction with multivariant phasing

Bioinformatics ◽

10.1093/bioinformatics/btz653 ◽

2019 ◽

Vol 36 (3) ◽

pp. 713-720 ◽

Cited By ~ 5

Author(s):

Mary A Wood ◽

Austin Nguyen ◽

Adam J Struck ◽

Kyle Ellrott ◽

Abhinav Nellore ◽

...

Keyword(s):

False Negative ◽

Supplementary Information ◽

Supplementary File ◽

Sequencing Data ◽

Single Nucleotide Variants ◽

Single Nucleotide ◽

Somatic Variant ◽

Negative Results ◽

Multiple Datasets ◽

False Negative Results

Abstract Motivation The vast majority of tools for neoepitope prediction from DNA sequencing of complementary tumor and normal patient samples do not consider germline context or the potential for the co-occurrence of two or more somatic variants on the same mRNA transcript. Without consideration of these phenomena, existing approaches are likely to produce both false-positive and false-negative results, resulting in an inaccurate and incomplete picture of the cancer neoepitope landscape. We developed neoepiscope chiefly to address this issue for single nucleotide variants (SNVs) and insertions/deletions (indels). Results Herein, we illustrate how germline and somatic variant phasing affects neoepitope prediction across multiple datasets. We estimate that up to ∼5% of neoepitopes arising from SNVs and indels may require variant phasing for their accurate assessment. neoepiscope is performant, flexible and supports several major histocompatibility complex binding affinity prediction tools. Availability and implementation neoepiscope is available on GitHub at https://github.com/pdxgx/neoepiscope under the MIT license. Scripts for reproducing results described in the text are available at https://github.com/pdxgx/neoepiscope-paper under the MIT license. Additional data from this study, including summaries of variant phasing incidence and benchmarking wallclock times, are available in Supplementary Files 1, 2 and 3. Supplementary File 1 contains Supplementary Table 1, Supplementary Figures 1 and 2, and descriptions of Supplementary Tables 2–8. Supplementary File 2 contains Supplementary Tables 2–6 and 8. Supplementary File 3 contains Supplementary Table 7. 
Raw sequencing data used for the analyses in this manuscript are available from the Sequence Read Archive under accessions PRJNA278450, PRJNA312948, PRJNA307199, PRJNA343789, PRJNA357321, PRJNA293912, PRJNA369259, PRJNA305077, PRJNA306070, PRJNA82745 and PRJNA324705; from the European Genome-phenome Archive under accessions EGAD00001004352 and EGAD00001002731; and by direct request to the authors. Supplementary information Supplementary data are available at Bioinformatics online.
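
Why the co-occurrence of variants on the same transcript matters can be shown with a toy example. It is not neoepiscope's code: the codon, the two variants and the four-entry codon table are chosen purely for illustration. Two SNVs falling in the same codon encode a different amino acid when they are on the same haplotype than either does alone.

```python
# Subset of the standard genetic code, just the codons used below
CODON_TABLE = {"GAA": "E", "AAA": "K", "GTA": "V", "ATA": "I"}


def apply_variants(codon: str, variants) -> str:
    """Apply (offset, alt_base) substitutions to a three-base codon."""
    bases = list(codon)
    for offset, alt in variants:
        bases[offset] = alt
    return "".join(bases)


def translate(codon: str) -> str:
    """Translate one codon using the small illustrative table."""
    return CODON_TABLE[codon]


if __name__ == "__main__":
    reference = "GAA"
    snv_a = (0, "A")  # G>A at the first codon position
    snv_b = (1, "T")  # A>T at the second codon position
    print("reference:", translate(reference))
    print("SNV a alone:", translate(apply_variants(reference, [snv_a])))
    print("SNV b alone:", translate(apply_variants(reference, [snv_b])))
    print("both in phase:", translate(apply_variants(reference, [snv_a, snv_b])))
```

A predictor that treats each variant independently would report two peptides that are never produced if both variants sit on the same haplotype, and would miss the one that is.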

ABEMUS: platform-specific and data-informed detection of somatic SNVs in cfDNA

Bioinformatics ◽

10.1093/bioinformatics/btaa016 ◽

2020 ◽

Vol 36 (9) ◽

pp. 2665-2674

Author(s):

Nicola Casiraghi ◽

Francesco Orlando ◽

Yari Ciani ◽

Jenny Xiang ◽

Andrea Sboner ◽

...

Keyword(s):

Cancer Patients ◽

R Package ◽

Circulating Tumor Dna ◽

Supplementary Information ◽

Sequencing Error ◽

Sequencing Data ◽

Single Nucleotide Variants ◽

Liquid Biopsies ◽

Non Invasive ◽

Cross Platform

Abstract Motivation The use of liquid biopsies for cancer patients enables the non-invasive tracking of treatment response and tumor dynamics through single or serial blood draw tests. Next-generation sequencing assays allow for the simultaneous interrogation of extended sets of somatic single-nucleotide variants (SNVs) in circulating cell-free DNA (cfDNA), a mixture of DNA molecules originating both from normal and tumor tissue cells. However, low circulating tumor DNA (ctDNA) fractions together with sequencing background noise and potential tumor heterogeneity challenge the ability to confidently call SNVs. Results We present a computational methodology, called Adaptive Base Error Model in Ultra-deep Sequencing data (ABEMUS), which combines platform-specific genetic knowledge and empirical signal to readily detect and quantify somatic SNVs in cfDNA. We tested the capability of our method to analyze data generated using different platforms with distinct sequencing error properties and we compared the performance of ABEMUS with other popular SNV callers on both synthetic and real cancer patient sequencing data. Results show that ABEMUS performs better in most of the tested conditions, proving its reliability in calling somatic SNVs with low variant allele frequencies in plasma samples with low ctDNA levels. Availability and implementation ABEMUS is cross-platform and can be installed as an R package. The source code is maintained on GitHub at http://github.com/cibiobcg/abemus, and it is also available at the official CRAN R repository. Supplementary information Supplementary data are available at Bioinformatics online.
