smCounter2: an accurate low-frequency variant caller for targeted sequencing data with unique molecular identifiers

2018 ◽  
Author(s):  
Chang Xu ◽  
Xiujing Gu ◽  
Raghavendra Padmanabhan ◽  
Zhong Wu ◽  
Quan Peng ◽  
...  

Abstract Motivation Low-frequency DNA mutations are often confounded with technical artifacts from sample preparation and sequencing. With unique molecular identifiers (UMIs), most of the sequencing errors can be corrected. However, errors before UMI tagging, such as DNA polymerase errors during end-repair and the first PCR cycle, cannot be corrected with single-strand UMIs and impose fundamental limits on UMI-based variant calling. Results We developed smCounter2, a UMI-based variant caller for targeted sequencing data and an upgrade from the current version of smCounter. Compared to smCounter, smCounter2 features a lower detection limit of 0.5%, better overall accuracy (particularly in non-coding regions), a consistent threshold that can be applied to both deep and shallow sequencing runs, and easier use via a Docker image and code for read pre-processing. We benchmarked smCounter2 against several state-of-the-art UMI-based variant calling methods using multiple datasets and demonstrated smCounter2’s superior performance in detecting somatic variants. At the core of smCounter2 is a statistical test to determine whether the allele frequency of the putative variant is significantly above the background error rate, which was carefully modeled using an independent dataset. The improved accuracy in non-coding regions was mainly achieved using novel repetitive region filters that were specifically designed for UMI data. Availability The entire pipeline is available at https://github.com/qiaseq/qiaseq-dna under MIT license.
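The UMI error-correction idea described in this abstract can be sketched as a per-family majority vote. This is a minimal illustration, not smCounter2's implementation: the function name `umi_consensus`, the thresholds `min_family_size` and `min_agreement`, and the toy reads are all assumptions, and the reads are assumed to be pre-aligned and of equal length.

```python
from collections import Counter, defaultdict


def umi_consensus(reads, min_family_size=2, min_agreement=0.7):
    """Collapse (umi, sequence) pairs into one consensus sequence per UMI.

    Hypothetical helper: assumes reads sharing a UMI are already aligned
    to the same start and have equal length.
    """
    families = defaultdict(list)
    for umi, seq in reads:
        families[umi].append(seq)

    consensus = {}
    for umi, seqs in families.items():
        if len(seqs) < min_family_size:
            continue  # a single read cannot out-vote its own errors
        bases = []
        for column in zip(*seqs):
            base, count = Counter(column).most_common(1)[0]
            # Mask positions where the family disagrees too much.
            bases.append(base if count / len(seqs) >= min_agreement else "N")
        consensus[umi] = "".join(bases)
    return consensus


reads = [
    ("AACG", "ACGTAC"),
    ("AACG", "ACGTAC"),
    ("AACG", "ACGTAC"),
    ("AACG", "ACGTAC"),
    ("AACG", "ACTTAC"),  # sequencing error in one of five reads: voted out
    ("GGTA", "ACATAC"),
    ("GGTA", "ACATAC"),  # error before tagging: every copy carries it
    ("TTTT", "ACGTAC"),  # singleton family: dropped
]
print(umi_consensus(reads))
```

In the toy data the `GGTA` family keeps its `A` because every read in the family agrees, which is the limitation the abstract attributes to errors introduced before UMI tagging.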

2020 ◽  
Vol 36 (9) ◽  
pp. 2718-2724 ◽  
Author(s):  
Vincent Sater ◽  
Pierre-Julien Viailly ◽  
Thierry Lecroq ◽  
Élise Prieur-Gaston ◽  
Élodie Bohers ◽  
...  

Abstract Motivation Next-generation sequencing has become the go-to standard method for the detection of single-nucleotide variants in tumor cells. The use of such technologies requires a PCR amplification step and a sequencing step, steps in which artifacts are introduced at very low frequencies. These artifacts are often confused with true low-frequency variants that can be found in tumor cells and cell-free DNA. The recent use of unique molecular identifiers (UMI) in targeted sequencing protocols has offered a trustworthy approach to filter out artefactual variants and accurately call low-frequency variants. However, the integration of UMI analysis in the variant calling process led to developing tools that are significantly slower and more memory consuming than raw-reads-based variant callers. Results We present UMI-VarCal, a UMI-based variant caller for targeted sequencing data with better sensitivity compared to other variant callers. Being developed with performance in mind, UMI-VarCal stands out from the crowd by being one of the few variant callers that do not rely on SAMtools to do their pileup. Instead, at its core runs an innovative homemade pileup algorithm specifically designed to treat the UMI tags in the reads. After the pileup, a Poisson statistical test is applied at every position to determine if the frequency of the variant is significantly higher than the background error noise. Finally, an analysis of UMI tags is performed, a strand bias and a homopolymer length filter are applied to achieve better accuracy. We illustrate the results obtained using UMI-VarCal through the sequencing of tumor samples and we show how UMI-VarCal is both faster and more sensitive than other publicly available solutions. Availability and implementation The entire pipeline is available at https://gitlab.com/vincent-sater/umi-varcal-master under MIT license. Supplementary information Supplementary data are available at Bioinformatics online.
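The Poisson test mentioned in this abstract can be sketched as an upper-tail probability on the count of alternate-allele observations. The abstract does not say how UMI-VarCal estimates the background noise or which significance level it uses, so `error_rate`, `alpha` and the function names below are hypothetical.

```python
import math


def poisson_upper_tail(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam), with lam > 0."""
    if k <= 0:
        return 1.0

    def pmf(i):
        # Evaluate each term in log space to avoid overflow in lam**i / i!.
        return math.exp(-lam + i * math.log(lam) - math.lgamma(i + 1))

    if k <= lam:
        # The tail is large here: subtract the short lower sum from 1.
        return max(0.0, 1.0 - sum(pmf(i) for i in range(k)))
    # The tail is small here: add decreasing terms until they stop mattering.
    total, i = 0.0, k
    while True:
        term = pmf(i)
        total += term
        if term <= total * 1e-16:
            break
        i += 1
    return total


def call_position(alt_count, depth, error_rate, alpha=1e-6):
    """Flag a position when alt_count exceeds what background noise explains.

    error_rate and alpha are assumed values, not UMI-VarCal defaults.
    """
    p_value = poisson_upper_tail(alt_count, depth * error_rate)
    return p_value < alpha, p_value


print(call_position(25, 5000, 1e-3))  # far above the expected 5 noisy reads
print(call_position(7, 5000, 1e-3))   # compatible with background noise
```

A per-position test like this is only the first stage in the abstract's description; the UMI tag analysis, strand-bias filter and homopolymer filter that follow are not modeled here.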


2021 ◽  
Vol 3 (1) ◽  
Author(s):  
Gundula Povysil ◽  
Monika Heinzl ◽  
Renato Salazar ◽  
Nicholas Stoler ◽  
Anton Nekrutenko ◽  
...  

Abstract Duplex sequencing is currently the most reliable method to identify ultra-low frequency DNA variants by grouping sequence reads derived from the same DNA molecule into families with information on the forward and reverse strand. However, only a small proportion of reads are assembled into duplex consensus sequences (DCS), and reads with potentially valuable information are discarded at different steps of the bioinformatics pipeline, especially reads without a family. We developed a bioinformatics toolset that analyses the tag and family composition in order to understand data loss and implement modifications that maximize the data output for variant calling. Specifically, our tools show that tags contain polymerase chain reaction and sequencing errors that contribute to data loss and lower DCS yields. Our tools also identified chimeras, which likely reflect barcode collisions. Finally, we also developed a tool that re-examines variant calls from raw reads and provides different summary data that categorize the confidence level of a variant call by a tier-based system. With this tool, we can include reads without a family and check the reliability of the call, which substantially increases the sequencing depth for variant calling, a particularly important advantage for low-input samples or low-coverage regions.
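The family grouping and the data loss described in this abstract can be sketched as follows. This is an illustrative simplification, not the authors' toolset: the strand labels `ab` and `ba`, the tag names and the function names are assumptions, and reads are assumed to be pre-aligned, equal in length and already oriented to the reference.

```python
from collections import Counter, defaultdict


def single_strand_consensus(seqs):
    """Majority base per column; assumes equal-length, pre-aligned reads."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))


def duplex_consensus(reads):
    """reads: iterable of (tag, strand, sequence), strand is 'ab' or 'ba'.

    Returns (dcs, unpaired): one duplex consensus per tag seen on both
    strands, and the tags seen on one strand only, which yield no DCS.
    """
    families = defaultdict(lambda: defaultdict(list))
    for tag, strand, seq in reads:
        families[tag][strand].append(seq)

    dcs, unpaired = {}, []
    for tag, strands in families.items():
        if "ab" not in strands or "ba" not in strands:
            unpaired.append(tag)
            continue
        ab = single_strand_consensus(strands["ab"])
        ba = single_strand_consensus(strands["ba"])
        # Keep a base only where both strands agree; mask the rest.
        dcs[tag] = "".join(x if x == y else "N" for x, y in zip(ab, ba))
    return dcs, unpaired


reads = [
    ("T1", "ab", "GATTACA"), ("T1", "ab", "GATTACA"),
    ("T1", "ba", "GATTACA"), ("T1", "ba", "GATTACA"),
    ("T2", "ab", "GACTACA"), ("T2", "ab", "GACTACA"),  # one-strand artifact
    ("T2", "ba", "GATTACA"), ("T2", "ba", "GATTACA"),
    ("T3", "ab", "GATTACA"), ("T3", "ab", "GATTACA"),  # partner strand missing
]
dcs, unpaired = duplex_consensus(reads)
print(dcs)
print(unpaired)
```

Tag `T3` shows the kind of loss the abstract describes: its reads are valid but never reach a DCS because no partner-strand family exists.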


2019 ◽  
Author(s):  
Gundula Povysil ◽  
Monika Heinzl ◽  
Renato Salazar ◽  
Nicholas Stoler ◽  
Anton Nekrutenko ◽  
...  

Abstract Duplex sequencing is currently the most reliable method to identify ultra-low frequency DNA variants by grouping sequence reads derived from the same DNA molecule into families with information on the forward and reverse strand. However, only a small proportion of reads are assembled into duplex consensus sequences (DCS), and reads with potentially valuable information are discarded at different steps of the bioinformatics pipeline, especially reads without a family. We developed a bioinformatics tool-set that analyses the tag and family composition in order to understand data loss and implement modifications that maximize the data output for variant calling. Specifically, our tools show that tags contain PCR and sequencing errors that contribute to data loss and lower DCS yields. Our tools also identified chimeras, which result in unpaired families that do not form DCS. Finally, we also developed a tool called Variant Analyzer that re-examines variant calls from raw reads and provides different summary data that categorize the confidence level of a variant call by a tier-based system. We demonstrate that this tool identified false positive variants tagged by the tier-based classification. Furthermore, with this tool we can include reads without a family and check the reliability of the call, which substantially increases the sequencing depth for variant calling, a particularly important advantage for low-input samples or low-coverage regions.


2019 ◽  
Vol 3 (4) ◽  
pp. 399-409 ◽  
Author(s):  
Brandon Jew ◽  
Jae Hoon Sul

Abstract Next-generation sequencing has allowed genetic studies to collect genome sequencing data from a large number of individuals. However, raw sequencing data are not usually interpretable due to fragmentation of the genome and technical biases; therefore, analysis of these data requires many computational approaches. First, for each sequenced individual, sequencing data are aligned and further processed to account for technical biases. Then, variant calling is performed to obtain information on the positions of genetic variants and their corresponding genotypes. Quality control (QC) is applied to identify individuals and genetic variants with sequencing errors. These procedures are necessary to generate accurate variant calls from sequencing data, and many computational approaches have been developed for these tasks. This review will focus on current widely used approaches for variant calling and QC.


2016 ◽  
Vol 6 (1) ◽  
Author(s):  
Santosh Anand ◽  
Eleonora Mangano ◽  
Nadia Barizzone ◽  
Roberta Bordoni ◽  
Melissa Sorosina ◽  
...  

Abstract Sequencing a large number of individuals, which is often needed for population genetics studies, is still economically challenging despite the falling costs of Next Generation Sequencing (NGS). Pool-seq is an alternative cost- and time-effective option in which DNA from several individuals is pooled for sequencing. However, pooling of DNA creates new problems and challenges for accurate variant calling and allele frequency (AF) estimation. In particular, sequencing errors are confounded with the alleles present at low frequency in the pools, possibly giving rise to false positive variants. We sequenced 996 individuals in 83 pools (12 individuals/pool) in a targeted re-sequencing experiment. We show that Pool-seq AFs are robust and reliable by comparing them with public variant databases and in-house SNP-genotyping data of individual subjects of pools. Furthermore, we propose a simple filtering guideline for the removal of spurious variants based on the Kolmogorov-Smirnov statistical test. We experimentally validated our filters by comparing Pool-seq to individual sequencing data, showing that the filters remove most of the false variants while retaining the majority of true variants. The proposed guideline is fairly generic in nature and could be easily applied in other Pool-seq experiments.
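The pool design in this abstract fixes which allele frequencies a pool can truly contain, which is why low-frequency errors are hard to tell apart from rare alleles. The arithmetic below is a sketch that assumes diploid individuals contributing equal amounts of DNA; neither assumption is stated in the abstract, and the function name is hypothetical.

```python
from fractions import Fraction


def pool_allele_frequencies(individuals_per_pool, ploidy=2):
    """All allele frequencies a pool can truly contain, as exact fractions.

    Assumes every individual contributes the same amount of DNA.
    """
    chromosomes = individuals_per_pool * ploidy
    return [Fraction(k, chromosomes) for k in range(chromosomes + 1)]


total_individuals = 83 * 12            # pools x individuals per pool
freqs = pool_allele_frequencies(12)    # 24 chromosomes per pool
lowest_true_af = freqs[1]              # one heterozygous carrier in the pool

print(total_individuals)
print(lowest_true_af, float(lowest_true_af))
```

Under these assumptions a single heterozygous carrier appears at 1/24, a little over 4%, so a true variant observed well below that level in one pool would need another explanation, such as unequal DNA contribution or sequencing error.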


2021 ◽  
Vol 12 ◽  
Author(s):  
Carlos Farkas ◽  
Andy Mella ◽  
Maxime Turgeon ◽  
Jody J. Haigh

An unprecedented amount of SARS-CoV-2 sequencing has been performed; however, novel bioinformatic tools to cope with and process these large datasets are needed. Here, we have devised a bioinformatic pipeline that inputs SARS-CoV-2 genome sequencing in FASTA/FASTQ format and outputs a single Variant Calling Format file that can be processed to obtain variant annotations and perform downstream population genetic testing. As proof of concept, we have analyzed over 229,000 SARS-CoV-2 viral sequences up until November 30, 2020. We have identified over 39,000 variants worldwide with increased polymorphisms, spanning the ORF3a gene as well as the 3′ untranslated (UTR) regions, specifically in the conserved stem loop region of SARS-CoV-2, which is accumulating greater observed viral diversity relative to chance variation. Our analysis pipeline has also discovered the existence of SARS-CoV-2 hypermutation with low frequency (in less than 2% of genomes), likely arising through host immune responses and not due to sequencing errors. Among annotated non-sense variants with a population frequency over 1%, recurrent inactivation of the ORF8 gene was found. This was found to be present in the newly identified B.1.1.7 SARS-CoV-2 lineage that originated in the United Kingdom. Almost all VOC-containing genomes possess one stop codon in the ORF8 gene (Q27∗); however, 13% of these genomes also contain another stop codon (K68∗), suggesting that ORF8 loss does not interfere with SARS-CoV-2 spread and may play a role in its increased virulence. We have developed this computational pipeline to assist researchers in the rapid analysis and characterization of SARS-CoV-2 variation.


2017 ◽  
Author(s):  
Shibing Deng ◽  
Maruja Lira ◽  
Stephen Huang ◽  
Kai Wang ◽  
Crystal Valdez ◽  
...  

Abstract The use of ultra-deep, next generation sequencing of circulating tumor DNA (ctDNA) holds great promise for early detection of cancer as well as a tool for monitoring disease progression and therapeutic responses. However, the low abundance of ctDNA in the bloodstream coupled with technical errors introduced during library construction and sequencing complicates mutation detection. To achieve high accuracy of variant calling via better distinguishing low frequency ctDNA mutations from background errors, we introduce TNER (Tri-Nucleotide Error Reducer), a novel background error suppression method that provides a robust estimation of background noise to reduce sequencing errors. It significantly enhances the specificity for downstream ctDNA mutation detection without sacrificing sensitivity. Results on both simulated and real healthy subjects’ data demonstrate that the proposed algorithm consistently outperforms a current, state of the art, position-specific error polishing model, particularly when the sample size of healthy subjects is small. TNER is publicly available at https://github.com/ctDNA/TNER.
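The tri-nucleotide idea named in this abstract can be illustrated by pooling background error counts over all positions that share the same three-base reference context. The abstract does not describe TNER's estimator, so the plain pooled average below is a stand-in, and the function name, inputs and toy counts are hypothetical.

```python
from collections import defaultdict


def trinucleotide_error_rates(reference, error_counts, depths):
    """Pool background error rates over positions sharing a 3-base context.

    reference: reference sequence string.
    error_counts[i], depths[i]: non-reference reads and total reads at
    position i, summed over healthy control samples (assumed inputs).
    """
    errors = defaultdict(int)
    totals = defaultdict(int)
    # The first and last positions have no complete 3-base context.
    for i in range(1, len(reference) - 1):
        context = reference[i - 1:i + 2]
        errors[context] += error_counts[i]
        totals[context] += depths[i]
    return {ctx: errors[ctx] / totals[ctx] for ctx in totals if totals[ctx] > 0}


reference = "ACGACGT"
error_counts = [0, 2, 0, 1, 4, 0, 0]
depths = [1000] * 7
rates = trinucleotide_error_rates(reference, error_counts, depths)
print(rates)
```

Pooling by context gives each position more data to draw on than a purely position-specific estimate, which is consistent with the abstract's remark about small healthy-subject cohorts.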


2018 ◽  
Author(s):  
Weitai Huang ◽  
Yu Amanda Guo ◽  
Karthik Muthukumar ◽  
Probhonjon Baruah ◽  
Meimei Chang ◽  
...  

Abstract Summary SMuRF is an ensemble method for prediction of somatic point mutations (SNVs) and small insertions/deletions (indels) in cancer genomes. The method integrates predictions and auxiliary features from different somatic mutation callers using a Random Forest machine learning approach. SMuRF is trained on community-curated tumor whole genome sequencing data, is robust across cancer types, and achieves improved accuracy for both SNV and indel predictions of genome and exome-level data. The software is user-friendly and portable by design, operating as an add-on to the community-developed bcbio-nextgen somatic variant calling pipeline.


Author(s):  
Alexandre Yahi ◽  
Paul Hoffman ◽  
Margot Brandt ◽  
Pejman Mohammadi ◽  
Nicholas P. Tatonetti ◽  
...  

Abstract Genome editing experiments are generating an increasing amount of targeted sequencing data with specific mutational patterns indicating the success of the experiments and genotypes of clonal cell lines. We present EdiTyper, a high-throughput command line tool specifically designed for analysis of sequencing data from polyclonal and monoclonal cell populations from CRISPR gene editing. It requires simple inputs of sequencing data and reference sequences, and provides comprehensive outputs including summary statistics, plots, and SAM/BAM alignments. Analysis of simulated data showed that EdiTyper is highly accurate for detection of both single nucleotide mutations and indels, robust to sequencing errors, as well as fast and scalable to large experimental batches. EdiTyper is available on GitHub (https://github.com/LappalainenLab/edityper) under the MIT license.


2021 ◽  
Author(s):  
Michael A. Martin ◽  
Katia Koelle

An early analysis of SARS-CoV-2 deep-sequencing data that combined epidemiological and genetic data to characterize the transmission dynamics of the virus in and beyond Austria concluded that the size of the virus’s transmission bottleneck was large – on the order of 1000 virions. We performed new computational analyses using these deep-sequenced samples from Austria. Our analyses included characterization of transmission bottleneck sizes across a range of variant calling thresholds and examination of patterns of shared low-frequency variants between transmission pairs in cases where de novo genetic variation was present in the recipient. From these analyses, among others, we found that SARS-CoV-2 transmission bottlenecks are instead likely to be very tight, on the order of 1-3 virions. These findings have important consequences for understanding how SARS-CoV-2 evolves between hosts and the processes shaping genetic variation observed at the population level.

