Increased yields of duplex sequencing data by a series of quality control tools

Abstract Duplex sequencing is currently the most reliable method to identify ultra-low frequency DNA variants by grouping sequence reads derived from the same DNA molecule into families with information on the forward and reverse strand. However, only a small proportion of reads are assembled into duplex consensus sequences (DCS), and reads with potentially valuable information are discarded at different steps of the bioinformatics pipeline, especially reads without a family. We developed a bioinformatics toolset that analyses the tag and family composition in order to understand data loss and implement modifications that maximize the data output for variant calling. Specifically, our tools show that tags contain PCR and sequencing errors that contribute to data loss and lower DCS yields. Our tools also identified chimeras, which result in unpaired families that do not form DCS. Finally, we also developed a tool called Variant Analyzer that re-examines variant calls from raw reads and provides summary data that categorize the confidence level of a variant call by a tier-based system. We demonstrate that this tool identified false positive variants tagged by the tier-based classification. Furthermore, with this tool we can include reads without a family and check the reliability of the call, which substantially increases the sequencing depth for variant calling, a particularly important advantage for low-input samples or low-coverage regions.


Increased yields of duplex sequencing data by a series of quality control tools

NAR Genomics and Bioinformatics ◽

10.1093/nargab/lqab002 ◽

2021 ◽

Vol 3 (1) ◽

Author(s):

Gundula Povysil ◽

Monika Heinzl ◽

Renato Salazar ◽

Nicholas Stoler ◽

Anton Nekrutenko ◽

...

Keyword(s):

Low Frequency ◽

Variant Calling ◽

Data Loss ◽

Sequencing Data ◽

Bioinformatics Pipeline ◽

Consensus Sequences ◽

Sequencing Errors ◽

Data Output ◽

Reverse Strand ◽

Duplex Sequencing

Abstract Duplex sequencing is currently the most reliable method to identify ultra-low frequency DNA variants by grouping sequence reads derived from the same DNA molecule into families with information on the forward and reverse strand. However, only a small proportion of reads are assembled into duplex consensus sequences (DCS), and reads with potentially valuable information are discarded at different steps of the bioinformatics pipeline, especially reads without a family. We developed a bioinformatics toolset that analyses the tag and family composition in order to understand data loss and implement modifications that maximize the data output for variant calling. Specifically, our tools show that tags contain polymerase chain reaction and sequencing errors that contribute to data loss and lower DCS yields. Our tools also identified chimeras, which likely reflect barcode collisions. Finally, we also developed a tool that re-examines variant calls from raw reads and provides summary data that categorize the confidence level of a variant call by a tier-based system. With this tool, we can include reads without a family and check the reliability of the call, which substantially increases the sequencing depth for variant calling, a particularly important advantage for low-input samples or low-coverage regions.
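The family grouping described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the `(tag_a, tag_b, sequence)` input layout, the majority threshold and the minimum family size are all assumptions chosen for illustration, and the reads are assumed to be already oriented to the same reference strand and of equal length.

```python
from collections import Counter, defaultdict


def consensus(seqs, min_fraction=0.6):
    """Majority-vote consensus; positions without a clear majority become 'N'."""
    out = []
    for column in zip(*seqs):
        base, count = Counter(column).most_common(1)[0]
        out.append(base if count / len(column) >= min_fraction else "N")
    return "".join(out)


def duplex_consensus(reads, min_family_size=3):
    """Group reads into families by tag and pair opposite-strand families.

    reads: iterable of (tag_a, tag_b, sequence) tuples (hypothetical layout).
    A family with tag (a, b) is paired with the family tagged (b, a).
    Returns (sscs, dcs): single-strand and duplex consensus sequences.
    """
    families = defaultdict(list)
    for tag_a, tag_b, seq in reads:
        families[(tag_a, tag_b)].append(seq)

    # Single-strand consensus: only families with enough reads qualify.
    sscs = {
        tag: consensus(seqs)
        for tag, seqs in families.items()
        if len(seqs) >= min_family_size
    }

    # Duplex consensus: both strands must be present; disagreements become 'N'.
    dcs = {}
    for (a, b), fwd in sscs.items():
        if a == b or (b, a) in dcs:
            continue
        rev = sscs.get((b, a))
        if rev is None:
            continue  # unpaired family: no duplex consensus is formed
        dcs[(a, b)] = "".join(x if x == y else "N" for x, y in zip(fwd, rev))
    return sscs, dcs


reads = (
    [("AAA", "CCC", "ACGT")] * 2 + [("AAA", "CCC", "ACGA")]  # one read with an error
    + [("CCC", "AAA", "ACGT")] * 3                           # opposite strand
    + [("GGG", "TTT", "ACGT")]                               # read without a family
)
sscs, dcs = duplex_consensus(reads)
```

In this toy input the lone `("GGG", "TTT")` read never reaches a consensus, which is the kind of data loss the abstract refers to.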


smCounter2: an accurate low-frequency variant caller for targeted sequencing data with unique molecular identifiers

10.1101/281659 ◽

2018 ◽

Cited By ~ 4

Author(s):

Chang Xu ◽

Xiujing Gu ◽

Raghavendra Padmanabhan ◽

Zhong Wu ◽

Quan Peng ◽

...

Keyword(s):

Low Frequency ◽

Variant Calling ◽

Targeted Sequencing ◽

Superior Performance ◽

Sequencing Data ◽

Background Error ◽

Fundamental Limits ◽

Sequencing Errors ◽

Coding Regions ◽

Improved Accuracy

Abstract Motivation Low-frequency DNA mutations are often confounded with technical artifacts from sample preparation and sequencing. With unique molecular identifiers (UMIs), most of the sequencing errors can be corrected. However, errors before UMI tagging, such as DNA polymerase errors during end-repair and the first PCR cycle, cannot be corrected with single-strand UMIs and impose fundamental limits to UMI-based variant calling. Results We developed smCounter2, a UMI-based variant caller for targeted sequencing data and an upgrade from the current version of smCounter. Compared to smCounter, smCounter2 features a lower detection limit at 0.5%, better overall accuracy (particularly in non-coding regions), a consistent threshold that can be applied to both deep and shallow sequencing runs, and easier use via a Docker image and code for read pre-processing. We benchmarked smCounter2 against several state-of-the-art UMI-based variant calling methods using multiple datasets and demonstrated smCounter2’s superior performance in detecting somatic variants. At the core of smCounter2 is a statistical test to determine whether the allele frequency of the putative variant is significantly above the background error rate, which was carefully modeled using an independent dataset. The improved accuracy in non-coding regions was mainly achieved using novel repetitive region filters that were specifically designed for UMI data. Availability The entire pipeline is available at https://github.com/qiaseq/qiaseq-dna under MIT license.


Population-specific genome graphs improve high-throughput sequencing data analysis: A case study on the Pan-African genome

10.1101/2021.03.19.436173 ◽

2021 ◽

Author(s):

H. Serhat Tetikol ◽

Kubra Narci ◽

Deniz Turgut ◽

Gungor Budak ◽

Ozem Kalay ◽

...

Keyword(s):

High Throughput Sequencing ◽

Information Overload ◽

African Ancestry ◽

Sample Selection ◽

Variant Calling ◽

Population Diversity ◽

Human Populations ◽

Sequencing Data ◽

Bioinformatics Pipeline ◽

Graph Augmentation

Abstract Graph-based genome reference representations have seen significant development, motivated by the inadequacy of the current human genome reference for capturing the diverse genetic information from different human populations and its inability to maintain the same level of accuracy for non-European ancestries. While there have been many efforts to develop computationally efficient graph-based bioinformatics toolkits, how to curate genomic variants and subsequently construct genome graphs remains an understudied problem that inevitably determines the effectiveness of the end-to-end bioinformatics pipeline. In this study, we discuss major obstacles encountered during graph construction and propose methods for sample selection based on population diversity, graph augmentation with structural variants and resolution of graph reference ambiguity caused by information overload. Moreover, we present the case for iteratively augmenting tailored genome graphs for targeted populations and test the proposed approach on the whole-genome samples of African ancestry. Our results show that, as more representative alternatives to linear or generic graph references, population-specific graphs can achieve significantly lower read mapping errors, increased variant calling sensitivity and provide the improvements of joint variant calling without the need of computationally intensive post-processing steps.


Variant calling and quality control of large-scale human genome sequencing data

Emerging Topics in Life Sciences ◽

10.1042/etls20190007 ◽

2019 ◽

Vol 3 (4) ◽

pp. 399-409 ◽

Cited By ~ 1

Author(s):

Brandon Jew ◽

Jae Hoon Sul

Keyword(s):

Quality Control ◽

Genome Sequencing ◽

Genetic Variants ◽

Large Scale ◽

Variant Calling ◽

Sequencing Data ◽

Computational Approaches ◽

Sequencing Errors ◽

Human Genome Sequencing ◽

Number Of Individuals

Abstract Next-generation sequencing has allowed genetic studies to collect genome sequencing data from a large number of individuals. However, raw sequencing data are not usually interpretable due to fragmentation of the genome and technical biases; therefore, analysis of these data requires many computational approaches. First, for each sequenced individual, sequencing data are aligned and further processed to account for technical biases. Then, variant calling is performed to obtain information on the positions of genetic variants and their corresponding genotypes. Quality control (QC) is applied to identify individuals and genetic variants with sequencing errors. These procedures are necessary to generate accurate variant calls from sequencing data, and many computational approaches have been developed for these tasks. This review will focus on current widely used approaches for variant calling and QC.


Next Generation Sequencing of Pooled Samples: Guideline for Variants’ Filtering

Scientific Reports ◽

10.1038/srep33735 ◽

2016 ◽

Vol 6 (1) ◽

Cited By ~ 31

Author(s):

Santosh Anand ◽

Eleonora Mangano ◽

Nadia Barizzone ◽

Roberta Bordoni ◽

Melissa Sorosina ◽

...

Keyword(s):

Next Generation Sequencing ◽

Low Frequency ◽

Next Generation ◽

Sequencing Data ◽

Sequencing Errors ◽

Effective Option ◽

Sequencing Experiment ◽

Kolmogorov Smirnov ◽

Next Generation Sequencing Ngs ◽

Generation Sequencing

Abstract Sequencing a large number of individuals, which is often needed for population genetics studies, is still economically challenging despite falling costs of Next Generation Sequencing (NGS). Pool-seq is an alternative cost- and time-effective option in which DNA from several individuals is pooled for sequencing. However, pooling of DNA creates new problems and challenges for accurate variant calling and allele frequency (AF) estimation. In particular, sequencing errors are confounded with the alleles present at low frequency in the pools, possibly giving rise to false positive variants. We sequenced 996 individuals in 83 pools (12 individuals/pool) in a targeted re-sequencing experiment. We show that Pool-seq AFs are robust and reliable by comparing them with public variant databases and in-house SNP-genotyping data of individual subjects of pools. Furthermore, we propose a simple filtering guideline for the removal of spurious variants based on the Kolmogorov-Smirnov statistical test. We experimentally validated our filters by comparing Pool-seq to individual sequencing data, showing that the filters remove most of the false variants while retaining the majority of true variants. The proposed guideline is fairly generic in nature and could be easily applied in other Pool-seq experiments.


Family reunion via error correction: An efficient analysis of duplex sequencing data

10.1101/469106 ◽

2018 ◽

Cited By ~ 1

Author(s):

Nicholas Stoler ◽

Barbara Arbeithuber ◽

Gundula Povysil ◽

Monika Heinzl ◽

Renato Salazar ◽

...

Keyword(s):

Error Correction ◽

Dynamic Range ◽

Pcr Amplification ◽

Cost Effective ◽

Sequencing Data ◽

Nucleotide Substitutions ◽

Low Frequencies ◽

Family Reunion ◽

Sequencing Errors ◽

Duplex Sequencing

Abstract Duplex sequencing is the most accurate approach for identification of sequence variants present at very low frequencies. Its power comes from pooling together multiple descendants of both strands of original DNA molecules, which allows distinguishing true nucleotide substitutions from PCR amplification and sequencing artifacts. This strategy comes at a cost: sequencing the same molecule multiple times increases dynamic range but significantly diminishes coverage, making whole genome duplex sequencing prohibitively expensive. Furthermore, every duplex experiment produces a substantial proportion of singleton reads that cannot be used in the analysis and are, technically, thrown away. In this paper we demonstrate that a significant fraction of these reads contain PCR or sequencing errors within duplex tags. Correction of such errors allows “reuniting” these reads with their respective families, increasing the output of the method and making it more cost effective. Additionally, we combine the error correction strategy with a number of algorithmic improvements in a new version of the duplex analysis software, Du Novo 2.0, readily available through Galaxy, Bioconda, and as the source code.
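The idea of "reuniting" singleton reads with their families can be sketched in Python as follows. This is a deliberately simplified stand-in, not the Du Novo 2.0 algorithm: it assumes a hypothetical `tag_counts` mapping from tag to read count, uses plain Hamming distance, and only reassigns a singleton when exactly one larger family lies within the allowed number of mismatches.

```python
from collections import Counter


def hamming(a, b):
    """Number of mismatching positions between two equal-length tags."""
    return sum(x != y for x, y in zip(a, b))


def reunite_singletons(tag_counts, max_mismatches=1):
    """Reassign singleton tags to a unique nearby family tag.

    tag_counts: mapping tag -> number of reads carrying that tag.
    A singleton is merged only if exactly one family (count > 1) is within
    max_mismatches; ambiguous or isolated singletons are left unchanged.
    """
    families = [tag for tag, n in tag_counts.items() if n > 1]
    corrected = Counter()
    for tag, n in tag_counts.items():
        if n > 1:
            corrected[tag] += n
            continue
        near = [
            fam for fam in families
            if len(fam) == len(tag) and hamming(tag, fam) <= max_mismatches
        ]
        target = near[0] if len(near) == 1 else tag
        corrected[target] += n
    return corrected


# "ACGTAA" differs from the family tag "ACGTAC" at one position and is merged;
# "TTTTTT" has no nearby family and stays a singleton.
merged = reunite_singletons({"ACGTAC": 5, "ACGTAA": 1, "TTTTTT": 1})

# "AAAC" is one mismatch away from two families, so it is left alone.
ambiguous = reunite_singletons({"AAAA": 3, "AAAT": 2, "AAAC": 1})
```

Refusing to merge ambiguous singletons is a conservative choice for this sketch; it trades some recovered reads for a lower risk of assigning a read to the wrong family.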


A Novel SARS-CoV-2 Viral Sequence Bioinformatic Pipeline Has Found Genetic Evidence That the Viral 3′ Untranslated Region (UTR) Is Evolving and Generating Increased Viral Diversity

Frontiers in Microbiology ◽

10.3389/fmicb.2021.665041 ◽

2021 ◽

Vol 12 ◽

Author(s):

Carlos Farkas ◽

Andy Mella ◽

Maxime Turgeon ◽

Jody J. Haigh

Keyword(s):

Stop Codon ◽

Low Frequency ◽

Variant Calling ◽

Viral Diversity ◽

Stem Loop ◽

Bioinformatic Pipeline ◽

Host Immune Responses ◽

Sequencing Errors ◽

Bioinformatic Tools ◽

Loop Region

An unprecedented amount of SARS-CoV-2 sequencing has been performed; however, novel bioinformatic tools to cope with and process these large datasets are needed. Here, we have devised a bioinformatic pipeline that takes SARS-CoV-2 genome sequencing data in FASTA/FASTQ format as input and outputs a single Variant Calling Format file that can be processed to obtain variant annotations and perform downstream population genetic testing. As proof of concept, we have analyzed over 229,000 SARS-CoV-2 viral sequences up until November 30, 2020. We have identified over 39,000 variants worldwide with increased polymorphisms, spanning the ORF3a gene as well as the 3′ untranslated regions (UTRs), specifically in the conserved stem loop region of SARS-CoV-2, which is accumulating greater observed viral diversity relative to chance variation. Our analysis pipeline has also discovered the existence of SARS-CoV-2 hypermutation at low frequency (in less than 2% of genomes), likely arising through host immune responses and not due to sequencing errors. Among annotated nonsense variants with a population frequency over 1%, recurrent inactivation of the ORF8 gene was found. This was found to be present in the newly identified B.1.1.7 SARS-CoV-2 lineage that originated in the United Kingdom. Almost all VOC-containing genomes possess one stop codon in the ORF8 gene (Q27∗); however, 13% of these genomes also contain another stop codon (K68∗), suggesting that ORF8 loss does not interfere with SARS-CoV-2 spread and may play a role in its increased virulence. We have developed this computational pipeline to assist researchers in the rapid analysis and characterization of SARS-CoV-2 variation.


Reanalysis of deep-sequencing data from Austria points towards a small SARS-COV-2 transmission bottleneck on the order of one to three virions

10.1101/2021.02.22.432096 ◽

2021 ◽

Author(s):

Michael A. Martin ◽

Katia Koelle

Keyword(s):

Genetic Variation ◽

Deep Sequencing ◽

De Novo ◽

Low Frequency ◽

Variant Calling ◽

Population Level ◽

Sequencing Data ◽

Deep Sequencing Data ◽

Computational Analyses ◽

Transmission Bottleneck

An early analysis of SARS-CoV-2 deep-sequencing data that combined epidemiological and genetic data to characterize the transmission dynamics of the virus in and beyond Austria concluded that the size of the virus’s transmission bottleneck was large – on the order of 1000 virions. We performed new computational analyses using these deep-sequenced samples from Austria. Our analyses included characterization of transmission bottleneck sizes across a range of variant calling thresholds and examination of patterns of shared low-frequency variants between transmission pairs in cases where de novo genetic variation was present in the recipient. From these analyses, among others, we found that SARS-CoV-2 transmission bottlenecks are instead likely to be very tight, on the order of 1-3 virions. These findings have important consequences for understanding how SARS-CoV-2 evolves between hosts and the processes shaping genetic variation observed at the population level.


LFMD: detecting low-frequency mutations in high-depth genome sequencing data without molecular tags

10.1101/617381 ◽

2019 ◽

Author(s):

Rui Ye ◽

Xuehan Zhuang ◽

Jie Ruan ◽

Yanwei Qi ◽

Yitai An ◽

...

Keyword(s):

Oxidative Damage ◽

Low Frequency ◽

Sequencing Data ◽

Free Radical Theory ◽

Radical Theory ◽

Drug Resistance Prediction ◽

Theory Of Aging ◽

Mitochondrial Heterogeneity ◽

Duplex Sequencing ◽

Next Generation Sequencing Ngs

Abstract As next-generation sequencing (NGS) and liquid biopsy become more prevalent in research and in the clinic, there is an increasing need for better methods to reduce cost and improve sensitivity and specificity of low-frequency mutation detection (where the Alternative Allele Frequency, or AAF, is less than 1%). Here we propose a likelihood-based approach, called Low-Frequency Mutation Detector (LFMD), which combines the advantages of duplex sequencing (DS) and the bottleneck sequencing system (BotSeqS) to maximize the utilization of duplicate reads. Compared with the existing state-of-the-art methods, DS, Du Novo, UMI-tools, and Unified Consensus Maker, our method achieves higher sensitivity, higher specificity (< 4 × 10⁻¹⁰ errors per base sequenced) and lower cost (reduced by ~70% at best) without involving additional experimental steps, customized adapters or molecular tags. LFMD is useful in areas where high precision is required, such as drug resistance prediction and cancer screening. As an example of LFMD’s applications, mitochondrial heterogeneity analysis of 28 human brain samples across different stages of Alzheimer’s Disease (AD) showed that the canonical oxidative damage related mutations, C:G>A:T, are significantly increased in the mid-stage group. This is consistent with the Mitochondrial Free Radical Theory of Aging, suggesting that AD may be linked to the aging of brain cells induced by oxidative damage.


UMI-VarCal: a new UMI-based variant caller that efficiently improves low-frequency variant detection in paired-end sequencing NGS libraries

Bioinformatics ◽

10.1093/bioinformatics/btaa053 ◽

2020 ◽

Vol 36 (9) ◽

pp. 2718-2724 ◽

Cited By ~ 5

Author(s):

Vincent Sater ◽

Pierre-Julien Viailly ◽

Thierry Lecroq ◽

Élise Prieur-Gaston ◽

Élodie Bohers ◽

...

Keyword(s):

Tumor Cells ◽

Low Frequency ◽

Variant Calling ◽

Pcr Amplification ◽

Targeted Sequencing ◽

Supplementary Information ◽

Sequencing Data ◽

Single Nucleotide Variants ◽

Background Error ◽

Low Frequencies

Abstract Motivation Next-generation sequencing has become the go-to standard method for the detection of single-nucleotide variants in tumor cells. The use of such technologies requires a PCR amplification step and a sequencing step, steps in which artifacts are introduced at very low frequencies. These artifacts are often confused with true low-frequency variants that can be found in tumor cells and cell-free DNA. The recent use of unique molecular identifiers (UMI) in targeted sequencing protocols has offered a trustworthy approach to filter out artefactual variants and accurately call low-frequency variants. However, the integration of UMI analysis in the variant calling process led to developing tools that are significantly slower and more memory consuming than raw-reads-based variant callers. Results We present UMI-VarCal, a UMI-based variant caller for targeted sequencing data with better sensitivity compared to other variant callers. Being developed with performance in mind, UMI-VarCal stands out from the crowd by being one of the few variant callers that do not rely on SAMtools to do their pileup. Instead, at its core runs an innovative homemade pileup algorithm specifically designed to treat the UMI tags in the reads. After the pileup, a Poisson statistical test is applied at every position to determine if the frequency of the variant is significantly higher than the background error noise. Finally, an analysis of UMI tags is performed, a strand bias and a homopolymer length filter are applied to achieve better accuracy. We illustrate the results obtained using UMI-VarCal through the sequencing of tumor samples and we show how UMI-VarCal is both faster and more sensitive than other publicly available solutions. Availability and implementation The entire pipeline is available at https://gitlab.com/vincent-sater/umi-varcal-master under MIT license. Supplementary information Supplementary data are available at Bioinformatics online.
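The per-position Poisson test mentioned in this abstract can be illustrated with a small Python sketch. It is not UMI-VarCal's implementation: the function names, the one-sided upper-tail formulation, the significance level `alpha` and the example background error rate are assumptions made here for illustration only.

```python
import math


def poisson_sf(k, lam):
    """Upper tail P(X >= k) for X ~ Poisson(lam), computed from the CDF."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i    # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def is_above_background(alt_count, depth, error_rate, alpha=0.05):
    """Test whether alt_count exceeds what background errors alone would give.

    Under the null hypothesis, the number of alternative observations at a
    position is modelled as Poisson with mean depth * error_rate.
    Returns (p_value, significant).
    """
    expected = depth * error_rate
    p_value = poisson_sf(alt_count, expected)
    return p_value, p_value < alpha


# Hypothetical numbers: depth 10,000 and background error rate 0.05%,
# so about 5 erroneous observations are expected by chance.
p_strong, call_strong = is_above_background(20, 10_000, 0.0005)
p_weak, call_weak = is_above_background(5, 10_000, 0.0005)
```

With these made-up inputs, 20 supporting observations are far above the expected 5 and the position is flagged, whereas 5 observations are indistinguishable from background. A real caller would additionally correct for testing many positions.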
