Family reunion via error correction: an efficient analysis of duplex sequencing data

Abstract Duplex sequencing is the most accurate approach for identifying sequence variants present at very low frequencies. Its power comes from pooling multiple descendants of both strands of the original DNA molecules, which makes it possible to distinguish true nucleotide substitutions from PCR amplification and sequencing artifacts. This strategy comes at a cost: sequencing the same molecule multiple times increases dynamic range but significantly diminishes coverage, making whole genome duplex sequencing prohibitively expensive. Furthermore, every duplex experiment produces a substantial proportion of singleton reads that cannot be used in the analysis and are effectively thrown away. In this paper we demonstrate that a significant fraction of these reads contains PCR or sequencing errors within duplex tags. Correcting such errors allows “reuniting” these reads with their respective families, increasing the output of the method and making it more cost effective. Additionally, we combine the error correction strategy with a number of algorithmic improvements in a new version of the duplex analysis software, Du Novo 2.0, readily available through Galaxy, Bioconda, and as source code.
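
The idea of reuniting singleton reads with their families can be illustrated with a minimal sketch. It is not the Du Novo 2.0 algorithm: it assumes that tags are plain equal-length strings, that a singleton is a tag seen exactly once, and that a singleton may be reassigned only when exactly one larger family lies within `max_dist` mismatches. The function names and the threshold are hypothetical.

```python
from collections import Counter


def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length tags."""
    return sum(x != y for x, y in zip(a, b))


def reunite_singletons(tags, max_dist=1):
    """Map each singleton tag to the single family tag within max_dist mismatches.

    Assumption (not from the paper): a singleton is reassigned only when
    exactly one candidate family exists; ambiguous cases are left untouched.
    """
    counts = Counter(tags)
    families = [t for t, n in counts.items() if n > 1]
    mapping = {}
    for tag, n in counts.items():
        if n > 1:
            continue  # already part of a family
        close = [
            f for f in families
            if len(f) == len(tag) and hamming(tag, f) <= max_dist
        ]
        if len(close) == 1:
            mapping[tag] = close[0]
    return mapping


# Toy input: one tag differs from a three-read family by a single base.
tags = ["ACGTAC"] * 3 + ["ACGTAA"] + ["TTGCAT"] * 2 + ["GGGGGG"]
print(reunite_singletons(tags))  # {'ACGTAA': 'ACGTAC'}
```

Here the singleton `ACGTAA` rejoins the `ACGTAC` family, while `GGGGGG` stays a singleton because no family is close enough. The all-against-all comparison is quadratic in the number of tags, so a real dataset would need an indexed search.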

Increased yields of duplex sequencing data by a series of quality control tools

NAR Genomics and Bioinformatics ◽

10.1093/nargab/lqab002 ◽

2021 ◽

Vol 3 (1) ◽

Author(s):

Gundula Povysil ◽

Monika Heinzl ◽

Renato Salazar ◽

Nicholas Stoler ◽

Anton Nekrutenko ◽

...

Keyword(s):

Low Frequency ◽

Variant Calling ◽

Data Loss ◽

Sequencing Data ◽

Bioinformatics Pipeline ◽

Consensus Sequences ◽

Sequencing Errors ◽

Data Output ◽

Reverse Strand ◽

Duplex Sequencing

Abstract Duplex sequencing is currently the most reliable method to identify ultra-low frequency DNA variants by grouping sequence reads derived from the same DNA molecule into families with information on the forward and reverse strand. However, only a small proportion of reads are assembled into duplex consensus sequences (DCS), and reads with potentially valuable information are discarded at different steps of the bioinformatics pipeline, especially reads without a family. We developed a bioinformatics toolset that analyses the tag and family composition in order to understand data loss and implement modifications that maximize the data output for variant calling. Specifically, our tools show that tags contain polymerase chain reaction and sequencing errors that contribute to data loss and lower DCS yields. Our tools also identified chimeras, which likely reflect barcode collisions. Finally, we developed a tool that re-examines variant calls from raw reads and provides summary data that categorize the confidence level of a variant call in a tier-based system. With this tool, we can include reads without a family and check the reliability of the call, which substantially increases the sequencing depth for variant calling, a particularly important advantage for low-input samples or low-coverage regions.
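
How families with forward and reverse strand information yield a duplex consensus can be sketched as follows. This is a simplified illustration, not the pipeline described in the abstract: it assumes reads arrive as hypothetical `(tag, strand, sequence)` tuples, that all sequences in a family are already aligned, equal in length, and oriented to the same reference strand, and that the `min_frac` and `min_family_size` thresholds are arbitrary.

```python
from collections import Counter, defaultdict


def single_strand_consensus(seqs, min_frac=0.6):
    """Majority base per column; 'N' where no base reaches min_frac."""
    consensus = []
    for column in zip(*seqs):
        base, count = Counter(column).most_common(1)[0]
        consensus.append(base if count / len(column) >= min_frac else "N")
    return "".join(consensus)


def duplex_consensus(reads, min_family_size=2):
    """Build one duplex consensus per tag from (tag, strand, seq) tuples.

    A tag is reported only if both strand families reach min_family_size;
    positions where the two strand consensuses disagree are masked as 'N'.
    """
    families = defaultdict(list)
    for tag, strand, seq in reads:
        families[(tag, strand)].append(seq)
    result = {}
    for tag in {t for t, _ in families}:
        fwd = families.get((tag, "+"), [])
        rev = families.get((tag, "-"), [])
        if len(fwd) < min_family_size or len(rev) < min_family_size:
            continue  # incomplete duplex: no DCS for this tag
        a = single_strand_consensus(fwd)
        b = single_strand_consensus(rev)
        result[tag] = "".join(x if x == y else "N" for x, y in zip(a, b))
    return result


reads = [
    ("AAT", "+", "ACTT"), ("AAT", "+", "ACTT"),  # strand-specific G>T change
    ("AAT", "-", "ACGT"), ("AAT", "-", "ACGT"),
    ("CCG", "+", "TTTT"),                        # read without a full family
]
print(duplex_consensus(reads))  # {'AAT': 'ACNT'}
```

The third position is masked because the change appears on only one strand. The lone `CCG` read produces no consensus, which is the kind of data loss the abstract describes.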

An Error Correction Method of Nanopore Sequencing Data Using Deep Learning

2020 13th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI) ◽

10.1109/cisp-bmei51763.2020.9263622 ◽

2020 ◽

Author(s):

Luotong Wang ◽

Li Qu ◽

Longshu Yang ◽

Yiying Wang ◽

Huaiqiu Zhu

Keyword(s):

Deep Learning ◽

Error Correction ◽

Correction Method ◽

Nanopore Sequencing ◽

Sequencing Data ◽

Error Correction Method

An Empirical Evaluation of Error Correction Methods and Tools for Next Generation Sequencing Data

International Journal of Advanced Computer Science and Applications ◽

10.14569/ijacsa.2018.090158 ◽

2018 ◽

Vol 9 (1) ◽

Author(s):

Atif Mehmood ◽

Javed Ferzund ◽

Muhammad Usman ◽

Abbas Rehman ◽

Shahzad Ahmed ◽

...

Keyword(s):

Next Generation Sequencing ◽

Error Correction ◽

Empirical Evaluation ◽

Next Generation Sequencing Data ◽

Next Generation ◽

Sequencing Data ◽

Generation Sequencing

ELECTOR: evaluator for long reads correction methods

NAR Genomics and Bioinformatics ◽

10.1093/nargab/lqz015 ◽

2019 ◽

Vol 2 (1) ◽

Cited By ~ 4

Author(s):

Camille Marchet ◽

Pierre Morisse ◽

Lolita Lecompte ◽

Arnaud Lefebvre ◽

Thierry Lecroq ◽

...

Keyword(s):

Error Correction ◽

Error Rates ◽

Sequencing Data ◽

Multiple Sequence ◽

Long Reads ◽

Wide Range ◽

Unique Method ◽

Algorithmic Strategy ◽

Downstream Processes

Abstract Error rates of third-generation sequencing data remain above 5%, and the errors are mainly insertions and deletions. Consequently, an increasing number of diverse long-read correction methods have been proposed. The quality of the correction has a large impact on downstream processes, so methods that evaluate error correction tools with precise and reliable statistics are a crucial need. These evaluation methods rely on costly alignments to assess the quality of the corrected reads. Thus, key features must allow fast comparison of different tools and scale to the increasing length of long reads. Our tool, ELECTOR, evaluates long-read correction and is directly compatible with a wide range of error correction tools. As it is based on multiple sequence alignment, we introduce a new algorithmic strategy for alignment segmentation, which enables us to scale to large instances using reasonable resources. To our knowledge, we provide the only method that can produce reproducible correction benchmarks on the latest ultra-long reads (>100 k bases). It is also faster than the current state-of-the-art on other datasets and provides a wider set of metrics to assess the read quality improvement after correction. ELECTOR is available on GitHub (https://github.com/kamimrcht/ELECTOR) and Bioconda.
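
The principle of comparing a read against its reference before and after correction can be shown with a small sketch. It does not reproduce ELECTOR, which relies on multiple sequence alignment with segmentation. Instead it substitutes a plain pairwise Levenshtein distance and assumes the true reference sequence of each read is known, as in a simulation. The function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (substitutions, insertions, deletions) by DP."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion from a
                curr[j - 1] + 1,           # insertion into a
                prev[j - 1] + (ca != cb),  # match or substitution
            ))
        prev = curr
    return prev[-1]


def error_rate(reference: str, read: str) -> float:
    """Edits per reference base; a simple stand-in for alignment-based metrics."""
    return edit_distance(reference, read) / len(reference)


reference = "ACGTACGTAC"
raw = "ACGACGTAAC"         # one deletion and one insertion
corrected = "ACGTACGTAAC"  # deletion repaired, insertion still present

print(error_rate(reference, raw), error_rate(reference, corrected))
```

The dynamic program costs time proportional to the product of the two sequence lengths. That cost is the scaling problem the abstract raises for reads longer than 100 k bases.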

Error correction and statistical analyses for intra-host comparisons of feline immunodeficiency virus diversity from high-throughput sequencing data

BMC Bioinformatics ◽

10.1186/s12859-015-0607-z ◽

2015 ◽

Vol 16 (1) ◽

Cited By ~ 1

Author(s):

Yang Liu ◽

Francesca Chiaromonte ◽

Howard Ross ◽

Raunaq Malhotra ◽

Daniel Elleder ◽

...

Keyword(s):

Error Correction ◽

High Throughput ◽

Feline Immunodeficiency Virus ◽

High Throughput Sequencing ◽

Statistical Analyses ◽

Sequencing Data ◽

Virus Diversity ◽

High Throughput Sequencing Data ◽

Immunodeficiency Virus

HiTEC: accurate error correction in high-throughput sequencing data

Bioinformatics ◽

10.1093/bioinformatics/btq653 ◽

2010 ◽

Vol 27 (3) ◽

pp. 295-302 ◽

Cited By ~ 86

Author(s):

L. Ilie ◽

F. Fazayeli ◽

S. Ilie

Keyword(s):

Error Correction ◽

High Throughput ◽

High Throughput Sequencing ◽

Sequencing Data ◽

High Throughput Sequencing Data

Factorial analysis of error correction performance using simulated next-generation sequencing data

2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) ◽

10.1109/bibm.2016.7822685 ◽

2016 ◽

Author(s):

Isaac Akogwu ◽

Nan Wang ◽

Chaoyang Zhang ◽

Hwanseok Choi ◽

Huixiao Hong ◽

...

Keyword(s):

Next Generation Sequencing ◽

Error Correction ◽

Factorial Analysis ◽

Next Generation Sequencing Data ◽

Next Generation ◽

Sequencing Data ◽

Analysis Of Error ◽

Generation Sequencing

Error correction of next-generation sequencing data and reliable estimation of HIV quasispecies

Nucleic Acids Research ◽

10.1093/nar/gkq655 ◽

2010 ◽

Vol 38 (21) ◽

pp. 7400-7409 ◽

Cited By ~ 154

Author(s):

Osvaldo Zagordi ◽

Rolf Klein ◽

Martin Däumer ◽

Niko Beerenwinkel

Keyword(s):

Next Generation Sequencing ◽

Error Correction ◽

Next Generation Sequencing Data ◽

Next Generation ◽

Sequencing Data ◽

Reliable Estimation ◽

Generation Sequencing

Illuminating the dark side of the human transcriptome with long read transcript sequencing

10.21203/rs.3.rs-23156/v3 ◽

2020 ◽

Author(s):

Richard Kuo ◽

Yuanyuan Cheng ◽

Runxuan Zhang ◽

John W.S. Brown ◽

Jacqueline Smith ◽

...

Keyword(s):

Data Processing ◽

Error Correction ◽

Human Genome ◽

Parameter Tuning ◽

Dark Side ◽

Sequencing Data ◽

Protein Coding ◽

Human Transcriptome ◽

Model Predictions ◽

Long Read

Abstract Background: The human transcriptome annotation is regarded as one of the most complete of any eukaryotic species. However, limitations in sequencing technologies have biased the annotation toward multi-exonic protein coding genes. Accurate high-throughput long read transcript sequencing can now provide additional evidence for rare transcripts and genes such as mono-exonic and non-coding genes that were previously either undetectable or impossible to differentiate from sequencing noise. Results: We developed the Transcriptome Annotation by Modular Algorithms (TAMA) software to leverage the power of long read transcript sequencing and address the issues with current data processing pipelines. TAMA achieved high sensitivity and precision for gene and transcript model predictions in both reference guided and unguided approaches in our benchmark tests using simulated Pacific Biosciences (PacBio) and Nanopore sequencing data and real PacBio datasets. By analyzing PacBio Sequel II Iso-Seq sequencing data of the Universal Human Reference RNA (UHRR) using TAMA and other commonly used tools, we found that the convention of using alignment identity to measure error correction performance does not reflect the actual gain in accuracy of predicted transcript models. In addition, inter-read error correction can cause major changes to read mapping, resulting in potentially over 6K erroneous gene model predictions in the Iso-Seq based human genome annotation. Using TAMA’s genome assembly based error correction and gene feature evidence, we predicted 2,566 putative novel non-coding genes and 1,557 putative novel protein coding gene models. Conclusions: Long read transcript sequencing data has the power to identify novel genes within the highly annotated human genome. The parameter tuning and extensive output information of the TAMA software package allow for in-depth exploration of eukaryotic transcriptomes. We have found long-read-based evidence for thousands of unannotated genes within the human genome. More development in sequencing library preparation and data processing is required to differentiate sequencing noise from real genes in long read RNA sequencing data.
