Read Error Correction
Recently Published Documents

Total documents: 35 (12 in the last five years)
H-index: 10 (4 in the last five years)

Author(s):  
Yan Gao ◽  
Yongzhuang Liu ◽  
Yanmei Ma ◽  
Bo Liu ◽  
Yadong Wang ◽  
...  

Abstract Summary Partial order alignment, which aligns a sequence to a directed acyclic graph, is now frequently used as a key component in long-read error correction and assembly. We present abPOA (adaptive banded Partial Order Alignment), a Single Instruction Multiple Data (SIMD)-based C library for fast partial order alignment using adaptive banded dynamic programming. It can work as a stand-alone multiple sequence alignment and consensus calling tool or be easily integrated into any long-read error correction and assembly workflow. Compared to a state-of-the-art tool (SPOA), abPOA is up to 10 times faster with a comparable alignment accuracy. Availability and implementation abPOA is implemented in C. A stand-alone tool and a C/Python software interface are freely available at https://github.com/yangao07/abPOA. Supplementary information Supplementary data are available at Bioinformatics online.
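Banded dynamic programming, the idea abPOA adapts to partial order graphs, is easiest to see on ordinary pairwise alignment. The sketch below restricts a global-alignment DP to a fixed band around the main diagonal, so cells far off-diagonal are never computed. This is a toy illustration only: abPOA's band is adaptive and runs over a graph, and the scoring values here are arbitrary assumptions.

```python
def banded_global_align(a, b, band=3, match=2, mismatch=-2, gap=-1):
    """Global alignment score computed only inside a band of +/-`band`
    cells around the main diagonal; cells outside the band stay -inf."""
    n, m = len(a), len(b)
    NEG = float("-inf")
    dp = [[NEG] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0
    for j in range(1, min(m, band) + 1):   # top row, inside the band only
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        if i <= band:                      # left column, inside the band only
            dp[i][0] = i * gap
        for j in range(max(1, i - band), min(m, i + band) + 1):
            diag = dp[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = dp[i - 1][j] + gap        # -inf propagates harmlessly through max
            left = dp[i][j - 1] + gap
            dp[i][j] = max(diag, up, left)
    return dp[n][m]
```

The speedup comes from the inner loop touching only O(n * band) cells instead of O(n * m); the trade-off is that an optimal path wandering outside the band is missed, which is why abPOA adapts the band position as it goes.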


BMC Genomics ◽  
2020 ◽  
Vol 21 (S6) ◽  
Author(s):  
Haowen Zhang ◽  
Chirag Jain ◽  
Srinivas Aluru

Abstract Background Third-generation single-molecule sequencing technologies can sequence long reads, which is advancing the frontiers of genomics research. However, their high error rates prohibit accurate and efficient downstream analysis. This difficulty has motivated the development of many long read error correction tools, which tackle the problem by sampling redundancy and/or leveraging accurate short reads of the same biological samples. Existing studies assessing these tools use simulated data sets and are not sufficiently comprehensive in the range of software covered or the diversity of evaluation measures used. Results In this paper, we present a categorization and review of long read error correction methods, and provide a comprehensive evaluation of the corresponding long read error correction tools. Leveraging recent real sequencing data, we establish benchmark data sets and set up evaluation criteria for a comparative assessment that includes quality of error correction as well as run-time and memory usage. We study how trimming and long read sequencing depth affect error correction in terms of length distribution and genome coverage post-correction, and the impact of error correction performance on an important application of long reads, genome assembly. We provide guidelines for practitioners for choosing among the available error correction tools and identify directions for future research. Conclusions Despite the high error rate of long reads, state-of-the-art correction tools can achieve high correction quality. When short reads are available, the best hybrid methods outperform non-hybrid methods in terms of correction quality and computing resource usage. When choosing tools, practitioners are advised to be careful with correction tools that discard reads, and to check the effect of error correction on downstream analysis. Our evaluation code is available as open source at https://github.com/haowenz/LRECE.
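One common way to quantify correction quality, in the spirit of the evaluation above (though not necessarily LRECE's exact procedure), is to align each read to its true reference region and report an edit-distance-based error rate before and after correction. A minimal sketch, with the scoring and the example reads as illustrative assumptions:

```python
def edit_distance(a, b):
    """Levenshtein distance: substitutions, insertions and deletions all cost 1.
    Uses a rolling single-row DP to keep memory at O(len(b))."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # (mis)match
        prev = cur
    return prev[-1]

def error_rate(read, reference):
    """Errors per reference base; a good corrector should reduce this."""
    return edit_distance(read, reference) / len(reference)
```

Running `error_rate` on raw versus corrected reads against the same reference gives a simple before/after comparison of correction quality.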


Author(s):  
Felix Kallenborn ◽  
Andreas Hildebrandt ◽  
Bertil Schmidt

Abstract Motivation Error correction is a fundamental pre-processing step in many Next-Generation Sequencing (NGS) pipelines, in particular for de novo genome assembly. However, existing error correction methods either suffer from high false-positive rates since they break reads into independent k-mers or do not scale efficiently to large amounts of sequencing reads and complex genomes. Results We present CARE—an alignment-based scalable error correction algorithm for Illumina data using the concept of minhashing. Minhashing allows for efficient similarity search within large sequencing read collections which enables fast computation of high-quality multiple alignments. Sequencing errors are corrected by detailed inspection of the corresponding alignments. Our performance evaluation shows that CARE generates significantly fewer false-positive corrections than state-of-the-art tools (Musket, SGA, BFC, Lighter, Bcool, Karect) while maintaining a competitive number of true positives. When used prior to assembly it can achieve superior de novo assembly results for a number of real datasets. CARE is also the first multiple sequence alignment-based error corrector that is able to process a human genome Illumina NGS dataset in only 4 h on a single workstation using GPU acceleration. Availability and implementation CARE is open-source software written in C++ (CPU version) and in CUDA/C++ (GPU version). It is licensed under GPLv3 and can be downloaded at https://github.com/fkallen/CARE. Supplementary information Supplementary data are available at Bioinformatics online.
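The minhashing idea behind CARE's candidate search can be sketched in a few lines: hash each read's k-mers under several hash functions, keep each minimum, and compare signatures slot by slot to estimate k-mer set similarity. This is a generic MinHash over k-mers, not CARE's actual C++/CUDA implementation; the choices of k, the number of hash functions, and the salted hashing scheme are all illustrative assumptions.

```python
import random

def kmers(seq, k=8):
    """The set of all length-k substrings of a read."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def minhash_signature(seq, k=8, num_hashes=64, seed=42):
    """For each of `num_hashes` salted hash functions, keep the minimum
    hash value over the read's k-mers."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(32) for _ in range(num_hashes)]
    # note: Python's built-in hash() is salted per process, so signatures
    # are only comparable within a single run; a real tool fixes the hash
    return [min(hash((salt, km)) for km in kmers(seq, k)) for salt in salts]

def estimate_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots is an unbiased estimate of
    the Jaccard similarity of the two k-mer sets."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
```

Reads whose estimated similarity clears a threshold become candidates for the multiple alignment; the point of the signature is that this comparison costs O(num_hashes) regardless of read length or collection size (with candidate lookup done via hash tables keyed on signature slots).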



2020 ◽  
Author(s):  
Richard Kuo ◽  
Yuanyuan Cheng ◽  
Runxuan Zhang ◽  
John W.S. Brown ◽  
Jacqueline Smith ◽  
...  

Abstract Background The human transcriptome annotation is regarded as one of the most complete of any eukaryotic species. However, limitations in sequencing technologies have biased the annotation toward multi-exonic protein-coding genes. Accurate high-throughput long read transcript sequencing can now provide stronger evidence for genes that were previously either undetectable or impossible to differentiate from sequencing noise, such as rare transcripts, mono-exonic genes, and non-coding genes. Results We analyzed Sequel II Iso-Seq sequencing data of the Universal Human Reference RNA (UHRR) using the Transcriptome Annotation by Modular Algorithms (TAMA) software. We found that the convention of using mapping identity to measure error correction performance does not reflect actual gain in accuracy of predicted transcript models. In addition, inter-read error correction leads to thousands of erroneous gene models. Using genome assembly-based error correction and gene feature evidence, we identified thousands of potentially functional novel genes. Conclusions The standard of using inter-read error correction for long read RNA sequencing data could be responsible for genome annotations with thousands of biologically inaccurate gene models. More than half of all real genes in the human genome may still be missing from current public annotations. Better methods are needed for differentiating sequencing noise from real genes in long read RNA sequencing data.


Author(s):  
Pierre Morisse ◽  
Thierry Lecroq ◽  
Arnaud Lefebvre

Abstract Third-generation sequencing technologies from Pacific Biosciences and Oxford Nanopore Technologies were made available in 2011 and 2014, respectively. In contrast with second-generation sequencing technologies such as Illumina, these new technologies allow the sequencing of long reads of tens to hundreds of kbp. These so-called long reads are particularly promising, and are especially expected to solve various problems such as contig and haplotype assembly or scaffolding. However, these reads are also much more error-prone than second-generation reads, displaying error rates of 10 to 30% depending on the sequencing technology and the version of the chemistry. Moreover, these errors are mainly composed of insertions and deletions, whereas most errors in Illumina reads are substitutions. As a result, long reads require efficient error correction, and a plethora of error correction tools directly targeted at these reads have been developed in the past nine years. These methods can adopt a hybrid approach, using complementary short reads to perform correction, or a self-correction approach, making use only of the information contained in the long read sequences. Both approaches employ various strategies such as multiple sequence alignment, de Bruijn graphs, and hidden Markov models, or combine several of them. In this paper, we present a complete survey of long-read error correction, reviewing all the methodologies and tools existing to date, for both hybrid and self-correction. Moreover, long read characteristics, such as sequencing depth, length, error rate, and sequencing technology, can affect how well a given tool or strategy performs, and can thus drastically reduce the correction quality.
We therefore also present an in-depth benchmark of available long-read error correction tools on a wide variety of datasets, composed of both simulated and real data, with various error rates, coverages, and read lengths, ranging from small bacterial to large mammalian genomes.
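Of the strategies the survey lists, the de Bruijn graph is the simplest to sketch: reads are decomposed into k-mers, and each k-mer becomes an edge from its (k-1)-mer prefix to its (k-1)-mer suffix, so sequencing errors show up as low-coverage branches that correctors can prune. A minimal construction (the choice of k and the example reads are illustrative, and real tools add coverage counting and error pruning on top):

```python
from collections import defaultdict

def build_dbg(reads, k=4):
    """de Bruijn graph with (k-1)-mer nodes; one edge entry per k-mer
    occurrence, so repeated edges record coverage."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])  # prefix node -> suffix node
    return graph
```

Because overlapping reads contribute the same edges, edge multiplicity approximates coverage: in correction, a long read is threaded through the graph and low-multiplicity detours are replaced by the high-coverage path.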


2020 ◽  
Vol 38 (6) ◽  
pp. 701-707 ◽  
Author(s):  
Eli L. Moss ◽  
Dylan G. Maghini ◽  
Ami S. Bhatt

Abstract Microbial genomes can be assembled from short-read sequencing data, but the assembly contiguity of these metagenome-assembled genomes is constrained by repeat elements. Correct assignment of genomic positions of repeats is crucial for understanding the effect of genome structure on genome function. We applied nanopore sequencing and our workflow, named Lathe, which incorporates long-read assembly and short-read error correction, to assemble closed bacterial genomes from complex microbiomes. We validated our approach with a synthetic mixture of 12 bacterial species: seven genomes were completely assembled into single contigs and three were assembled into four or fewer contigs. Next, we used our methods to analyze metagenomic data from 13 human stool samples. We assembled 20 circular genomes, including genomes of Prevotella copri and a candidate Cibiobacter sp. Despite the decreased nucleotide accuracy compared with alternative sequencing and assembly approaches, our methods improved assembly contiguity, allowing for investigation of the role of repeat elements in microbial function and adaptation.


2019 ◽  
Vol 20 (1) ◽  
Author(s):  
Nathan LaPierre ◽  
Rob Egan ◽  
Wei Wang ◽  
Zhong Wang

Abstract Background Long read sequencing technologies such as Oxford Nanopore can greatly decrease the complexity of de novo genome assembly and large structural variation identification. Currently, Nanopore reads have high error rates, and the errors often cluster into low-quality segments within the reads. The limited sensitivity of existing read-based error correction methods can cause large-scale mis-assemblies in the assembled genomes, motivating further innovation in this area. Results Here we developed a Convolutional Neural Network (CNN)-based method, called MiniScrub, for identification and subsequent “scrubbing” (removal) of low-quality Nanopore read segments to minimize their interference in the downstream assembly process. MiniScrub first generates read-to-read overlaps via MiniMap2, then encodes the overlaps into images, and finally builds CNN models to predict low-quality segments. Applying MiniScrub to real-world control datasets under several different parameters, we show that it robustly improves read quality and improves read error correction in the metagenome setting. Compared to raw reads, de novo genome assembly with scrubbed reads produces many fewer mis-assemblies and large indel errors. Conclusions MiniScrub is able to robustly improve the read quality of Oxford Nanopore reads, especially in the metagenome setting, making it useful for downstream applications such as de novo assembly. We propose MiniScrub as a tool for preprocessing Nanopore reads for downstream analyses. MiniScrub is open-source software and is available at https://bitbucket.org/berkeleylab/jgi-miniscrub.
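The scrubbing step itself can be sketched without the CNN: slide a window along the read, flag windows whose mean quality falls below a threshold, and keep only the clean segments. This toy version substitutes per-base quality scores for MiniScrub's learned overlap features, and the window size, threshold, and keep-longest-segment policy are arbitrary assumptions.

```python
def scrub_read(seq, quals, window=5, min_mean_q=10):
    """Mark every base covered by a window whose mean quality is below
    `min_mean_q`, then return the longest surviving segment."""
    keep = [True] * len(seq)
    for i in range(len(seq) - window + 1):
        if sum(quals[i:i + window]) / window < min_mean_q:
            for j in range(i, i + window):
                keep[j] = False
    # collect contiguous kept segments and return the longest one
    segments, cur = [], ""
    for base, ok in zip(seq, keep):
        if ok:
            cur += base
        elif cur:
            segments.append(cur)
            cur = ""
    if cur:
        segments.append(cur)
    return max(segments, key=len) if segments else ""
```

The interesting part MiniScrub replaces is the scoring function: instead of raw base qualities, it predicts segment quality from images of read-to-read overlaps, which is far more sensitive to the clustered errors the abstract describes.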


2019 ◽  
Vol 21 (4) ◽  
pp. 1164-1181 ◽  
Author(s):  
Leandro Lima ◽  
Camille Marchet ◽  
Ségolène Caboche ◽  
Corinne Da Silva ◽  
Benjamin Istace ◽  
...  

Abstract Motivation Nanopore long-read sequencing technology offers promising alternatives to high-throughput short read sequencing, especially in the context of RNA-sequencing. However, this technology is currently hindered by high error rates in the output data that affect analyses such as the identification of isoforms, exon boundaries, and open reading frames, and the creation of gene catalogues. Due to the novelty of such data, computational methods are still actively being developed, and options for the error correction of Nanopore RNA-sequencing long reads remain limited. Results In this article, we evaluate the extent to which existing long-read DNA error correction methods are capable of correcting cDNA Nanopore reads. We provide an automatic and extensive benchmark tool that reports not only classical error correction metrics but also the effect of correction on gene families, isoform diversity, bias toward the major isoform, and splice site detection. We find that long read error correction tools that were originally developed for DNA are also suitable for the correction of Nanopore RNA-sequencing data, especially in terms of increasing base-pair accuracy. Yet investigators should be warned that the correction process perturbs gene family sizes and isoform diversity. This work provides guidelines on which (or whether) error correction tools should be used, depending on the application type. Benchmarking software: https://gitlab.com/leoisl/LR_EC_analyser

