ELECTOR: evaluator for long reads correction methods

Abstract The error rates of third-generation sequencing data have been capped above 5%, with errors consisting mainly of insertions and deletions. Consequently, an increasing number of diverse long reads correction methods have been proposed. The quality of the correction has a huge impact on downstream processes. Therefore, developing methods that evaluate error correction tools with precise and reliable statistics is a crucial need. These evaluation methods rely on costly alignments to evaluate the quality of the corrected reads. Thus, key features must allow the fast comparison of different tools, and scale to the increasing length of the long reads. Our tool, ELECTOR, evaluates long reads correction and is directly compatible with a wide range of error correction tools. As it is based on multiple sequence alignment, we introduce a new algorithmic strategy for alignment segmentation, which enables us to scale to large instances using reasonable resources. To our knowledge, we provide the only method that allows producing reproducible correction benchmarks on the latest ultra-long reads (>100 k bases). It is also faster than the current state-of-the-art on other datasets and provides a wider set of metrics to assess the read quality improvement after correction. ELECTOR is available on GitHub (https://github.com/kamimrcht/ELECTOR) and Bioconda.
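To illustrate the kind of statistic that can be read off an alignment of a reference sequence, an uncorrected read and its corrected version, the sketch below computes a recall, a precision and a correct base rate from three pre-aligned rows. The function name `triplet_metrics`, the `-` gap symbol and the metric definitions themselves are assumptions made for this example; they are not stated in the abstract and may differ from what ELECTOR actually computes.

```python
from typing import Dict


def _ratio(numerator: int, denominator: int) -> float:
    # Arbitrary convention for this sketch: an empty denominator yields 0.0.
    return numerator / denominator if denominator else 0.0


def triplet_metrics(reference: str, uncorrected: str, corrected: str) -> Dict[str, float]:
    """Compare three equal-length rows of one alignment ('-' marks a gap).

    Hypothetical definitions used here:
      recall            = erroneous columns that were fixed / erroneous columns
      precision         = modified columns that are now right / modified columns
      correct_base_rate = columns where the corrected row matches the reference
    """
    if not (len(reference) == len(uncorrected) == len(corrected)):
        raise ValueError("aligned rows must have the same length")

    errors = fixed = modified = modified_ok = correct = 0
    for ref, raw, cor in zip(reference, uncorrected, corrected):
        if raw != ref:              # the raw read is wrong in this column
            errors += 1
            if cor == ref:
                fixed += 1
        if cor != raw:              # the corrector changed this column
            modified += 1
            if cor == ref:
                modified_ok += 1
        if cor == ref:
            correct += 1

    return {
        "recall": _ratio(fixed, errors),
        "precision": _ratio(modified_ok, modified),
        "correct_base_rate": _ratio(correct, len(reference)),
    }
```

Working column by column keeps the cost linear in the alignment length once the rows are available; the expensive part, as the abstract notes, is producing the alignment itself.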


ELECTOR: Evaluator for long reads correction methods

10.1101/512889 ◽

2019 ◽

Cited By ~ 1

Author(s):

Camille Marchet ◽

Pierre Morisse ◽

Lolita Lecompte ◽

Arnaud Lefebvre ◽

Thierry Lecroq ◽

...

Keyword(s):

Error Correction ◽

State Of The Art ◽

Error Rates ◽

Sequencing Data ◽

Third Generation Sequencing ◽

Long Reads ◽

Wide Range ◽

Downstream Processes ◽

Generation Sequencing

Abstract Motivation In the last few years, the error rates of third generation sequencing data have been capped above 5%, including many insertions and deletions. Consequently, an increasing number of long reads correction methods have been proposed to reduce the noise in these sequences. Multiple approaches exist to correct long reads, whether hybrid or self-correction methods. As the quality of the error correction has a huge impact on downstream processes, developing methods that evaluate error correction tools with precise and reliable statistics is a crucial need. Since error correction is often a resource bottleneck in long reads pipelines, a key feature of assessment methods is efficiency, in order to allow the fast comparison of different tools. Results We propose ELECTOR, a reliable and efficient tool to evaluate long reads correction, which enables the evaluation of hybrid and self-correction methods. Our tool provides a complete and relevant set of metrics to assess the read quality improvement after correction and scales to large datasets. ELECTOR is directly compatible with a wide range of state-of-the-art error correction tools, using either simulated or real long reads. We show that ELECTOR displays a wider range of metrics than the state-of-the-art tool, LRCstats, and also substantially decreases the runtime needed for assessment on all the studied datasets. Availability ELECTOR is available at https://github.com/kamimrcht/[email protected] or [email protected]


Comparative assessment of long-read error correction software applied to Nanopore RNA-sequencing data

Briefings in Bioinformatics ◽

10.1093/bib/bbz058 ◽

2019 ◽

Vol 21 (4) ◽

pp. 1164-1181 ◽

Cited By ~ 9

Author(s):

Leandro Lima ◽

Camille Marchet ◽

Ségolène Caboche ◽

Corinne Da Silva ◽

Benjamin Istace ◽

...

Keyword(s):

Error Correction ◽

Rna Sequencing ◽

Gene Families ◽

Error Rates ◽

Open Reading Frames ◽

Sequencing Data ◽

Isoform Diversity ◽

Long Reads ◽

Long Read ◽

Read Error Correction

Abstract Motivation Nanopore long-read sequencing technology offers promising alternatives to high-throughput short read sequencing, especially in the context of RNA-sequencing. However this technology is currently hindered by high error rates in the output data that affect analyses such as the identification of isoforms, exon boundaries, open reading frames and creation of gene catalogues. Due to the novelty of such data, computational methods are still actively being developed and options for the error correction of Nanopore RNA-sequencing long reads remain limited. Results In this article, we evaluate the extent to which existing long-read DNA error correction methods are capable of correcting cDNA Nanopore reads. We provide an automatic and extensive benchmark tool that not only reports classical error correction metrics but also the effect of correction on gene families, isoform diversity, bias toward the major isoform and splice site detection. We find that long read error correction tools that were originally developed for DNA are also suitable for the correction of Nanopore RNA-sequencing data, especially in terms of increasing base pair accuracy. Yet investigators should be warned that the correction process perturbs gene family sizes and isoform diversity. This work provides guidelines on which (or whether) error correction tools should be used, depending on the application type. Benchmarking software https://gitlab.com/leoisl/LR_EC_analyser


Quality of Third Generation Sequencing

Journal of Computational and Theoretical Nanoscience ◽

10.1166/jctn.2020.9630 ◽

2020 ◽

Vol 17 (12) ◽

pp. 5205-5209

Author(s):

Ali Elbialy ◽

M. A. El-Dosuky ◽

Ibrahim M. El-Henawy

Keyword(s):

Neural Network ◽

Deep Neural Network ◽

Gc Content ◽

Error Rates ◽

Third Generation ◽

Third Generation Sequencing ◽

Long Reads ◽

Generation Sequencing

Third-generation sequencing (TGS) produces long reads, but with relatively high error rates. The quality of TGS data, and how to deal with its errors, is an active topic. This paper combines and investigates three quality-related metrics: basecalling accuracy, Phred quality scores, and GC content. For basecalling accuracy, a deep neural network is adopted; the measured loss does not exceed 5.42.
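Two of the three metrics named above have simple closed forms, sketched here. A Phred score Q corresponds to an error probability of 10^(-Q/10), and GC content is the fraction of G and C bases in a sequence. The function names are hypothetical, and the quality string is assumed to use the common Phred+33 ASCII encoding; the paper's own implementation is not described in the abstract.

```python
def phred_to_error_probability(q: float) -> float:
    """Convert a Phred quality score Q into an error probability 10^(-Q/10)."""
    return 10 ** (-q / 10)


def mean_error_probability(quality_string: str, offset: int = 33) -> float:
    """Average per-base error probability of a FASTQ quality string.

    Assumes Phred+33 encoding (offset=33). Probabilities are averaged rather
    than the scores themselves, because the Phred scale is logarithmic.
    """
    if not quality_string:
        raise ValueError("quality string is empty")
    probabilities = [phred_to_error_probability(ord(c) - offset) for c in quality_string]
    return sum(probabilities) / len(probabilities)


def gc_content(sequence: str) -> float:
    """Fraction of bases that are G or C (case-insensitive); 0.0 for an empty sequence."""
    seq = sequence.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)
```

Averaging probabilities instead of raw scores avoids overstating quality when a read mixes very good and very poor bases.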


Evaluation of hybrid and non-hybrid methods for de novo assembly of nanopore reads

10.1101/030437 ◽

2015 ◽

Cited By ~ 3

Author(s):

Ivan Sovic ◽

Kresimir Krizanovic ◽

Karolj Skala ◽

Mile Sikic

Keyword(s):

De Novo Assembly ◽

De Novo ◽

Hybrid Methods ◽

Bacterial Genome ◽

Error Rates ◽

Sequencing Data ◽

E Coli ◽

Recent Emergence ◽

K 12

The recent emergence of nanopore sequencing technology has set a challenge for established assembly methods, which are not optimized for the combination of read lengths and high error rates of nanopore reads. In this work we assessed how existing de novo assembly methods perform on these reads. We benchmarked three non-hybrid (in terms of both error correction and scaffolding) assembly pipelines as well as two hybrid assemblers which use third generation sequencing data to scaffold Illumina assemblies. Tests were performed on several publicly available MinION and Illumina datasets of E. coli K-12, using several sequencing coverages of nanopore data (20x, 30x, 40x and 50x). We attempted to assess the quality of assembly at each of these coverages, to estimate the requirements for closed bacterial genome assembly. Results show that hybrid methods are highly dependent on the quality of NGS data, but much less on the quality and coverage of nanopore data, and perform relatively well on lower nanopore coverages. Furthermore, when coverage is above 40x, all non-hybrid methods correctly assemble the E. coli genome, even a non-hybrid method tailored for Pacific Biosciences reads. While that method requires higher coverage than a method designed particularly for nanopore reads, its running time is significantly lower.


A hybrid and scalable error correction algorithm for indel and substitution errors of long reads

BMC Genomics ◽

10.1186/s12864-019-6286-9 ◽

2019 ◽

Vol 20 (S11) ◽

Author(s):

Arghya Kusum Das ◽

Sayan Goswami ◽

Kisung Lee ◽

Seung-Jong Park

Keyword(s):

Error Correction ◽

Error Rates ◽

De Bruijn Graph ◽

Correction Algorithm ◽

Short Read ◽

Short Reads ◽

Long Reads ◽

Long Read ◽

De Bruijn ◽

Error Correction Algorithm

Abstract Background Long-read sequencing has shown promise in overcoming the short length limitations of second-generation sequencing by providing more complete assembly. However, computation on long sequencing reads is challenged by their higher error rates (e.g., 13% vs. 1%) and higher cost ($0.3 vs. $0.03 per Mbp) compared to short reads. Methods In this paper, we present a new hybrid error correction tool, called ParLECH (Parallel Long-read Error Correction using Hybrid methodology). The error correction algorithm of ParLECH is distributed in nature and efficiently utilizes the k-mer coverage information of high throughput Illumina short-read sequences to rectify the PacBio long-read sequences. ParLECH first constructs a de Bruijn graph from the short reads, and then replaces the indel error regions of the long reads with their corresponding widest path (or maximum min-coverage path) in the short read-based de Bruijn graph. ParLECH then utilizes the k-mer coverage information of the short reads to divide each long read into a sequence of low and high coverage regions, followed by a majority voting to rectify each substituted error base. Results ParLECH outperforms the latest state-of-the-art hybrid error correction methods on real PacBio datasets. Our experimental evaluation results demonstrate that ParLECH can correct large-scale real-world datasets in an accurate and scalable manner. ParLECH can correct the indel errors of human genome PacBio long reads (312 GB) with Illumina short reads (452 GB) in less than 29 h using 128 compute nodes. ParLECH can align more than 92% of the bases of an E. coli PacBio dataset with the reference genome, proving its accuracy. Conclusion ParLECH can scale to over terabytes of sequencing data using hundreds of computing nodes. The proposed hybrid error correction methodology is novel and rectifies both indel and substitution errors present in the original long reads or newly introduced by the short reads.
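The widest path (maximum min-coverage path) mentioned above can be sketched on a small weighted graph. This is a generic, single-machine variant of Dijkstra's algorithm that maximizes the smallest edge weight along a path; it is not ParLECH's distributed implementation. The graph layout (a dict of dicts with positive integer coverages) and the name `widest_path` are assumptions for illustration.

```python
import heapq
from typing import Dict, List, Optional, Tuple

# Hypothetical layout: node -> {successor: coverage of the edge}, coverage > 0.
Graph = Dict[str, Dict[str, int]]


def widest_path(graph: Graph, source: str, target: str) -> Optional[Tuple[float, List[str]]]:
    """Return (bottleneck coverage, path) maximizing the minimum edge coverage.

    Returns None when the target cannot be reached from the source.
    """
    best = {source: float("inf")}        # best known bottleneck per node
    parent: Dict[str, str] = {}
    heap = [(-float("inf"), source)]     # max-heap emulated with negated widths

    while heap:
        neg_width, node = heapq.heappop(heap)
        width = -neg_width
        if width < best.get(node, 0):
            continue                     # stale heap entry
        if node == target:
            break                        # first pop of the target is optimal
        for successor, coverage in graph.get(node, {}).items():
            candidate = min(width, coverage)
            if candidate > best.get(successor, 0):
                best[successor] = candidate
                parent[successor] = node
                heapq.heappush(heap, (-candidate, successor))

    if target not in best:
        return None

    path = [target]
    while path[-1] != source:
        path.append(parent[path[-1]])
    path.reverse()
    return best[target], path
```

For instance, with edges s→a (5), a→t (8), s→b (9) and b→t (2), the route through `a` is preferred because its weakest edge has coverage 5, whereas the route through `b` is limited by an edge of coverage 2.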


Comparative assessment of long-read error-correction software applied to RNA-sequencing data

10.1101/476622 ◽

2018 ◽

Cited By ~ 2

Author(s):

Leandro Lima ◽

Camille Marchet ◽

Ségolène Caboche ◽

Corinne Da Silva ◽

Benjamin Istace ◽

...

Keyword(s):

Error Correction ◽

Rna Sequencing ◽

Gene Families ◽

Error Rates ◽

Open Reading Frames ◽

Sequencing Data ◽

Sequencing Technologies ◽

Isoform Diversity ◽

Long Read ◽

Read Error Correction

Abstract Motivation Long-read sequencing technologies offer promising alternatives to high-throughput short read sequencing, especially in the context of RNA-sequencing. However these technologies are currently hindered by high error rates in the output data that affect analyses such as the identification of isoforms, exon boundaries, open reading frames, and the creation of gene catalogues. Due to the novelty of such data, computational methods are still actively being developed and options for the error-correction of RNA-sequencing long reads remain limited. Results In this article, we evaluate the extent to which existing long-read DNA error correction methods are capable of correcting cDNA Nanopore reads. We provide an automatic and extensive benchmark tool that not only reports classical error-correction metrics but also the effect of correction on gene families, isoform diversity, bias towards the major isoform, and splice site detection. We find that long read error-correction tools that were originally developed for DNA are also suitable for the correction of RNA-sequencing data, especially in terms of increasing base-pair accuracy. Yet investigators should be warned that the correction process perturbs gene family sizes and isoform diversity. This work provides guidelines on which (or whether) error-correction tools should be used, depending on the application type. Benchmarking software https://gitlab.com/leoisl/LR_EC_analyser


HECIL: A Hybrid Error Correction Algorithm for Long Reads with Iterative Learning

10.1101/162917 ◽

2017 ◽

Author(s):

Olivia Choudhury ◽

Ankush Chakrabarty ◽

Scott J. Emrich

Keyword(s):

Error Correction ◽

Real Data ◽

Error Rates ◽

Iterative Learning ◽

Sequencing Error ◽

Full Potential ◽

Long Reads ◽

Second Generation Sequencing ◽

Sequencing Platforms ◽

Generation Sequencing

Abstract Second-generation sequencing techniques generate short reads that can result in fragmented genome assemblies. Third-generation sequencing platforms mitigate this limitation by producing longer reads that span across complex and repetitive regions. Currently, the usefulness of such long reads is limited, however, because of high sequencing error rates. To exploit the full potential of these longer reads, it is imperative to correct the underlying errors. We propose HECIL (Hybrid Error Correction with Iterative Learning), a hybrid error correction framework that determines a correction policy for erroneous long reads, based on optimal combinations of decision weights obtained from short read alignments. We demonstrate that HECIL outperforms state-of-the-art error correction algorithms for an overwhelming majority of evaluation metrics on diverse real data sets including E. coli, S. cerevisiae, and the malaria vector mosquito A. funestus. We further improve the performance of HECIL by introducing an iterative learning paradigm that improves the correction policy at each iteration by incorporating knowledge gathered from previous iterations via confidence metrics assigned to prior corrections. Availability and Implementation https://github.com/NDBL/[email protected]


Scalable long read self-correction and assembly polishing with multiple sequence alignment

Scientific Reports ◽

10.1038/s41598-020-80757-5 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Pierre Morisse ◽

Camille Marchet ◽

Antoine Limasset ◽

Thierry Lecroq ◽

Arnaud Lefebvre

Keyword(s):

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Correction Method ◽

Error Rates ◽

Multiple Sequence ◽

De Bruijn Graphs ◽

Sequencing Technologies ◽

Long Reads ◽

Long Read ◽

Human Dataset

Abstract Third-generation sequencing technologies make it possible to sequence long reads of tens of kbp, which are expected to solve various problems. However, they display high error rates, currently capped around 10%. Self-correction is thus regularly used in long reads analysis projects. We introduce CONSENT, a new self-correction method that relies both on multiple sequence alignment and local de Bruijn graphs. To ensure scalability, multiple sequence alignment computation benefits from a new and efficient segmentation strategy, allowing a massive speedup. CONSENT compares well to the state-of-the-art, and performs better on real Oxford Nanopore data. Specifically, CONSENT is the only method that efficiently scales to ultra-long reads, and can process a full human dataset, containing reads reaching up to 1.5 Mbp, in 10 days. Moreover, our experiments show that error correction with CONSENT improves the quality of Flye assemblies. Additionally, CONSENT implements a polishing feature for correcting raw assemblies. Our experiments show that CONSENT is 2-38x faster than other polishing tools, while providing comparable results. Furthermore, we show that, on a human dataset, assembling the raw data and polishing the assembly is less resource consuming than correcting and then assembling the reads, while providing better results. CONSENT is available at https://github.com/morispi/CONSENT.


Noise-Cancelling Repeat Finder: Uncovering tandem repeats in error-prone long-read sequencing data

10.1101/475194 ◽

2018 ◽

Cited By ~ 1

Author(s):

Robert S. Harris ◽

Monika Cechova ◽

Kateryna D. Makova

Keyword(s):

Tandem Repeats ◽

Error Rates ◽

Superior Performance ◽

Whole Genome Sequencing Data ◽

Dna Repeats ◽

Sequencing Data ◽

Heat Shock Stress ◽

Noise Cancelling ◽

Long Reads ◽

Long Read

Abstract Summary Tandem DNA repeats can be sequenced with long-read technologies, but cannot be accurately deciphered due to the lack of computational tools taking high error rates of these technologies into account. Here we introduce Noise-Cancelling Repeat Finder (NCRF) to uncover putative tandem repeats of specified motifs in noisy long reads produced by Pacific Biosciences and Oxford Nanopore sequencers. Using simulations, we validated the use of NCRF to locate tandem repeats with motifs of various lengths and demonstrated its superior performance as compared to two alternative tools. Using real human whole-genome sequencing data, NCRF identified long arrays of the (AATGG)n repeat involved in heat shock stress response. Availability and implementation NCRF is implemented in C, supported by several Python scripts. Source code, under the MIT open source license, and simulation data are available at https://github.com/makovalab-psu/NoiseCancellingRepeatFinder, and also in bioconda.


GPU accelerated partial order multiple sequence alignment for long reads self-correction

10.1101/2020.02.14.946939 ◽

2020 ◽

Author(s):

Francesco Peverelli ◽

Lorenzo Di Tucci ◽

Marco D. Santambrogio ◽

Nan Ding ◽

Steven Hofmeyr ◽

...

Keyword(s):

Error Correction ◽

Partial Order ◽

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Consensus Sequence ◽

Pairwise Alignment ◽

Multiple Sequence ◽

Graph Alignment ◽

Correction Process ◽

Long Reads

Abstract As third generation sequencing technologies become more reliable and widely used to solve several genome-related problems, self-correction of long reads is becoming the preferred method to reduce the error rate of Pacific Biosciences and Oxford Nanopore long reads, which is now around 10-12%. Several of these self-correction methods rely on some form of Multiple Sequence Alignment (MSA) to obtain a consensus sequence for the original reads. In particular, error-correction tools such as RACON and CONSENT use Partial Order (PO) graph alignment to accomplish this task. PO graph alignment, which is computationally more expensive than optimal global pairwise alignment between two sequences, needs to be performed several times for each read during the error correction process. GPUs have proven very effective in accelerating several compute-intensive tasks in different scientific fields. We harnessed the power of these architectures to accelerate the error correction process of existing self-correction tools, to improve the efficiency of this step of genome analysis. In this paper, we introduce a GPU-accelerated version of the PO alignment presented in the POA v2 software library, implemented on an NVIDIA Tesla V100 GPU. We obtain up to 6.5x speedup compared to 64 CPU threads run on two 2.3 GHz 16-core Intel Xeon Processors E5-2698 v3. In our implementation we focused on the alignment of smaller sequences, as the CONSENT segmentation strategy based on k-mer chaining provides an optimal opportunity to exploit the parallel-processing power of GPUs. To demonstrate this, we have integrated our kernel in the CONSENT software. This accelerated version of CONSENT provides a speedup for the whole error correction step that ranges from 1.95x to 8.5x depending on the input reads.
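The idea of deriving a consensus sequence from a multiple sequence alignment can be shown with a deliberately simplified sketch: a column-wise majority vote over rows that are assumed to be already aligned. This is not partial order graph alignment and not the GPU kernel described above; the function name `majority_consensus` and the `-` gap symbol are assumptions for illustration.

```python
from collections import Counter
from typing import Sequence


def majority_consensus(rows: Sequence[str], gap: str = "-") -> str:
    """Column-wise majority vote over pre-aligned rows of equal length.

    Columns whose most frequent symbol is the gap are dropped from the
    consensus. Ties are resolved in favor of the symbol seen first in the
    column, which follows from Counter.most_common keeping insertion order
    for equal counts.
    """
    if not rows:
        return ""
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("all aligned rows must have the same length")

    consensus = []
    for column in zip(*rows):
        symbol, _count = Counter(column).most_common(1)[0]
        if symbol != gap:
            consensus.append(symbol)
    return "".join(consensus)
```

Real correction tools build the alignment itself, which is the expensive step that the abstract reports accelerating; the vote shown here only covers the final read-out of a consensus.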
