Error Correction in Nanopore Reads for de novo Genomic Assembly

Monitoring the progress of DNA molecules through a membrane pore has been postulated as a method for sequencing DNA for several decades. Recently, a nanopore-based sequencing instrument, the Oxford Nanopore MinION, has become available, and we used it to sequence the S. cerevisiae genome. To make use of these data, we developed a novel open-source hybrid error correction algorithm, Nanocorr (https://github.com/jgurtowski/nanocorr), specifically for Oxford Nanopore reads, as existing packages were incapable of assembling such long reads (5-50 kbp) at such high error rates (between ~5% and 40%). With this new method we were able to perform a hybrid error correction of the nanopore reads using complementary MiSeq data and produce a de novo assembly that is highly contiguous and accurate: the contig N50 length is more than ten times greater than that of an Illumina-only assembly (678 kbp versus 59.9 kbp), and the assembly has greater than 99.88% consensus identity when compared to the reference. Furthermore, the assembly with the long nanopore reads presents a much more complete representation of the features of the genome and correctly assembles gene cassettes, rRNAs, transposable elements, and other genomic features that were almost entirely absent in the Illumina-only assembly.
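
The contig N50 quoted in this abstract is the length L such that contigs of length L or longer together cover at least half of the total assembly. The sketch below shows one common way to compute it. The function name, the optional `genome_size` argument (which gives the NG50 variant, measured against a known genome size instead of the assembly size), and the contig lengths are illustrative assumptions, not values or code from the paper.

```python
def n50(contig_lengths, genome_size=None):
    """Return the N50 of the contigs, or the NG50 if genome_size is given.

    Returns None when the contigs never reach half of the target size
    (possible for NG50 with a fragmented assembly, or for an empty input).
    """
    lengths = sorted(contig_lengths, reverse=True)
    # N50 is measured against the assembly size, NG50 against the genome size.
    total = genome_size if genome_size is not None else sum(lengths)
    target = total / 2
    running = 0
    for length in lengths:
        running += length
        if running >= target:
            return length
    return None


# Hypothetical contig lengths (arbitrary units), total = 300.
example = [80, 70, 50, 40, 30, 20, 10]
print("N50:", n50(example))
print("NG50 for a genome of size 400:", n50(example, genome_size=400))
```

Because the statistic depends only on the sorted lengths, a few long contigs raise N50 sharply, which is why long-read assemblies tend to report much larger values than short-read ones.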

Athena: Automated Tuning of k-mer based Genomic Error Correction Algorithms using Language Models

Scientific Reports ◽

10.1038/s41598-019-52196-4 ◽

2019 ◽

Vol 9 (1) ◽

Author(s):

Mustafa Abdallah ◽

Ashraf Mahgoub ◽

Hany Ahmed ◽

Somali Chaterji

Keyword(s):

Error Correction ◽

Language Processing ◽

De Novo ◽

Geometric Mean ◽

Language Modeling ◽

Error Rates ◽

Language Models ◽

Hill Climbing ◽

Strong Negative Correlation ◽

Best Value

Abstract: The performance of most error-correction (EC) algorithms that operate on genomics reads is dependent on the proper choice of their configuration parameters, such as the value of k in k-mer based techniques. In this work, we target the problem of finding the best values of these configuration parameters to optimize error correction and consequently improve genome assembly. We perform this in an adaptive manner, adapted to different datasets and to EC tools, due to the observation that different configuration parameters are optimal for different datasets, i.e., from different platforms and species, and vary with the EC algorithm being applied. We use language modeling techniques from the Natural Language Processing (NLP) domain in our algorithmic suite, Athena, to automatically tune the performance-sensitive configuration parameters. Through the use of N-Gram and Recurrent Neural Network (RNN) language modeling, we validate the intuition that the EC performance can be computed quantitatively and efficiently using the “perplexity” metric, repurposed from NLP. After training the language model, we show that the perplexity metric calculated from a sample of the test (or production) data has a strong negative correlation with the quality of error correction of erroneous NGS reads. Therefore, we use the perplexity metric to guide a hill climbing-based search, converging toward the best configuration parameter value. Our approach is suitable for both de novo and comparative sequencing (resequencing), eliminating the need for a reference genome to serve as the ground truth. We find that Athena can automatically find the optimal value of k with a very high accuracy for 7 real datasets and using 3 different k-mer based EC algorithms, Lighter, Blue, and Racer.
The inverse relation between the perplexity metric and alignment rate exists under all our tested conditions: for real and synthetic datasets, for all kinds of sequencing errors (insertion, deletion, and substitution), and for high and low error rates. The absolute value of that correlation is at least 73%. In our experiments, the best value of k found by Athena achieves an alignment rate within 0.53% of the oracle best value of k found through brute force searching (i.e., scanning through the entire range of k values). Athena’s selected value of k lies within the top-3 best k values using N-Gram models and the top-5 best k values using RNN models. With best parameter selection by Athena, the assembly quality (NG50) is improved by a geometric mean of 4.72X across the 7 real datasets.
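
The idea in this abstract, scoring corrected reads by language-model perplexity and hill climbing over k toward lower perplexity, can be sketched roughly as follows. This is not Athena's implementation: the base-level n-gram model with add-one smoothing, the function names, and the `score` callable are all assumptions made for illustration. In a real pipeline, `score(k)` would run an EC tool with that k and return the perplexity of its corrected reads; here a toy quadratic stands in for it. Reads are assumed to contain only A, C, G and T.

```python
import math
from collections import Counter

ALPHABET = "ACGT"  # assumption: reads contain only these four bases


def train_ngram(reads, n=3):
    """Count n-grams and their (n-1)-base contexts over a list of reads."""
    ngrams, contexts = Counter(), Counter()
    for read in reads:
        for i in range(len(read) - n + 1):
            ngrams[read[i:i + n]] += 1
            contexts[read[i:i + n - 1]] += 1
    return ngrams, contexts


def perplexity(reads, model, n=3):
    """Perplexity of reads under an add-one-smoothed n-gram model."""
    ngrams, contexts = model
    log_prob, count = 0.0, 0
    for read in reads:
        for i in range(len(read) - n + 1):
            gram = read[i:i + n]
            # Add-one smoothing keeps unseen n-grams at non-zero probability.
            p = (ngrams[gram] + 1) / (contexts[gram[:-1]] + len(ALPHABET))
            log_prob += math.log(p)
            count += 1
    if count == 0:
        raise ValueError("all reads are shorter than n")
    return math.exp(-log_prob / count)


def hill_climb(score, start, lo, hi, step=1):
    """Move k toward lower score(k) until neither neighbour improves."""
    cache = {}

    def cached(k):
        # score(k) may be expensive (an EC run), so evaluate each k once.
        if k not in cache:
            cache[k] = score(k)
        return cache[k]

    k = start
    while True:
        neighbours = [c for c in (k - step, k + step) if lo <= c <= hi]
        best = min(neighbours + [k], key=cached)
        if cached(best) >= cached(k):
            return k  # local minimum reached
        k = best


# Toy demonstration with hypothetical data.
model = train_ngram(["ACGTACGTACGT"])
print(perplexity(["ACGTACGTACGT"], model))  # reads the model has seen
print(perplexity(["AAAAAAAAAAAA"], model))  # unseen pattern scores higher
print(hill_climb(lambda k: (k - 25) ** 2, start=15, lo=11, hi=31))
```

Hill climbing evaluates only the values of k along its path rather than the whole range, at the cost of possibly stopping in a local minimum when the perplexity curve is not unimodal.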

Hybrid error correction approach and de novo assembly for minion sequencing long reads

2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) ◽

10.1109/bibm.2016.7822504 ◽

2016 ◽

Author(s):

Mehdi Kchouk ◽

Mourad Elloumi

Keyword(s):

Error Correction ◽

De Novo Assembly ◽

De Novo ◽

Long Reads

Peer Review #1 of "NxRepair: error correction in de novo sequence assembly using Nextera mate pairs (v0.2)"

10.7287/peerj.996v0.2/reviews/1 ◽

2015 ◽

Author(s):

AE Darling

Keyword(s):

Error Correction ◽

Peer Review ◽

De Novo ◽

Sequence Assembly ◽

De Novo Sequence Assembly

MECAT: an ultra-fast mapping, error correction and de novo assembly tool for single-molecule sequencing reads

10.1101/089250 ◽

2016 ◽

Cited By ~ 2

Author(s):

Chuan-Le Xiao ◽

Ying Chen ◽

Shang-qian Xie ◽

Kai-Ning Chen ◽

Yan Wang ◽

...

Keyword(s):

Error Correction ◽

Single Molecule ◽

De Novo ◽

Computational Cost ◽

Pairwise Alignment ◽

Global Alignment ◽

Chinese Han ◽

Celera Assembler ◽

Reference Quality ◽

Molecular Sequencing

Abstract: The high computational cost of current assembly methods for long, noisy single-molecule sequencing (SMS) reads has prevented them from assembling large genomes. We introduce an ultra-fast alignment method based on a novel global alignment score. For large human SMS data, our method is 7X faster than MHAP for pairwise alignment and 15X faster than BLASR for reference mapping. We develop a Mapping, Error Correction and de novo Assembly Tool (MECAT) by integrating our new alignment and error correction methods with the Celera Assembler. MECAT is capable of producing high-quality de novo assemblies of large genomes from SMS reads at low computational cost. MECAT produces reference-quality assemblies of Saccharomyces cerevisiae, Arabidopsis thaliana, and Drosophila melanogaster, and reconstructs the human CHM1 genome with 15% longer NG50 in only 7600 CPU core hours using 54X SMS reads and a Chinese Han genome in 19200 CPU core hours using 102X SMS reads.

Fast and accurate de novo genome assembly from long uncorrected reads

10.1101/068122 ◽

2016 ◽

Cited By ~ 8

Author(s):

Robert Vaser ◽

Ivan Sović ◽

Niranjan Nagarajan ◽

Mile Šikić

Keyword(s):

Error Correction ◽

De Novo ◽

High Quality ◽

De Novo Genome Assembly ◽

Consensus Sequences ◽

Long Reads ◽

Oxford Nanopore ◽

Order Of Magnitude ◽

Correction Step ◽

Consensus Module

The assembly of long reads from Pacific Biosciences and Oxford Nanopore Technologies typically requires resource-intensive error correction and consensus generation steps to obtain high-quality assemblies. We show that the error correction step can be omitted and high-quality consensus sequences can be generated efficiently with a SIMD-accelerated, partial order alignment based stand-alone consensus module called Racon. Based on tests with PacBio and Oxford Nanopore datasets, we show that Racon coupled with Miniasm enables consensus genomes with similar or better quality than state-of-the-art methods while being an order of magnitude faster. Racon is available open source under the MIT license at https://github.com/isovic/racon.git.

Optimizing error correction of RNAseq reads

10.1101/020123 ◽

2015 ◽

Cited By ~ 4

Author(s):

Matthew D MacManes

Keyword(s):

Error Correction ◽

De Novo ◽

Sequence Data ◽

De Novo Genome Assembly ◽

Sequencing Errors ◽

Transcriptome Sequence Data ◽

Correct Sequence ◽

Processing Step ◽

Read Error Correction

Motivation: The correction of sequencing errors contained in Illumina reads derived from genomic DNA is a common pre-processing step in many de novo genome assembly pipelines, and has been shown to improve the quality of resultant assemblies. In contrast, the correction of errors in transcriptome sequence data is much less common, but can potentially yield similar improvements in mapping and assembly quality. This manuscript evaluates the ability of several popular read-correction tools to correct the sequence errors common in transcriptome-derived Illumina reads. Results: I evaluated the efficacy of correction of transcriptome-derived sequencing reads using several metrics across a variety of sequencing depths. This evaluation demonstrates a complex relationship between the quality of the correction, depth of sequencing, and hardware availability, which results in variable recommendations depending on the goals of the experiment, tolerance for false positives, and depth of coverage. Overall, read error correction is an important step in read quality control, and should become a standard part of analytical pipelines. Availability: Results are non-deterministically repeatable using AMI:ami-3dae4956 (MacManes EC 2015) and the Makefile available here: https://goo.gl/oVIuE0
