Hercules: a profile HMM-based hybrid error correction algorithm for long reads

2017 ◽  
Author(s):  
Can Firtina ◽  
Ziv Bar-Joseph ◽  
Can Alkan ◽  
A. Ercument Cicek

Abstract
Motivation: Choosing between second- and third-generation sequencing platforms involves a trade-off between accuracy and read length. Several analyses, including de novo assembly and fusion and structural variation detection, require reads that are both long and accurate. In such cases researchers often combine both technologies, correcting the more erroneous long reads with the short reads. Current approaches rely on various graph-based alignment techniques and do not take the error profile of the underlying technology into account. Memory- and time-efficient machine learning algorithms that address these shortcomings have the potential to achieve a better and more accurate integration of the two technologies.
Results: We designed and developed Hercules, the first machine learning-based long read error correction algorithm. The algorithm models each long read as a profile Hidden Markov Model with respect to the underlying platform's error profile, learns a posterior transition/emission probability distribution for that read, and uses the resulting model to correct its errors. Using datasets from two DNA-seq BAC clones (CH17-157L1 and CH17-227A2) and human brain cerebellum polyA RNA-seq, we show that Hercules-corrected reads have the highest mapping rate among all competing algorithms, and the highest accuracy when most of the base pairs of a long read are covered by short reads.
Availability: Hercules source code is available at https://github.com/BilkentCompGen/Hercules
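
The per-read pHMM idea can be illustrated with a much-simplified sketch, not Hercules' actual model: a single match/insert/delete Viterbi alignment in log space with fixed, illustrative probabilities rather than learned posteriors. An accurate short read is decoded against an erroneous long-read template, and the traced-back short-read bases become the corrected sequence:

```python
import math

def viterbi_correct(long_read, short_read, p_match=0.95, p_sub=0.05/3, p_indel=0.05):
    """Toy pHMM-style correction: Viterbi-align an accurate short read to an
    erroneous long-read template. Diagonal moves are match/substitution states,
    vertical moves ("I") emit a short-read base missing from the template,
    horizontal moves ("D") skip a spurious template base."""
    lm, ls, li = math.log(p_match), math.log(p_sub), math.log(p_indel)
    n, m = len(short_read), len(long_read)
    NEG = float("-inf")
    V = [[NEG] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    V[0][0] = 0.0
    for i in range(1, n + 1):
        V[i][0], back[i][0] = i * li, "I"
    for j in range(1, m + 1):
        V[0][j], back[0][j] = j * li, "D"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            emit = lm if short_read[i - 1] == long_read[j - 1] else ls
            V[i][j], back[i][j] = max(
                (V[i - 1][j - 1] + emit, "M"),  # emit short-read base at this position
                (V[i - 1][j] + li, "I"),        # insertion relative to template
                (V[i][j - 1] + li, "D"),        # deletion: drop template base
            )
    # Trace back: matched/inserted short-read bases form the corrected read.
    i, j, out = n, m, []
    while i > 0 or j > 0:
        mv = back[i][j]
        if mv == "M":
            out.append(short_read[i - 1]); i -= 1; j -= 1
        elif mv == "I":
            out.append(short_read[i - 1]); i -= 1
        else:
            j -= 1
    return "".join(reversed(out))
```

With a template carrying one substitution error, the best path stays on the diagonal and the mismatched base is replaced by the short-read base.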

BMC Genomics ◽  
2019 ◽  
Vol 20 (S11) ◽  
Author(s):  
Arghya Kusum Das ◽  
Sayan Goswami ◽  
Kisung Lee ◽  
Seung-Jong Park

Abstract
Background: Long-read sequencing has shown promise in overcoming the short-length limitations of second-generation sequencing by providing more complete assemblies. However, computation on long reads is challenged by their higher error rates (e.g., 13% vs. 1%) and higher cost ($0.3 vs. $0.03 per Mbp) compared to short reads.
Methods: In this paper, we present a new hybrid error correction tool, called ParLECH (Parallel Long-read Error Correction using Hybrid methodology). The error correction algorithm of ParLECH is distributed in nature and efficiently utilizes the k-mer coverage information of high-throughput Illumina short-read sequences to rectify PacBio long-read sequences. ParLECH first constructs a de Bruijn graph from the short reads, and then replaces the indel error regions of the long reads with their corresponding widest path (or maximum min-coverage path) in the short-read de Bruijn graph. It then utilizes the k-mer coverage information of the short reads to divide each long read into a sequence of low- and high-coverage regions, followed by majority voting to rectify each substitution error base.
Results: ParLECH outperforms the latest state-of-the-art hybrid error correction methods on real PacBio datasets. Our experimental evaluation demonstrates that ParLECH can correct large-scale real-world datasets accurately and scalably. ParLECH can correct the indel errors of human genome PacBio long reads (312 GB) with Illumina short reads (452 GB) in less than 29 h using 128 compute nodes, and can align more than 92% of the bases of an E. coli PacBio dataset to the reference genome, demonstrating its accuracy.
Conclusion: ParLECH can scale to terabytes of sequencing data using hundreds of computing nodes. The proposed hybrid error correction methodology is novel and rectifies both indel and substitution errors present in the original long reads or newly introduced by the short reads.
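
The widest path (maximum min-coverage path) step is a classic bottleneck-path variant of Dijkstra's algorithm: instead of minimizing a sum of weights, it maximizes the minimum edge coverage along the path. A minimal sketch, assuming a toy adjacency-list graph with coverage-weighted edges (the node names and weights are illustrative, not ParLECH's actual data structures):

```python
import heapq

def widest_path(graph, src, dst):
    """Modified Dijkstra maximizing the bottleneck (minimum edge coverage).
    graph: {node: [(neighbor, coverage), ...]}. Returns (path, bottleneck)."""
    best = {src: float("inf")}  # best bottleneck coverage reachable at each node
    prev = {}
    heap = [(-best[src], src)]  # max-heap via negated bottleneck values
    visited = set()
    while heap:
        negw, u = heapq.heappop(heap)
        if u in visited:
            continue
        visited.add(u)
        if u == dst:
            break
        for v, cov in graph.get(u, []):
            cand = min(-negw, cov)  # bottleneck if we extend the path through u
            if cand > best.get(v, 0):
                best[v] = cand
                prev[v] = u
                heapq.heappush(heap, (-cand, v))
    # Reconstruct the path from dst back to src (assumes dst is reachable).
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1], best[dst]
```

In a hybrid corrector, `src` and `dst` would be the k-mers flanking an indel error region, and the returned path spells the replacement sequence.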


2015 ◽  
Author(s):  
Sara Goodwin ◽  
James Gurtowski ◽  
Scott Ethe-Sayers ◽  
Panchajanya Deshpande ◽  
Michael Schatz ◽  
...  

Monitoring the progress of DNA molecules through a membrane pore has been postulated as a method for sequencing DNA for several decades. Recently, a nanopore-based sequencing instrument, the Oxford Nanopore MinION, has become available, which we used to sequence the S. cerevisiae genome. To make use of these data, we developed a novel open-source hybrid error correction algorithm, Nanocorr (https://github.com/jgurtowski/nanocorr), specifically for Oxford Nanopore reads, as existing packages were incapable of assembling such long read lengths (5-50 kbp) at such high error rates (between ~5% and 40%). With this new method we were able to perform a hybrid error correction of the nanopore reads using complementary MiSeq data and produce a de novo assembly that is highly contiguous and accurate: the contig N50 length is more than ten times greater than that of an Illumina-only assembly (678 kbp versus 59.9 kbp), with greater than 99.88% consensus identity relative to the reference. Furthermore, the assembly with the long nanopore reads presents a much more complete representation of the features of the genome and correctly assembles gene cassettes, rRNAs, transposable elements, and other genomic features that were almost entirely absent in the Illumina-only assembly.
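
Hybrid correction of this kind ultimately derives a consensus from accurate short-read bases stacked over each long-read position. A toy sketch of the per-column majority vote (Nanocorr itself works from alignments of short reads to long reads, so the pre-computed `pileup` input here is a stand-in assumption):

```python
from collections import Counter

def consensus(pileup):
    """Majority vote per position: pileup is a list of columns, each column
    holding the short-read bases aligned over one long-read position."""
    return "".join(Counter(col).most_common(1)[0][0] for col in pileup)
```

For example, a column where three short reads report 'A' and one reports 'T' is corrected to 'A', overriding the erroneous long-read base.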


BMC Genomics ◽  
2020 ◽  
Vol 21 (S6) ◽  
Author(s):  
Haowen Zhang ◽  
Chirag Jain ◽  
Srinivas Aluru

Abstract
Background: Third-generation single-molecule sequencing technologies can sequence long reads, which is advancing the frontiers of genomics research. However, their high error rates prohibit accurate and efficient downstream analysis. This difficulty has motivated the development of many long read error correction tools, which tackle the problem by exploiting sampling redundancy and/or leveraging accurate short reads of the same biological samples. Existing studies assessing these tools use simulated data sets and are not sufficiently comprehensive in the range of software covered or the diversity of evaluation measures used.
Results: In this paper, we present a categorization and review of long read error correction methods, and provide a comprehensive evaluation of the corresponding tools. Leveraging recent real sequencing data, we establish benchmark data sets and set up evaluation criteria for a comparative assessment that covers quality of error correction as well as run-time and memory usage. We study how trimming and long read sequencing depth affect error correction in terms of length distribution and genome coverage post-correction, and the impact of error correction on an important application of long reads, genome assembly. We provide guidelines for practitioners choosing among the available error correction tools and identify directions for future research.
Conclusions: Despite the high error rate of long reads, state-of-the-art correction tools can achieve high correction quality. When short reads are available, the best hybrid methods outperform non-hybrid methods in both correction quality and computing resource usage. When choosing tools, practitioners should be cautious with the few correction tools that discard reads, and should check the effect of error correction on downstream analysis. Our evaluation code is available as open source at https://github.com/haowenz/LRECE.
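
A common correction-quality criterion in such evaluations is the gain: the fraction of errors removed by correction, measured against a known truth sequence. A minimal sketch using edit distance (the metric definition is standard, but this is an illustrative computation, not necessarily LRECE's exact implementation):

```python
def correction_gain(original, corrected, truth):
    """Gain = (errors_before - errors_after) / errors_before, where errors
    are counted as Levenshtein edit distance to the truth sequence."""
    def edit(a, b):
        # Single-row dynamic-programming edit distance.
        n = len(b)
        d = list(range(n + 1))
        for i in range(1, len(a) + 1):
            prev, d[0] = d[0], i
            for j in range(1, n + 1):
                prev, d[j] = d[j], min(d[j] + 1,          # deletion
                                       d[j - 1] + 1,      # insertion
                                       prev + (a[i - 1] != b[j - 1]))  # sub
        return d[n]
    e_before = edit(original, truth)
    e_after = edit(corrected, truth)
    return (e_before - e_after) / e_before
```

A gain of 1.0 means all errors were removed; a negative gain means correction introduced more errors than it fixed.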


2017 ◽  
Author(s):  
Pierre Morisse ◽  
Thierry Lecroq ◽  
Arnaud Lefebvre

Abstract
Motivation: The recent rise of long read sequencing technologies such as Pacific Biosciences and Oxford Nanopore makes it possible to solve assembly problems for larger and more complex genomes than short read technologies allowed. However, these long reads are very noisy, with error rates around 10-15% for Pacific Biosciences and up to 30% for Oxford Nanopore. The error correction problem has been tackled either by self-correcting the long reads or by using complementary short reads in a hybrid approach, but most methods focus only on Pacific Biosciences data and do not apply to Oxford Nanopore reads. Moreover, even though recent chemistries from Oxford Nanopore promise to lower the error rate below 15%, it is still higher in practice, and correcting such noisy long reads remains an issue.
Results: We present HG-CoLoR, a hybrid error correction method built on a seed-and-extend approach: short reads are aligned to the long reads, and the resulting seeds are then linked by traversing a variable-order de Bruijn graph built from the short reads. Our experiments show that HG-CoLoR efficiently corrects Oxford Nanopore long reads with error rates as high as 44%. Compared to other state-of-the-art long read error correction methods able to deal with Oxford Nanopore data, our experiments also show that HG-CoLoR provides the best trade-off between runtime and quality of results, and is the only method able to efficiently scale to eukaryotic genomes.
Availability and implementation: HG-CoLoR is implemented in C++, supported on Linux platforms, and freely available at https://github.com/morispi/HG-CoLoR
Contact: [email protected]
Supplementary information: Supplementary data are available at Bioinformatics online.
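
The seed-and-extend idea can be sketched by bridging two aligned seed k-mers through a de Bruijn graph built from the short reads. This toy version uses a fixed-order graph and plain BFS, whereas HG-CoLoR traverses a variable-order graph; all function names and parameters here are illustrative:

```python
from collections import deque

def build_dbg(short_reads, k):
    """Fixed-order de Bruijn graph: each k-mer points to its successors."""
    edges = {}
    for r in short_reads:
        for i in range(len(r) - k):
            edges.setdefault(r[i:i + k], set()).add(r[i + 1:i + k + 1])
    return edges

def bridge(edges, src, dst, max_len=50):
    """BFS from one seed k-mer to the next; returns the spelled-out bridging
    sequence, or None if no path exists within max_len bases."""
    q = deque([(src, src)])
    seen = {src}
    while q:
        km, seq = q.popleft()
        if km == dst:
            return seq
        if len(seq) > max_len:
            continue
        for nxt in edges.get(km, ()):
            if nxt not in seen:
                seen.add(nxt)
                q.append((nxt, seq + nxt[-1]))
    return None
```

The bridging sequence replaces the erroneous long-read region between the two seeds; in the real method, the graph order adapts to local short-read coverage.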


2021 ◽  
Vol 2021 ◽  
pp. 1-10
Author(s):  
Jie Gao

To overcome the low error-capture accuracy and long response times of traditional spoken-French error correction algorithms, this study designed a machine learning-based correction algorithm for spoken French. Building on a signal model of French spoken pronunciation, the algorithm analyzes the spectral features of the pronunciation, then selects and classifies those features to capture abnormal pronunciation signals. On this basis, the machine learning network architecture and its training process are designed, along with the algorithm's operational structure, program, development environment, and oral error identification, to complete the correction of spoken French errors. Experimental results show that the proposed algorithm achieves high error-capture accuracy and short response times, demonstrating its efficiency and timeliness.


2016 ◽  
Author(s):  
A. Bernardo Carvalho ◽  
Eduardo G Dupim ◽  
Gabriel Nassar

Genome assembly depends critically on read length. Two recent technologies, PacBio and Oxford Nanopore, produce read lengths above 20 kb, which yield genome assemblies vastly superior to those based on Sanger or short reads. However, the very high error rates of both technologies (around 15-20%) make assembly computationally expensive and imprecise at repeats longer than the read length. Here we show that the efficiency and quality of assembling these noisy reads can be significantly improved at minimal cost by leveraging the low error rate and low cost of Illumina short reads. Namely, k-mers from the raw PacBio reads that are not present in the Illumina reads (which account for ~95% of the distinct k-mers) are deemed sequencing errors and ignored at the seed alignment step. By focusing on the ~5% of k-mers that are error-free, read overlap sensitivity is dramatically increased. Equally important, the validation procedure can be extended to exclude repetitive k-mers, which avoids read miscorrection at repeats and further improves the resulting assemblies. We tested the k-mer validation procedure with one long-read technology (PacBio) and one assembler (MHAP/Celera Assembler), but it is likely to yield analogous improvements with alternative long-read technologies and overlappers, such as Oxford Nanopore and BLASR/DAligner.
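
The validation step itself is simple to sketch: keep only seed k-mers that occur in the Illumina k-mer table, and drop those whose count marks them as repetitive. The `repeat_max` threshold and function name are illustrative assumptions, not the paper's actual parameters:

```python
def validated_seeds(long_read, illumina_counts, k, repeat_max=20):
    """Return (position, k-mer) seeds from a raw long read, keeping only
    k-mers seen in the Illumina count table (absent ones are deemed
    sequencing errors) and excluding overly repetitive ones."""
    seeds = []
    for i in range(len(long_read) - k + 1):
        km = long_read[i:i + k]
        if 1 <= illumina_counts.get(km, 0) <= repeat_max:
            seeds.append((i, km))
    return seeds
```

Only the surviving seeds are fed to the overlapper, which is what raises overlap sensitivity despite discarding ~95% of the distinct raw k-mers.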

