Hercules: a profile HMM-based hybrid error correction algorithm for long reads

2017 ◽  
Author(s):  
Can Firtina ◽  
Ziv Bar-Joseph ◽  
Can Alkan ◽  
A. Ercument Cicek

Abstract
Motivation: Choosing between second- and third-generation sequencing platforms involves a trade-off between accuracy and read length. Several analyses, including de novo assembly and fusion and structural variation detection, require reads that are both long and accurate. In such cases researchers often combine both technologies, correcting the more erroneous long reads with the short reads. Current approaches rely on various graph-based alignment techniques and do not take the error profile of the underlying technology into account. Memory- and time-efficient machine learning algorithms that address these shortcomings have the potential to achieve a better and more accurate integration of the two technologies.
Results: We designed and developed Hercules, the first machine learning-based long read error correction algorithm. The algorithm models each long read as a profile Hidden Markov Model with respect to the underlying platform's error profile, learns a posterior transition/emission probability distribution for that read, and uses the resulting model to correct its errors. Using datasets from two DNA-seq BAC clones (CH17-157L1 and CH17-227A2) and human brain cerebellum polyA RNA-seq, we show that Hercules-corrected reads have the highest mapping rate among all competing algorithms, and the highest accuracy when most of the base pairs of a long read are covered by short reads.
Availability: Hercules source code is available at https://github.com/BilkentCompGen/Hercules
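
The per-read pHMM idea can be illustrated with a much-simplified sketch, not Hercules' actual model: a single match/insert/delete Viterbi alignment in log space with fixed, illustrative probabilities rather than learned posteriors. An accurate short read is decoded against an erroneous long-read template, and the traced-back short-read bases become the corrected sequence:

```python
import math

def viterbi_correct(long_read, short_read, p_match=0.95, p_sub=0.05/3, p_indel=0.05):
    """Toy pHMM-style correction: Viterbi-align an accurate short read to an
    erroneous long-read template. Diagonal moves are match/substitution states,
    vertical moves ("I") emit a short-read base missing from the template,
    horizontal moves ("D") skip a spurious template base."""
    lm, ls, li = math.log(p_match), math.log(p_sub), math.log(p_indel)
    n, m = len(short_read), len(long_read)
    NEG = float("-inf")
    V = [[NEG] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    V[0][0] = 0.0
    for i in range(1, n + 1):
        V[i][0], back[i][0] = i * li, "I"
    for j in range(1, m + 1):
        V[0][j], back[0][j] = j * li, "D"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            emit = lm if short_read[i - 1] == long_read[j - 1] else ls
            V[i][j], back[i][j] = max(
                (V[i - 1][j - 1] + emit, "M"),  # emit short-read base at this position
                (V[i - 1][j] + li, "I"),        # insertion relative to template
                (V[i][j - 1] + li, "D"),        # deletion: drop template base
            )
    # Trace back: matched/inserted short-read bases form the corrected read.
    i, j, out = n, m, []
    while i > 0 or j > 0:
        mv = back[i][j]
        if mv == "M":
            out.append(short_read[i - 1]); i -= 1; j -= 1
        elif mv == "I":
            out.append(short_read[i - 1]); i -= 1
        else:
            j -= 1
    return "".join(reversed(out))
```

With a template carrying one substitution error, the best path stays on the diagonal and the mismatched base is replaced by the short-read base.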

BMC Genomics ◽  
2019 ◽  
Vol 20 (S11) ◽  
Author(s):  
Arghya Kusum Das ◽  
Sayan Goswami ◽  
Kisung Lee ◽  
Seung-Jong Park

Abstract
Background: Long-read sequencing has shown promise in overcoming the short-length limitations of second-generation sequencing by providing more complete assemblies. However, computation on long reads is challenged by their higher error rates (e.g., 13% vs. 1%) and higher cost ($0.3 vs. $0.03 per Mbp) compared to short reads.
Methods: In this paper, we present a new hybrid error correction tool, called ParLECH (Parallel Long-read Error Correction using Hybrid methodology). The error correction algorithm of ParLECH is distributed in nature and efficiently utilizes the k-mer coverage information of high-throughput Illumina short-read sequences to rectify PacBio long-read sequences. ParLECH first constructs a de Bruijn graph from the short reads, and then replaces the indel error regions of the long reads with their corresponding widest path (or maximum min-coverage path) in the short-read de Bruijn graph. It then utilizes the k-mer coverage information of the short reads to divide each long read into a sequence of low- and high-coverage regions, followed by majority voting to rectify each substitution error base.
Results: ParLECH outperforms the latest state-of-the-art hybrid error correction methods on real PacBio datasets. Our experimental evaluation demonstrates that ParLECH can correct large-scale real-world datasets accurately and scalably. ParLECH can correct the indel errors of human genome PacBio long reads (312 GB) with Illumina short reads (452 GB) in less than 29 h using 128 compute nodes, and can align more than 92% of the bases of an E. coli PacBio dataset to the reference genome, demonstrating its accuracy.
Conclusion: ParLECH can scale to terabytes of sequencing data using hundreds of computing nodes. The proposed hybrid error correction methodology is novel and rectifies both indel and substitution errors present in the original long reads or newly introduced by the short reads.
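
The widest path (maximum min-coverage path) step is a classic bottleneck-path variant of Dijkstra's algorithm: instead of minimizing a sum of weights, it maximizes the minimum edge coverage along the path. A minimal sketch, assuming a toy adjacency-list graph with coverage-weighted edges (the node names and weights are illustrative, not ParLECH's actual data structures):

```python
import heapq

def widest_path(graph, src, dst):
    """Modified Dijkstra maximizing the bottleneck (minimum edge coverage).
    graph: {node: [(neighbor, coverage), ...]}. Returns (path, bottleneck)."""
    best = {src: float("inf")}  # best bottleneck coverage reachable at each node
    prev = {}
    heap = [(-best[src], src)]  # max-heap via negated bottleneck values
    visited = set()
    while heap:
        negw, u = heapq.heappop(heap)
        if u in visited:
            continue
        visited.add(u)
        if u == dst:
            break
        for v, cov in graph.get(u, []):
            cand = min(-negw, cov)  # bottleneck if we extend the path through u
            if cand > best.get(v, 0):
                best[v] = cand
                prev[v] = u
                heapq.heappush(heap, (-cand, v))
    # Reconstruct the path from dst back to src (assumes dst is reachable).
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1], best[dst]
```

In a hybrid corrector, `src` and `dst` would be the k-mers flanking an indel error region, and the returned path spells the replacement sequence.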


2015 ◽  
Author(s):  
Sara Goodwin ◽  
James Gurtowski ◽  
Scott Ethe-Sayers ◽  
Panchajanya Deshpande ◽  
Michael Schatz ◽  
...  

Monitoring the progress of DNA molecules through a membrane pore has been postulated as a method for sequencing DNA for several decades. Recently, a nanopore-based sequencing instrument, the Oxford Nanopore MinION, has become available, which we used to sequence the S. cerevisiae genome. To make use of these data, we developed a novel open-source hybrid error correction algorithm, Nanocorr (https://github.com/jgurtowski/nanocorr), specifically for Oxford Nanopore reads, as existing packages were incapable of assembling such long read lengths (5-50 kbp) at such high error rates (between ~5% and 40%). With this new method we were able to perform a hybrid error correction of the nanopore reads using complementary MiSeq data and produce a de novo assembly that is highly contiguous and accurate: the contig N50 length is more than ten times greater than that of an Illumina-only assembly (678 kbp versus 59.9 kbp), with greater than 99.88% consensus identity relative to the reference. Furthermore, the assembly with the long nanopore reads presents a much more complete representation of the features of the genome and correctly assembles gene cassettes, rRNAs, transposable elements, and other genomic features that were almost entirely absent in the Illumina-only assembly.
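
Hybrid correction of this kind ultimately derives a consensus from accurate short-read bases stacked over each long-read position. A toy sketch of the per-column majority vote (Nanocorr itself works from alignments of short reads to long reads, so the pre-computed `pileup` input here is a stand-in assumption):

```python
from collections import Counter

def consensus(pileup):
    """Majority vote per position: pileup is a list of columns, each column
    holding the short-read bases aligned over one long-read position."""
    return "".join(Counter(col).most_common(1)[0][0] for col in pileup)
```

For example, a column where three short reads report 'A' and one reports 'T' is corrected to 'A', overriding the erroneous long-read base.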


BMC Genomics ◽  
2020 ◽  
Vol 21 (S6) ◽  
Author(s):  
Haowen Zhang ◽  
Chirag Jain ◽  
Srinivas Aluru

Abstract
Background: Third-generation single-molecule sequencing technologies can sequence long reads, which is advancing the frontiers of genomics research. However, their high error rates prohibit accurate and efficient downstream analysis. This difficulty has motivated the development of many long read error correction tools, which tackle the problem by exploiting sampling redundancy and/or leveraging accurate short reads of the same biological samples. Existing studies assessing these tools use simulated data sets and are not sufficiently comprehensive in the range of software covered or the diversity of evaluation measures used.
Results: In this paper, we present a categorization and review of long read error correction methods, and provide a comprehensive evaluation of the corresponding tools. Leveraging recent real sequencing data, we establish benchmark data sets and set up evaluation criteria for a comparative assessment that covers quality of error correction as well as run-time and memory usage. We study how trimming and long read sequencing depth affect error correction in terms of length distribution and genome coverage post-correction, and the impact of error correction on an important application of long reads, genome assembly. We provide guidelines for practitioners choosing among the available error correction tools and identify directions for future research.
Conclusions: Despite the high error rate of long reads, state-of-the-art correction tools can achieve high correction quality. When short reads are available, the best hybrid methods outperform non-hybrid methods in both correction quality and computing resource usage. When choosing tools, practitioners should be cautious with the few correction tools that discard reads, and should check the effect of error correction on downstream analysis. Our evaluation code is available as open source at https://github.com/haowenz/LRECE.
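
A common correction-quality criterion in such evaluations is the gain: the fraction of errors removed by correction, measured against a known truth sequence. A minimal sketch using edit distance (the metric definition is standard, but this is an illustrative computation, not necessarily LRECE's exact implementation):

```python
def correction_gain(original, corrected, truth):
    """Gain = (errors_before - errors_after) / errors_before, where errors
    are counted as Levenshtein edit distance to the truth sequence."""
    def edit(a, b):
        # Single-row dynamic-programming edit distance.
        n = len(b)
        d = list(range(n + 1))
        for i in range(1, len(a) + 1):
            prev, d[0] = d[0], i
            for j in range(1, n + 1):
                prev, d[j] = d[j], min(d[j] + 1,          # deletion
                                       d[j - 1] + 1,      # insertion
                                       prev + (a[i - 1] != b[j - 1]))  # sub
        return d[n]
    e_before = edit(original, truth)
    e_after = edit(corrected, truth)
    return (e_before - e_after) / e_before
```

A gain of 1.0 means all errors were removed; a negative gain means correction introduced more errors than it fixed.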


2017 ◽  
Author(s):  
Pierre Morisse ◽  
Thierry Lecroq ◽  
Arnaud Lefebvre

Abstract
Motivation: The recent rise of long read sequencing technologies such as Pacific Biosciences and Oxford Nanopore makes it possible to solve assembly problems for larger and more complex genomes than short read technologies allowed. However, these long reads are very noisy, with error rates around 10-15% for Pacific Biosciences and up to 30% for Oxford Nanopore. The error correction problem has been tackled either by self-correcting the long reads or by using complementary short reads in a hybrid approach, but most methods focus only on Pacific Biosciences data and do not apply to Oxford Nanopore reads. Moreover, even though recent chemistries from Oxford Nanopore promise to lower the error rate below 15%, it is still higher in practice, and correcting such noisy long reads remains an issue.
Results: We present HG-CoLoR, a hybrid error correction method built on a seed-and-extend approach: short reads are aligned to the long reads, and the resulting seeds are then linked by traversing a variable-order de Bruijn graph built from the short reads. Our experiments show that HG-CoLoR efficiently corrects Oxford Nanopore long reads with error rates as high as 44%. Compared to other state-of-the-art long read error correction methods able to deal with Oxford Nanopore data, our experiments also show that HG-CoLoR provides the best trade-off between runtime and quality of results, and is the only method able to efficiently scale to eukaryotic genomes.
Availability and implementation: HG-CoLoR is implemented in C++, supported on Linux platforms, and freely available at https://github.com/morispi/HG-CoLoR
Contact: [email protected]
Supplementary information: Supplementary data are available at Bioinformatics online.
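
The seed-and-extend idea can be sketched by bridging two aligned seed k-mers through a de Bruijn graph built from the short reads. This toy version uses a fixed-order graph and plain BFS, whereas HG-CoLoR traverses a variable-order graph; all function names and parameters here are illustrative:

```python
from collections import deque

def build_dbg(short_reads, k):
    """Fixed-order de Bruijn graph: each k-mer points to its successors."""
    edges = {}
    for r in short_reads:
        for i in range(len(r) - k):
            edges.setdefault(r[i:i + k], set()).add(r[i + 1:i + k + 1])
    return edges

def bridge(edges, src, dst, max_len=50):
    """BFS from one seed k-mer to the next; returns the spelled-out bridging
    sequence, or None if no path exists within max_len bases."""
    q = deque([(src, src)])
    seen = {src}
    while q:
        km, seq = q.popleft()
        if km == dst:
            return seq
        if len(seq) > max_len:
            continue
        for nxt in edges.get(km, ()):
            if nxt not in seen:
                seen.add(nxt)
                q.append((nxt, seq + nxt[-1]))
    return None
```

The bridging sequence replaces the erroneous long-read region between the two seeds; in the real method, the graph order adapts to local short-read coverage.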


2021 ◽  
Vol 2021 ◽  
pp. 1-10
Author(s):  
Jie Gao

To overcome the low error-capture accuracy and long response times of traditional spoken-French error correction algorithms, this study designed a machine learning-based correction algorithm for spoken French. Building on a signal model of French spoken pronunciation, the algorithm analyzes the spectral features of the pronunciation, then selects and classifies those features to capture abnormal pronunciation signals. On this basis, the machine learning network architecture and its training process are designed, along with the algorithm's operational structure, program, development environment, and oral error identification, to complete the correction of spoken French errors. Experimental results show that the proposed algorithm achieves high error-capture accuracy and short response times, demonstrating its efficiency and timeliness.


2016 ◽  
Author(s):  
A. Bernardo Carvalho ◽  
Eduardo G Dupim ◽  
Gabriel Nassar

Genome assembly depends critically on read length. Two recent technologies, PacBio and Oxford Nanopore, produce read lengths above 20 kb, which yield genome assemblies vastly superior to those based on Sanger or short reads. However, the very high error rates of both technologies (around 15-20%) make assembly computationally expensive and imprecise at repeats longer than the read length. Here we show that the efficiency and quality of assembling these noisy reads can be significantly improved at minimal cost by leveraging the low error rate and low cost of Illumina short reads. Namely, k-mers from the raw PacBio reads that are not present in the Illumina reads (which account for ~95% of the distinct k-mers) are deemed sequencing errors and ignored at the seed alignment step. By focusing on the ~5% of k-mers that are error-free, read overlap sensitivity is dramatically increased. Equally important, the validation procedure can be extended to exclude repetitive k-mers, which avoids read miscorrection at repeats and further improves the resulting assemblies. We tested the k-mer validation procedure with one long-read technology (PacBio) and one assembler (MHAP/Celera Assembler), but it is likely to yield analogous improvements with alternative long-read technologies and overlappers, such as Oxford Nanopore and BLASR/DAligner.
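
The validation step itself is simple to sketch: keep only seed k-mers that occur in the Illumina k-mer table, and drop those whose count marks them as repetitive. The `repeat_max` threshold and function name are illustrative assumptions, not the paper's actual parameters:

```python
def validated_seeds(long_read, illumina_counts, k, repeat_max=20):
    """Return (position, k-mer) seeds from a raw long read, keeping only
    k-mers seen in the Illumina count table (absent ones are deemed
    sequencing errors) and excluding overly repetitive ones."""
    seeds = []
    for i in range(len(long_read) - k + 1):
        km = long_read[i:i + k]
        if 1 <= illumina_counts.get(km, 0) <= repeat_max:
            seeds.append((i, km))
    return seeds
```

Only the surviving seeds are fed to the overlapper, which is what raises overlap sensitivity despite discarding ~95% of the distinct raw k-mers.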

