Long Read Error Correction Algorithm Based on the de Bruijn Graph for the Third-generation Sequencing

Author(s):  
Bin Hou ◽  
Rongshu Wang ◽  
Jianhua Chen

BMC Genomics ◽
2019 ◽  
Vol 20 (S11) ◽  
Author(s):  
Arghya Kusum Das ◽  
Sayan Goswami ◽  
Kisung Lee ◽  
Seung-Jong Park

Abstract Background Long-read sequencing has shown promise in overcoming the short-length limitations of second-generation sequencing by providing more complete assemblies. However, computation on long sequencing reads is challenged by their higher error rates (e.g., 13% vs. 1%) and higher cost ($0.3 vs. $0.03 per Mbp) compared to short reads. Methods In this paper, we present a new hybrid error correction tool, called ParLECH (Parallel Long-read Error Correction using Hybrid methodology). The error correction algorithm of ParLECH is distributed in nature and efficiently utilizes the k-mer coverage information of high-throughput Illumina short-read sequences to rectify the PacBio long-read sequences. ParLECH first constructs a de Bruijn graph from the short reads, and then replaces the indel error regions of the long reads with their corresponding widest path (or maximum min-coverage path) in the short-read-based de Bruijn graph. ParLECH then utilizes the k-mer coverage information of the short reads to divide each long read into a sequence of low- and high-coverage regions, followed by a majority voting to rectify each substitution error base. Results ParLECH outperforms the latest state-of-the-art hybrid error correction methods on real PacBio datasets. Our experimental evaluation demonstrates that ParLECH can correct large-scale real-world datasets in an accurate and scalable manner. ParLECH can correct the indel errors of human-genome PacBio long reads (312 GB) with Illumina short reads (452 GB) in less than 29 h using 128 compute nodes. ParLECH can align more than 92% of the bases of an E. coli PacBio dataset with the reference genome, proving its accuracy. Conclusion ParLECH can scale to terabytes of sequencing data using hundreds of computing nodes. The proposed hybrid error correction methodology is novel and rectifies both indel and substitution errors present in the original long reads or newly introduced by the short reads.
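The widest-path (maximum min-coverage) step above can be sketched with a Dijkstra-like search that maximizes the smallest edge coverage along a path. This is a minimal illustrative sketch, not ParLECH's distributed implementation; the dict-based graph encoding, the function name, and the toy k-mer nodes are all assumptions for the example.

```python
import heapq

def widest_path(graph, source, target):
    """Widest (maximum min-coverage) path via a Dijkstra-like search.

    graph: dict mapping node -> list of (neighbor, coverage) pairs.
    Returns (bottleneck_coverage, path), or (0, None) if target is unreachable.
    """
    best = {source: float("inf")}               # best bottleneck found per node
    heap = [(-float("inf"), source, [source])]  # max-heap via negated widths
    while heap:
        neg_width, node, path = heapq.heappop(heap)
        width = -neg_width
        if node == target:
            return width, path
        if width < best.get(node, 0):           # stale heap entry, skip
            continue
        for nbr, cov in graph.get(node, []):
            new_width = min(width, cov)         # bottleneck of the extended path
            if new_width > best.get(nbr, 0):
                best[nbr] = new_width
                heapq.heappush(heap, (-new_width, nbr, path + [nbr]))
    return 0, None

# Toy de Bruijn graph: two paths from "ACG" to "GTT" with bottlenecks 8 and 3.
g = {
    "ACG": [("CGT", 10), ("CGA", 3)],
    "CGT": [("GTT", 8)],
    "CGA": [("GAT", 20)],
    "GAT": [("GTT", 20)],
    "GTT": [],
}
print(widest_path(g, "ACG", "GTT"))  # (8, ['ACG', 'CGT', 'GTT'])
```

The key difference from shortest-path Dijkstra is the relaxation rule: instead of summing edge weights, the path score is the minimum coverage seen so far, and a neighbor is updated only when a wider bottleneck is found.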


Author(s):  
E. S. Gribchenko

The transcriptome profiles of the cv. Frisson mycorrhizal roots and inoculated nitrogen-fixing nodules were investigated using Oxford Nanopore sequencing technology. A database of gene isoforms and their expression levels has been created.


2020 ◽  
Vol 71 (18) ◽  
pp. 5313-5322 ◽  
Author(s):  
Kathryn Dumschott ◽  
Maximilian H-W Schmidt ◽  
Harmeet Singh Chawla ◽  
Rod Snowdon ◽  
Björn Usadel

Abstract DNA sequencing was dominated by Sanger’s chain termination method until the mid-2000s, when it was progressively supplanted by new sequencing technologies that can generate much larger quantities of data in a shorter time. At the forefront of these developments, long-read sequencing technologies (third-generation sequencing) can produce reads that are several kilobases in length. This greatly improves the accuracy of genome assemblies by spanning the highly repetitive segments that cause difficulty for second-generation short-read technologies. Third-generation sequencing is especially appealing for plant genomes, which can be extremely large with long stretches of highly repetitive DNA. Until recently, the low basecalling accuracy of third-generation technologies meant that accurate genome assembly required expensive, high-coverage sequencing followed by computational analysis to correct for errors. However, today’s long-read technologies are more accurate and less expensive, making them the method of choice for the assembly of complex genomes. Oxford Nanopore Technologies (ONT), a third-generation platform for the sequencing of native DNA strands, is particularly suitable for the generation of high-quality assemblies of highly repetitive plant genomes. Here we discuss the benefits of ONT, especially for the plant science community, and describe the issues that remain to be addressed when using ONT for plant genome sequencing.


2020 ◽  
Vol 10 (4) ◽  
pp. 1193-1196
Author(s):  
Yoshinori Fukasawa ◽  
Luca Ermini ◽  
Hai Wang ◽  
Karen Carty ◽  
Min-Sin Cheung

We propose LongQC as an easy and automated quality control tool for genomic datasets generated by third-generation sequencing (TGS) technologies such as Oxford Nanopore Technologies (ONT) and SMRT sequencing from Pacific Biosciences (PacBio). Key statistics are optimized for long-read data, and LongQC covers all major TGS platforms. LongQC processes and visualizes those statistics automatically and quickly.
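Two statistics typical of long-read QC reports are the read-length N50 and a mean per-read quality derived from Phred scores. The sketch below shows how such statistics are commonly computed; it is an illustrative assumption about the kind of metrics involved, not LongQC's actual code.

```python
import math

def read_length_n50(lengths):
    """N50: the length L such that reads of length >= L together
    contain at least half of all sequenced bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0

def mean_read_quality(phred_scores):
    """Average Phred scores in probability space (Phred is logarithmic,
    so arithmetic-averaging the scores directly would overstate quality)."""
    probs = [10 ** (-q / 10) for q in phred_scores]
    mean_p = sum(probs) / len(probs)
    return -10 * math.log10(mean_p)

print(read_length_n50([10, 10, 10, 70]))  # 70
```

Averaging in probability space matters for long reads: a single low-quality read drags the mean error probability up far more than a naive average of Phred values would suggest.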


BMC Genomics ◽  
2020 ◽  
Vol 21 (S10) ◽  
Author(s):  
Jiaqi Liu ◽  
Jiayin Wang ◽  
Xiao Xiao ◽  
Xin Lai ◽  
Daocheng Dai ◽  
...  

Abstract Background The emergence of third-generation sequencing technology, featuring longer read lengths, has demonstrated great advances over next-generation sequencing technology and greatly promoted biological research. However, third-generation sequencing data have a high sequencing error rate, which inevitably affects downstream analysis. Although sequencing errors have been decreasing in recent years, large amounts of data were produced at high error rates, and discarding them would be a huge waste. Thus, error correction for third-generation sequencing data is especially important. The existing error correction methods perform poorly at heterozygous sites, which are ubiquitous in diploid and polyploid organisms. There is therefore a lack of error correction algorithms for heterozygous loci, especially at low coverage. Results In this article, we propose an error correction method, named QIHC. QIHC is a hybrid correction method, which needs both next-generation and third-generation sequencing data. QIHC greatly enhances the sensitivity of distinguishing heterozygous sites from sequencing errors, which leads to high accuracy in error correction. To achieve this, QIHC establishes a set of probabilistic models based on a Bayesian classifier to estimate the heterozygosity of a site and makes a judgment by calculating the posterior probabilities. The proposed method consists of three modules, which respectively generate a pseudo reference sequence, obtain the read alignments, and estimate the heterozygosity of the sites and correct the reads harboring them. The last module is the core module of QIHC, which is designed to handle the calculations of multiple cases at a heterozygous site. The other two modules map the reads to the pseudo reference sequence, which overcomes the inefficiency of the multiple mappings adopted by the existing error correction methods. Conclusions To verify the performance of our method, we selected Canu and Jabba to compare with QIHC in several aspects. As QIHC is a hybrid correction method, we first conducted a group of experiments under different coverages of next-generation sequencing data. QIHC is far ahead of Jabba in accuracy. Meanwhile, we varied the coverage of the third-generation sequencing data and again compared the performance of Canu, Jabba and QIHC. QIHC outperforms the other two methods in the accuracy of both correcting sequencing errors and identifying heterozygous sites, especially at low coverage. We carried out a comparison between Canu and QIHC at different error rates of third-generation sequencing data. QIHC still performs better. Therefore, QIHC is superior to the existing error correction methods when heterozygous sites exist.
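The core idea of deciding heterozygosity from posterior probabilities can be sketched with a simple Bayesian classifier over a binomial read-count model. The error rate, prior, and two-hypothesis setup below are illustrative assumptions for the sketch, not QIHC's actual models or parameters.

```python
from math import comb

def posterior_het(ref_count, alt_count, error_rate=0.13, prior_het=0.001):
    """Posterior P(heterozygous | allele counts) under a binomial model.

    Homozygous-ref hypothesis: alt reads arise only from sequencing
    error (per-base probability error_rate).
    Heterozygous hypothesis: each read is drawn from either haplotype,
    so an alt base appears with probability 0.5.
    """
    n = ref_count + alt_count
    like_hom = comb(n, alt_count) * error_rate**alt_count * (1 - error_rate)**ref_count
    like_het = comb(n, alt_count) * 0.5**n
    num = like_het * prior_het
    return num / (num + like_hom * (1 - prior_het))
```

With a balanced 10/10 split the heterozygous hypothesis dominates despite the small prior, while a 20/1 split is far better explained by sequencing error alone; a classifier would threshold this posterior to decide whether to "correct" the minority allele or preserve it as a true variant.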


2021 ◽  
Author(s):  
Fawaz Dabbaghie ◽  
Jana Ebler ◽  
Tobias Marschall

Abstract Motivation With the fast development of third-generation sequencing machines, de novo genome assembly is becoming routine even for larger genomes. Graph-based representations of genomes arise both as part of the assembly process and in the context of pangenomes representing a population. In both cases, polymorphic loci lead to bubble structures in such graphs. Detecting bubbles is hence an important task when working with genomic variants in the context of genome graphs. Results Here, we present a fast general-purpose tool, called BubbleGun, for detecting bubbles and superbubbles in genome graphs. Furthermore, BubbleGun detects and outputs runs of linearly connected bubbles and superbubbles, which we call bubble chains. We showcase its utility on de Bruijn graphs and compare our results to vg's snarl detection. We show that BubbleGun is considerably faster than vg, especially in bigger graphs, where it reports all bubbles in less than 30 minutes on a human sample de Bruijn graph of around 2 million nodes. Availability BubbleGun is available and documented at https://github.com/fawaz-dabbaghieh/bubble_gun under MIT license. Contact [email protected] or [email protected] Supplementary information Supplementary data are available at Bioinformatics online.
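The simplest case BubbleGun-style tools look for is a "simple bubble": a source node branching into two paths of one node each that rejoin at a single sink, the typical signature of a SNP or small indel in a de Bruijn graph. The sketch below detects only this simple case on a plain directed graph; it is an illustrative sketch and not BubbleGun's algorithm, which also handles superbubbles and chains.

```python
def find_simple_bubbles(out_edges):
    """Detect simple bubbles: a source s branching to exactly two nodes,
    each with in-degree 1 and out-degree 1, rejoining at a common sink t.

    out_edges: dict mapping node -> list of successor nodes.
    Returns a list of (source, branch_a, branch_b, sink) tuples.
    """
    # Count in-degrees so we can check that branch nodes are not re-entered.
    in_deg = {}
    for node, succs in out_edges.items():
        in_deg.setdefault(node, 0)
        for s in succs:
            in_deg[s] = in_deg.get(s, 0) + 1

    bubbles = []
    for s, succs in out_edges.items():
        if len(succs) != 2:
            continue
        a, b = succs
        if (in_deg.get(a) == 1 and in_deg.get(b) == 1
                and len(out_edges.get(a, [])) == 1
                and len(out_edges.get(b, [])) == 1
                and out_edges[a][0] == out_edges[b][0]):
            bubbles.append((s, a, b, out_edges[a][0]))
    return bubbles
```

Bubble chains would then fall out of this naturally: whenever the sink of one detected bubble is the source of the next, the two belong to one linearly connected run.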

