Comparative assessment of long-read error-correction software applied to RNA-sequencing data

AbstractMotivationLong-read sequencing technologies offer promising alternatives to high-throughput short read sequencing, especially in the context of RNA-sequencing. However these technologies are currently hindered by high error rates in the output data that affect analyses such as the identification of isoforms, exon boundaries, open reading frames, and the creation of gene catalogues. Due to the novelty of such data, computational methods are still actively being developed and options for the error-correction of RNA-sequencing long reads remain limited.ResultsIn this article, we evaluate the extent to which existing long-read DNA error correction methods are capable of correcting cDNA Nanopore reads. We provide an automatic and extensive benchmark tool that not only reports classical error-correction metrics but also the effect of correction on gene families, isoform diversity, bias towards the major isoform, and splice site detection. We find that long read error-correction tools that were originally developed for DNA are also suitable for the correction of RNA-sequencing data, especially in terms of increasing base-pair accuracy. Yet investigators should be warned that the correction process perturbs gene family sizes and isoform diversity. This work provides guidelines on which (or whether) error-correction tools should be used, depending on the application type.Benchmarking softwarehttps://gitlab.com/leoisl/LR_EC_analyser

Download Full-text

Comparative assessment of long-read error correction software applied to Nanopore RNA-sequencing data

Briefings in Bioinformatics ◽

10.1093/bib/bbz058 ◽

2019 ◽

Vol 21 (4) ◽

pp. 1164-1181 ◽

Cited By ~ 9

Author(s):

Leandro Lima ◽

Camille Marchet ◽

Ségolène Caboche ◽

Corinne Da Silva ◽

Benjamin Istace ◽

...

Keyword(s):

Error Correction ◽

Rna Sequencing ◽

Gene Families ◽

Error Rates ◽

Open Reading Frames ◽

Sequencing Data ◽

Isoform Diversity ◽

Long Reads ◽

Long Read ◽

Read Error Correction

Abstract Motivation Nanopore long-read sequencing technology offers promising alternatives to high-throughput short read sequencing, especially in the context of RNA-sequencing. However this technology is currently hindered by high error rates in the output data that affect analyses such as the identification of isoforms, exon boundaries, open reading frames and creation of gene catalogues. Due to the novelty of such data, computational methods are still actively being developed and options for the error correction of Nanopore RNA-sequencing long reads remain limited. Results In this article, we evaluate the extent to which existing long-read DNA error correction methods are capable of correcting cDNA Nanopore reads. We provide an automatic and extensive benchmark tool that not only reports classical error correction metrics but also the effect of correction on gene families, isoform diversity, bias toward the major isoform and splice site detection. We find that long read error correction tools that were originally developed for DNA are also suitable for the correction of Nanopore RNA-sequencing data, especially in terms of increasing base pair accuracy. Yet investigators should be warned that the correction process perturbs gene family sizes and isoform diversity. This work provides guidelines on which (or whether) error correction tools should be used, depending on the application type. Benchmarking software https://gitlab.com/leoisl/LR_EC_analyser

Download Full-text

Illuminating the dark side of the human transcriptome with long read transcript sequencing

10.21203/rs.3.rs-23156/v1 ◽

2020 ◽

Author(s):

Richard Kuo ◽

Yuanyuan Cheng ◽

Runxuan Zhang ◽

John W.S. Brown ◽

Jacqueline Smith ◽

...

Keyword(s):

Error Correction ◽

Rna Sequencing ◽

Dark Side ◽

Sequencing Data ◽

Human Transcriptome ◽

Sequencing Technologies ◽

Long Read ◽

Read Error Correction ◽

Gene Models ◽

Genome Annotations

Abstract Background The human transcriptome annotation is regarded as one of the most complete of any eukaryotic species. However, limitations in sequencing technologies have biased the annotation toward multi-exonic protein coding genes. Accurate high-throughput long read transcript sequencing can now provide stronger evidence for genes that were previously either undetectable or impossible to differentiate from sequencing noise such as rare transcripts, mono-exonic, and non-coding genes.Results We analyzed Sequel II Iso-Seq sequencing data of the Universal Human Reference RNA (UHRR) using the Transcriptome Annotation by Modular Algorithms (TAMA) software. We found that the convention of using mapping identity to measure error correction performance does not reflect actual gain in accuracy of predicted transcript models. In addition, inter-read error correction leads to the thousands of erroneous gene models. Using genome assembly based error correction and gene feature evidence, we identified thousands of potentially functional novel genes.Conclusions The standard of using inter-read error correction for long read RNA sequencing data could be responsible for genome annotations with thousands of biologically inaccurate gene models. More than half of all real genes in the human genome may still be missing in current public annotations. We require better methods for differentiating sequencing noise from real genes in long read RNA sequencing data.

Download Full-text

De novo Nanopore read quality improvement using deep learning

BMC Bioinformatics ◽

10.1186/s12859-019-3103-z ◽

2019 ◽

Vol 20 (1) ◽

Cited By ~ 4

Author(s):

Nathan LaPierre ◽

Rob Egan ◽

Wei Wang ◽

Zhong Wang

Keyword(s):

Error Correction ◽

Genome Assembly ◽

Large Scale ◽

De Novo ◽

Error Rates ◽

De Novo Genome Assembly ◽

Sequencing Technologies ◽

Oxford Nanopore ◽

Long Read ◽

Read Error Correction

Abstract Background Long read sequencing technologies such as Oxford Nanopore can greatly decrease the complexity of de novo genome assembly and large structural variation identification. Currently Nanopore reads have high error rates, and the errors often cluster into low-quality segments within the reads. The limited sensitivity of existing read-based error correction methods can cause large-scale mis-assemblies in the assembled genomes, motivating further innovation in this area. Results Here we developed a Convolutional Neural Network (CNN) based method, called MiniScrub, for identification and subsequent “scrubbing” (removal) of low-quality Nanopore read segments to minimize their interference in downstream assembly process. MiniScrub first generates read-to-read overlaps via MiniMap2, then encodes the overlaps into images, and finally builds CNN models to predict low-quality segments. Applying MiniScrub to real world control datasets under several different parameters, we show that it robustly improves read quality, and improves read error correction in the metagenome setting. Compared to raw reads, de novo genome assembly with scrubbed reads produces many fewer mis-assemblies and large indel errors. Conclusions MiniScrub is able to robustly improve read quality of Oxford Nanopore reads, especially in the metagenome setting, making it useful for downstream applications such as de novo assembly. We propose MiniScrub as a tool for preprocessing Nanopore reads for downstream analyses. MiniScrub is open-source software and is available at https://bitbucket.org/berkeleylab/jgi-miniscrub.

Download Full-text

Long-read error correction: a survey and qualitative comparison

10.1101/2020.03.06.977975 ◽

2020 ◽

Cited By ~ 2

Author(s):

Pierre Morisse ◽

Thierry Lecroq ◽

Arnaud Lefebvre

Keyword(s):

Error Correction ◽

Second Generation ◽

Hybrid Approach ◽

Error Rates ◽

Sequencing Technology ◽

Sequencing Technologies ◽

Long Reads ◽

Long Read ◽

Read Error Correction ◽

Generation Sequencing

AbstractThird generation sequencing technologies Pacific Biosciences and Oxford Nanopore Technologies were respectively made available in 2011 and 2014. In contrast with second generation sequencing technologies such as Illumina, these new technologies allow the sequencing of long reads of tens to hundreds of kbps. These so called long reads are particularly promising, and are especially expected to solve various problems such as contig and haplotype assembly or scaffolding, for instance. However, these reads are also much more error prone than second generation reads, and display error rates reaching 10 to 30%, according to the sequencing technology and to the version of the chemistry. Moreover, these errors are mainly composed of insertions and deletions, whereas most errors are substitutions in Illumina reads. As a result, long reads require efficient error correction, and a plethora of error correction tools, directly targeted at these reads, were developed in the past nine years. These methods can adopt a hybrid approach, using complementary short reads to perform correction, or a self-correction approach, only making use of the information contained in the long reads sequences. Both these approaches make use of various strategies such as multiple sequence alignment, de Bruijn graphs, hidden Markov models, or even combine different strategies. In this paper, we describe a complete survey of long-read error correction, reviewing all the different methodologies and tools existing up to date, for both hybrid and self-correction. Moreover, the long reads characteristics, such as sequencing depth, length, error rate, or even sequencing technology, can have an impact on how well a given tool or strategy performs, and can thus drastically reduce the correction quality. We thus also present an in-depth benchmark of available long-read error correction tools, on a wide variety of datasets, composed of both simulated and real data, with various error rates, coverages, and read lengths, ranging from small bacterial to large mammal genomes.

Download Full-text

Detecting and phasing minor single-nucleotide variants from long-read sequencing data

Nature Communications ◽

10.1038/s41467-021-23289-4 ◽

2021 ◽

Vol 12 (1) ◽

Author(s):

Zhixing Feng ◽

Jose C. Clemente ◽

Brandon Wong ◽

Eric E. Schadt

Keyword(s):

Genetic Heterogeneity ◽

Error Rates ◽

Metagenomic Data ◽

Sequencing Data ◽

Single Nucleotide Variants ◽

Single Nucleotide ◽

Sequencing Technologies ◽

Oxford Nanopore ◽

Long Read ◽

Technological Limitations

AbstractCellular genetic heterogeneity is common in many biological conditions including cancer, microbiome, and co-infection of multiple pathogens. Detecting and phasing minor variants play an instrumental role in deciphering cellular genetic heterogeneity, but they are still difficult tasks because of technological limitations. Recently, long-read sequencing technologies, including those by Pacific Biosciences and Oxford Nanopore, provide an opportunity to tackle these challenges. However, high error rates make it difficult to take full advantage of these technologies. To fill this gap, we introduce iGDA, an open-source tool that can accurately detect and phase minor single-nucleotide variants (SNVs), whose frequencies are as low as 0.2%, from raw long-read sequencing data. We also demonstrate that iGDA can accurately reconstruct haplotypes in closely related strains of the same species (divergence ≥0.011%) from long-read metagenomic data.

Download Full-text

Detecting and phasing minor single-nucleotide variants from long-read sequencing data

10.1101/2020.09.25.314252 ◽

2020 ◽

Author(s):

Zhixing Feng ◽

Jose Clemente ◽

Brandon Wong ◽

Eric E. Schadt

Keyword(s):

Genetic Heterogeneity ◽

Error Rates ◽

Metagenomic Data ◽

Significant Advance ◽

Sequencing Data ◽

Single Nucleotide Variants ◽

Single Nucleotide ◽

Sequencing Technologies ◽

Oxford Nanopore ◽

Long Read

AbstractCellular genetic heterogeneity is common in many biological conditions including cancer, microbiome, co-infection of multiple pathogens. Detecting and phasing minor variants, which is to determine whether multiple variants are from the same haplotype, play an instrumental role in deciphering cellular genetic heterogeneity, but are still difficult because of technological limitations. Recently, long-read sequencing technologies, including those by Pacific Biosciences and Oxford Nanopore, have provided an unprecedented opportunity to tackle these challenges. However, high error rates make it difficult to take full advantage of these technologies. To fill this gap, we introduce iGDA, an open-source tool that can accurately detect and phase minor single-nucleotide variants (SNVs), whose frequencies are as low as 0.2%, from raw long-read sequencing data. We also demonstrated that iGDA can accurately reconstruct haplotypes in closely-related strains of the same species (divergence ≥ 0.011%) from long-read metagenomic data. Our approach, therefore, presents a significant advance towards the complete deciphering of cellular genetic heterogeneity.

Download Full-text

Novel splicing and open reading frames revealed by long-read direct RNA sequencing of adenovirus transcripts

10.1101/2019.12.13.876037 ◽

2019 ◽

Cited By ~ 2

Author(s):

Alexander M. Price ◽

Katharina E. Hayer ◽

Daniel P. Depledge ◽

Angus C. Wilson ◽

Matthew D. Weitzman

Keyword(s):

Rna Sequencing ◽

Open Reading Frames ◽

Short Read ◽

Sequencing Technologies ◽

Protein Functions ◽

Long Read ◽

Cleavage And Polyadenylation ◽

Polyadenylation Sites ◽

Splice Junctions ◽

Reading Frames

AbstractAdenovirus is a common human pathogen that relies on host cell processes for production and processing of viral RNA. Although adenoviral promoters, splice junctions, and cleavage and polyadenylation sites have been characterized using low-throughput biochemical techniques or short read cDNA-based sequencing, these technologies do not fully capture the complexity of the adenoviral transcriptome. By combining Illumina short-read and nanopore long-read direct RNA sequencing approaches, we mapped transcription start sites and cleavage and polyadenylation sites across the adenovirus genome. The canonical viral early and late RNA cassettes were confirmed, but analysis of splice junctions within long RNA reads revealed an additional 20 novel viral transcripts. These RNAs include seven new splice junctions which lead to expression of canonical open reading frames (ORF), as well as 13 transcripts encoding for messages that potentially alter protein functions through truncations or the fusion of canonical ORFs. In addition, we also detect RNAs that bypass canonical cleavage sites and generate potential chimeric proteins by linking separate gene transcription units. Our work highlights how long-read sequencing technologies can reveal further complexity within viral transcriptomes.

Download Full-text

Error correction enables use of Oxford Nanopore technology for reference-free transcriptome analysis

10.1101/2020.01.07.897512 ◽

2020 ◽

Cited By ~ 3

Author(s):

Kristoffer Sahlin ◽

Botond Sipos ◽

Phillip L James ◽

Paul Medvedev

Keyword(s):

Error Correction ◽

Transcriptome Analysis ◽

Transcription Initiation ◽

Cost Effective ◽

Error Rates ◽

Computational Method ◽

Bias Error ◽

Sequencing Data ◽

Oxford Nanopore ◽

Long Read

AbstractOxford Nanopore (ONT) is a leading long-read technology which has been revolutionizing transcriptome analysis through its capacity to sequence the majority of transcripts from end-to-end. This has greatly increased our ability to study the diversity of transcription mechanisms such as transcription initiation, termination, and alternative splicing. However, ONT still suffers from high error rates which have thus far limited its scope to reference-based analyses. When a reference is not available or is not a viable option due to reference-bias, error correction is a crucial step towards the reconstruction of the sequenced transcripts and downstream sequence analysis of transcripts. In this paper, we present a novel computational method to error correct ONT cDNA sequencing data, called isONcorrect. IsONcorrect is able to jointly use all isoforms from a gene during error correction, thereby allowing it to correct reads at low sequencing depths. We are able to obtain a median accuracy of 98.9-99.6%, demonstrating the feasibility of applying cost-effective cDNA full transcript length sequencing for reference-free transcriptome analysis.

Download Full-text

NGSpeciesID: DNA barcode and amplicon consensus generation from long-read sequencing data

10.22541/au.160262406.62842291/v2 ◽

2020 ◽

Author(s):

Kristoffer Sahlin ◽

Marisa Lim ◽

Stefan Prost

Keyword(s):

High Throughput Sequencing ◽

Dna Barcode ◽

Amplicon Sequencing ◽

Error Rates ◽

Sequencing Data ◽

Sequencing Platform ◽

Consensus Sequences ◽

Sequencing Technologies ◽

Oxford Nanopore ◽

Long Read

Third generation sequencing technologies, such as Oxford Nanopore Technologies (ONT) and Pacific Biosciences (PacBio), have gained popularity over the last years. These platforms can generate millions of long read sequences. This is not only advantageous for genome sequencing projects, but also for amplicon-based high-throughput sequencing experiments, such as DNA barcoding. However, the relatively high error rates associated with these technologies still pose challenges for generating high quality consensus sequences. Here we present NGSpeciesID, a program which can generate highly accurate consensus sequences from long-read amplicon sequencing technologies, including ONT and PacBio. The tool includes clustering of the reads to help filter out contaminants or reads with high error rates and employs polishing strategies specific to the appropriate sequencing platform. We show that NGSpeciesID produces consensus sequences with improved usability by minimizing preprocessing and software installation and scalability by enabling rapid processing of hundreds to thousands of samples, while maintaining similar consensus accuracy as current pipelines

Download Full-text

Error correction enables use of Oxford Nanopore technology for reference-free transcriptome analysis

Nature Communications ◽

10.1038/s41467-020-20340-8 ◽

2021 ◽

Vol 12 (1) ◽

Author(s):

Kristoffer Sahlin ◽

Botond Sipos ◽

Phillip L. James ◽

Paul Medvedev

Keyword(s):

Error Correction ◽

Transcriptome Analysis ◽

Transcription Initiation ◽

Cost Effective ◽

Error Rates ◽

Computational Method ◽

Bias Error ◽

Sequencing Data ◽

Oxford Nanopore ◽

Long Read

AbstractOxford Nanopore (ONT) is a leading long-read technology which has been revolutionizing transcriptome analysis through its capacity to sequence the majority of transcripts from end-to-end. This has greatly increased our ability to study the diversity of transcription mechanisms such as transcription initiation, termination, and alternative splicing. However, ONT still suffers from high error rates which have thus far limited its scope to reference-based analyses. When a reference is not available or is not a viable option due to reference-bias, error correction is a crucial step towards the reconstruction of the sequenced transcripts and downstream sequence analysis of transcripts. In this paper, we present a novel computational method to error correct ONT cDNA sequencing data, called isONcorrect. IsONcorrect is able to jointly use all isoforms from a gene during error correction, thereby allowing it to correct reads at low sequencing depths. We are able to obtain a median accuracy of 98.9–99.6%, demonstrating the feasibility of applying cost-effective cDNA full transcript length sequencing for reference-free transcriptome analysis.

Download Full-text