A Sequence-Based Novel Approach for Quality Evaluation of Third-Generation Sequencing Reads

The advent of third-generation sequencing (TGS) technologies, such as the Pacific Biosciences (PacBio) and Oxford Nanopore machines, provides new possibilities for contig assembly, scaffolding, and high-performance computing in bioinformatics due to its long reads. However, the high error rate and poor quality of TGS reads provide new challenges for accurate genome assembly and long-read alignment. Efficient processing methods are in need to prioritize high-quality reads for improving the results of error correction and assembly. In this study, we proposed a novel Read Quality Evaluation and Selection Tool (REQUEST) for evaluating the quality of third-generation long reads. REQUEST generates training data of high-quality and low-quality reads which are characterized by their nucleotide combinations. A linear regression model was built to score the quality of reads. The method was tested on three datasets of different species. The results showed that the top-scored reads prioritized by REQUEST achieved higher alignment accuracies. The contig assembly results based on the top-scored reads also outperformed conventional approaches that use all reads. REQUEST is able to distinguish high-quality reads from low-quality ones without using reference genomes, making it a promising alternative sequence-quality evaluation method to alignment-based algorithms.

Download Full-text

Quality of Third Generation Sequencing

Journal of Computational and Theoretical Nanoscience ◽

10.1166/jctn.2020.9630 ◽

2020 ◽

Vol 17 (12) ◽

pp. 5205-5209

Author(s):

Ali Elbialy ◽

M. A. El-Dosuky ◽

Ibrahim M. El-Henawy

Keyword(s):

Neural Network ◽

Deep Neural Network ◽

Gc Content ◽

Error Rates ◽

Third Generation ◽

Third Generation Sequencing ◽

Long Reads ◽

Generation Sequencing

Third generation sequencing (TGS) relates to long reads but with relatively high error rates. Quality of TGS is a hot topic, dealing with errors. This paper combines and investigates three quality related metrics. They are basecalling accuracy, Phred Quality Scores, and GC content. For basecalling accuracy, a deep neural network is adopted. The measured loss does not exceed 5.42.

Download Full-text

IsoDetect: Detection of splice isoforms from third generation long reads based on short feature sequences

Current Bioinformatics ◽

10.2174/1574893615666200316101205 ◽

2020 ◽

Vol 15 ◽

Author(s):

Hongdong Li ◽

Wenjing Zhang ◽

Yuwen Luo ◽

Jianxin Wang

Keyword(s):

Sequence Similarity ◽

Detection Methods ◽

Sequence Information ◽

Third Generation ◽

Sequencing Data ◽

Splice Isoforms ◽

Third Generation Sequencing ◽

Long Reads ◽

Feature Sequence ◽

Generation Sequencing

Aims: Accurately detect isoforms from third generation sequencing data. Background: Transcriptome annotation is the basis for the analysis of gene expression and regulation. The transcriptome annotation of many organisms such as humans is far from incomplete, due partly to the challenge in the identification of isoforms that are produced from the same gene through alternative splicing. Third generation sequencing (TGS) reads provide unprecedented opportunity for detecting isoforms due to their long length that exceeds the length of most isoforms. One limitation of current TGS reads-based isoform detection methods is that they are exclusively based on sequence reads, without incorporating the sequence information of known isoforms. Objective: Develop an efficient method for isoform detection. Method: Based on annotated isoforms, we propose a splice isoform detection method called IsoDetect. First, the sequence at exon-exon junction is extracted from annotated isoforms as the “short feature sequence”, which is used to distinguish different splice isoforms. Second, we aligned these feature sequences to long reads and divided long reads into groups that contain the same set of feature sequences, thereby avoiding the pair-wise comparison among the large number of long reads. Third, clustering and consensus generation are carried out based on sequence similarity. For the long reads that do not contain any short feature sequence, clustering analysis based on sequence similarity is performed to identify isoforms. Result: Tested on two datasets from Calypte Anna and Zebra Finch, IsoDetect showed higher speed and compelling accuracy compared with four existing methods. Conclusion: IsoDetect is a promising method for isoform detection. Other: This paper was accepted by the CBC2019 conference.

Download Full-text

Apollo: a sequencing-technology-independent, scalable and accurate assembly polishing algorithm

Bioinformatics ◽

10.1093/bioinformatics/btaa179 ◽

2020 ◽

Vol 36 (12) ◽

pp. 3669-3679 ◽

Cited By ~ 3

Author(s):

Can Firtina ◽

Jeremie S Kim ◽

Mohammed Alser ◽

Damla Senol Cali ◽

A Ercument Cicek ◽

...

Keyword(s):

Genome Analysis ◽

Supplementary Information ◽

Third Generation ◽

Sequencing Technology ◽

Base Pairs ◽

Sequencing Technologies ◽

Third Generation Sequencing ◽

Long Reads ◽

Generation Sequencing ◽

Large Genomes

Abstract Motivation Third-generation sequencing technologies can sequence long reads that contain as many as 2 million base pairs. These long reads are used to construct an assembly (i.e. the subject’s genome), which is further used in downstream genome analysis. Unfortunately, third-generation sequencing technologies have high sequencing error rates and a large proportion of base pairs in these long reads is incorrectly identified. These errors propagate to the assembly and affect the accuracy of genome analysis. Assembly polishing algorithms minimize such error propagation by polishing or fixing errors in the assembly by using information from alignments between reads and the assembly (i.e. read-to-assembly alignment information). However, current assembly polishing algorithms can only polish an assembly using reads from either a certain sequencing technology or a small assembly. Such technology-dependency and assembly-size dependency require researchers to (i) run multiple polishing algorithms and (ii) use small chunks of a large genome to use all available readsets and polish large genomes, respectively. Results We introduce Apollo, a universal assembly polishing algorithm that scales well to polish an assembly of any size (i.e. both large and small genomes) using reads from all sequencing technologies (i.e. second- and third-generation). Our goal is to provide a single algorithm that uses read sets from all available sequencing technologies to improve the accuracy of assembly polishing and that can polish large genomes. Apollo (i) models an assembly as a profile hidden Markov model (pHMM), (ii) uses read-to-assembly alignment to train the pHMM with the Forward–Backward algorithm and (iii) decodes the trained model with the Viterbi algorithm to produce a polished assembly. Our experiments with real readsets demonstrate that Apollo is the only algorithm that (i) uses reads from any sequencing technology within a single run and (ii) scales well to polish large assemblies without splitting the assembly into multiple parts. Availability and implementation Source code is available at https://github.com/CMU-SAFARI/Apollo. Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

CoLoRd: Compressing long reads

10.1101/2021.07.17.452767 ◽

2021 ◽

Author(s):

Marek Kokot ◽

Adam Gudys ◽

Heng Li ◽

Sebastian Deorowicz

Keyword(s):

General Purpose ◽

Third Generation ◽

Sequencing Data ◽

The Third ◽

Third Generation Sequencing ◽

Long Reads ◽

Order Of Magnitude ◽

Generation Sequencing

The costs of maintaining exabytes of data produced by sequencing experiments every year has become a major issue in today's genomics. In spite of the increasing popularity of the third generation sequencing, the existing algorithms for compressing long reads exhibit minor advantage over general purpose gzip. We present CoLoRd, an algorithm able to reduce 3rd generation sequencing data by an order of magnitude without affecting the accuracy of downstream analyzes.

Download Full-text

ELECTOR: Evaluator for long reads correction methods

10.1101/512889 ◽

2019 ◽

Cited By ~ 1

Author(s):

Camille Marchet ◽

Pierre Morisse ◽

Lolita Lecompte ◽

Arnaud Lefebvre ◽

Thierry Lecroq ◽

...

Keyword(s):

Error Correction ◽

State Of The Art ◽

Error Rates ◽

Sequencing Data ◽

Third Generation Sequencing ◽

Long Reads ◽

Wide Range ◽

Downstream Processes ◽

Generation Sequencing

AbstractMotivationIn the last few years, the error rates of third generation sequencing data have been capped above 5%, including many insertions and deletions. Thereby, an increasing number of long reads correction methods have been proposed to reduce the noise in these sequences. Whether hybrid or self-correction methods, there exist multiple approaches to correct long reads. As the quality of the error correction has huge impacts on downstream processes, developing methods allowing to evaluate error correction tools with precise and reliable statistics is therefore a crucial need. Since error correction is often a resource bottleneck in long reads pipelines, a key feature of assessment methods is therefore to be efficient, in order to allow the fast comparison of different tools.ResultsWe propose ELECTOR, a reliable and efficient tool to evaluate long reads correction, that enables the evaluation of hybrid and self-correction methods. Our tool provides a complete and relevant set of metrics to assess the read quality improvement after correction and scales to large datasets. ELECTOR is directly compatible with a wide range of state-of-the-art error correction tools, using whether simulated or real long reads. We show that ELECTOR displays a wider range of metrics than the state-of-the-art tool, LRCstats, and additionally importantly decreases the runtime needed for assessment on all the studied datasets.AvailabilityELECTOR is available at https://github.com/kamimrcht/[email protected] or [email protected]

Download Full-text

TGStools: A Bioinformatics Suit to Facilitate Transcriptome Analysis of Long Reads from Third Generation Sequencing Platform

Genes ◽

10.3390/genes10070519 ◽

2019 ◽

Vol 10 (7) ◽

pp. 519

Author(s):

Danze Chen ◽

Qianqian Zhao ◽

Leiming Jiang ◽

Shuaiyuan Liao ◽

Zhigang Meng ◽

...

Keyword(s):

Transcriptome Analysis ◽

De Novo ◽

Noncoding Rnas ◽

Long Noncoding Rnas ◽

Third Generation ◽

Sequencing Platform ◽

Third Generation Sequencing ◽

Long Reads ◽

Routine Tasks ◽

Generation Sequencing

Recent analyses show that transcriptome sequencing can be utilized as a diagnostic tool for rare Mendelian diseases. The third generation sequencing de novo detects long reads of thousands of base pairs, thus greatly expanding the isoform discovery and identification of novel long noncoding RNAs. In this study, we developed TGStools, a bioinformatics suite to facilitate routine tasks such as characterizing full-length transcripts, detecting shifted types of alternative splicing, and long noncoding RNAs (lncRNAs) identification in transcriptome analysis. It also prioritizes the transcripts with a visualization framework that automatically integrates rich annotation with known genomic features. TGStools is a Python package freely available at Github.

Download Full-text

Improving protein domain classification for third-generation sequencing reads using deep learning

BMC Genomics ◽

10.1186/s12864-021-07468-7 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Nan Du ◽

Jiayu Shang ◽

Yanni Sun

Keyword(s):

Deep Learning ◽

Dna Sequences ◽

Domain Analysis ◽

Protein Domain ◽

Analysis Tool ◽

Third Generation ◽

Third Generation Sequencing ◽

Open Set ◽

Long Reads ◽

Generation Sequencing

Abstract Background With the development of third-generation sequencing (TGS) technologies, people are able to obtain DNA sequences with lengths from 10s to 100s of kb. These long reads allow protein domain annotation without assembly, thus can produce important insights into the biological functions of the underlying data. However, the high error rate in TGS data raises a new challenge to established domain analysis pipelines. The state-of-the-art methods are not optimized for noisy reads and have shown unsatisfactory accuracy of domain classification in TGS data. New computational methods are still needed to improve the performance of domain prediction in long noisy reads. Results In this work, we introduce ProDOMA, a deep learning model that conducts domain classification for TGS reads. It uses deep neural networks with 3-frame translation encoding to learn conserved features from partially correct translations. In addition, we formulate our problem as an open-set problem and thus our model can reject reads not containing the targeted domains. In the experiments on simulated long reads of protein coding sequences and real TGS reads from the human genome, our model outperforms HMMER and DeepFam on protein domain classification. Conclusions In summary, ProDOMA is a useful end-to-end protein domain analysis tool for long noisy reads without relying on error correction.

Download Full-text

Third-Generation Sequencing: The Spearhead towards the Radical Transformation of Modern Genomics

Life ◽

10.3390/life12010030 ◽

2021 ◽

Vol 12 (1) ◽

pp. 30

Author(s):

Konstantina Athanasopoulou ◽

Michaela A. Boti ◽

Panagiotis G. Adamopoulos ◽

Paraskevi C. Skourou ◽

Andreas Scorilas

Keyword(s):

De Novo ◽

Direct Detection ◽

Transcriptional Profiling ◽

Third Generation ◽

De Novo Genome Assembly ◽

Rna Molecules ◽

Third Generation Sequencing ◽

Long Reads ◽

Long Read ◽

Generation Sequencing

Although next-generation sequencing (NGS) technology revolutionized sequencing, offering a tremendous sequencing capacity with groundbreaking depth and accuracy, it continues to demonstrate serious limitations. In the early 2010s, the introduction of a novel set of sequencing methodologies, presented by two platforms, Pacific Biosciences (PacBio) and Oxford Nanopore Sequencing (ONT), gave birth to third-generation sequencing (TGS). The innovative long-read technologies turn genome sequencing into an ease-of-handle procedure by greatly reducing the average time of library construction workflows and simplifying the process of de novo genome assembly due to the generation of long reads. Long sequencing reads produced by both TGS methodologies have already facilitated the decipherment of transcriptional profiling since they enable the identification of full-length transcripts without the need for assembly or the use of sophisticated bioinformatics tools. Long-read technologies have also provided new insights into the field of epitranscriptomics, by allowing the direct detection of RNA modifications on native RNA molecules. This review highlights the advantageous features of the newly introduced TGS technologies, discusses their limitations and provides an in-depth comparison regarding their scientific background and available protocols as well as their potential utility in research and clinical applications.

Download Full-text

Evaluating approaches to find exon chains based on long reads

10.1101/066241 ◽

2016 ◽

Author(s):

Anna Kuosmanen ◽

Veli Mäkinen

Keyword(s):

Second Generation ◽

Simulated Data ◽

Error Rates ◽

Third Generation ◽

Sequencing Technologies ◽

Third Generation Sequencing ◽

Long Reads ◽

Long Read ◽

Second Generation Sequencing ◽

Generation Sequencing

AbstractMotivationTranscript prediction can be modelled as a graph problem where exons are modelled as nodes and reads spanning two or more exons are modelled as exon chains. PacBio third-generation sequencing technology produces significantly longer reads than earlier second-generation sequencing technologies, which gives valuable information about longer exon chains in a graph. However, with the high error rates of third-generation sequencing, aligning long reads correctly around the splice sites is a challenging task. Incorrect alignments lead to spurious nodes and arcs in the graph, which in turn lead to incorrect transcript predictions.ResultsWe survey several approaches to find the exon chains corresponding to long reads in a splicing graph, and experimentally study the performance of these methods using simulated data to allow for sensitivity / precision analysis. Our experiments show that short reads from second-generation sequencing can be used to significantly improve exon chain correctness either by error-correcting the long reads before splicing graph creation, or by using them to create a splicing graph on which the long read alignments are then projected. We also study the memory and time consumption of various modules, and show that accurate exon chains lead to significantly increased transcript prediction accuracy.AvailabilityThe simulated data and in-house scripts used for this article are available at http://cs.helsinki.fi/u/aekuosma/exon_chain_evaluation_publish.tar.gz.

Download Full-text

HASLR: Fast Hybrid Assembly of Long Reads

10.1101/2020.01.27.921817 ◽

2020 ◽

Cited By ~ 1

Author(s):

Ehsan Haghshenas ◽

Hossein Asghari ◽

Jens Stoye ◽

Cedric Chauve ◽

Faraz Hach

Keyword(s):

Unmet Need ◽

Effective Length ◽

Third Generation ◽

Sequencing Technologies ◽

Third Generation Sequencing ◽

Long Reads ◽

Oxford Nanopore ◽

Long Read ◽

The One ◽

Generation Sequencing

AbstractThird generation sequencing technologies from platforms such as Oxford Nanopore Technologies and Pacific Biosciences have paved the way for building more contiguous assemblies and complete reconstruction of genomes. The larger effective length of the reads generated with these technologies has provided a mean to overcome the challenges of short to mid-range repeats. Currently, accurate long read assemblers are computationally expensive while faster methods are not as accurate. Therefore, there is still an unmet need for tools that are both fast and accurate for reconstructing small and large genomes. Despite the recent advances in third generation sequencing, researchers tend to generate second generation reads for many of the analysis tasks. Here, we present HASLR, a hybrid assembler which uses both second and third generation sequencing reads to efficiently generate accurate genome assemblies. Our experiments show that HASLR is not only the fastest assembler but also the one with the lowest number of misassemblies on all the samples compared to other tested assemblers. Furthermore, the generated assemblies in terms of contiguity and accuracy are on par with the other tools on most of the samples.AvailabilityHASLR is an open source tool available at https://github.com/vpc-ccg/haslr.

Download Full-text