GPU accelerated partial order multiple sequence alignment for long reads self-correction

Abstract: As third-generation sequencing technologies become more reliable and widely used to solve genome-related problems, self-correction of long reads is becoming the preferred method to reduce the error rate of Pacific Biosciences and Oxford Nanopore long reads, which is now around 10-12%. Several of these self-correction methods rely on some form of Multiple Sequence Alignment (MSA) to obtain a consensus sequence for the original reads. In particular, error-correction tools such as RACON and CONSENT use Partial Order (PO) graph alignment to accomplish this task. PO graph alignment, which is computationally more expensive than optimal global pairwise alignment between two sequences, needs to be performed several times for each read during the error correction process. GPUs have proven very effective in accelerating compute-intensive tasks in different scientific fields. We harnessed the power of these architectures to accelerate the error correction process of existing self-correction tools and so improve the efficiency of this step of genome analysis. In this paper, we introduce a GPU-accelerated version of the PO alignment presented in the POA v2 software library, implemented on an NVIDIA Tesla V100 GPU. We obtain up to a 6.5x speedup compared to 64 CPU threads run on two 2.3 GHz 16-core Intel Xeon Processors E5-2698 v3. In our implementation we focused on the alignment of smaller sequences, as the CONSENT segmentation strategy based on k-mer chaining offers a good opportunity to exploit the parallel-processing power of GPUs. To demonstrate this, we have integrated our kernel into the CONSENT software. This accelerated version of CONSENT provides a speedup for the whole error correction step that ranges from 1.95x to 8.5x depending on the input reads.
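
To make the cost argument concrete, the following is a minimal CPU sketch of scoring a sequence against a partial order graph. It is not the POA v2 code or the GPU kernel described above; the function name `po_align_score`, the graph encoding (labels in topological order plus a predecessor map), and the linear gap scoring are assumptions chosen for illustration. The inner loop over predecessors is what makes each cell more expensive than in pairwise alignment.

```python
from typing import Dict, List


def po_align_score(nodes: List[str], preds: Dict[int, List[int]], seq: str,
                   match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Global alignment score of `seq` against a partial order graph.

    `nodes` holds one base per node, listed in topological order.
    `preds[i]` lists the predecessor indices of node i (absent for start nodes).
    Hypothetical encoding, chosen only for this sketch.
    """
    m = len(seq)
    # Virtual start row: a prefix of seq aligned before any graph node.
    start = [j * gap for j in range(m + 1)]
    rows: List[List[int]] = []
    has_succ = [False] * len(nodes)

    for i, base in enumerate(nodes):
        node_preds = preds.get(i, [])
        for p in node_preds:
            has_succ[p] = True
        # Start nodes fall back to the virtual start row.
        prev_rows = [rows[p] for p in node_preds] or [start]

        # Column 0: node i is aligned to a gap before any base of seq.
        row = [max(r[0] for r in prev_rows) + gap]
        for j in range(1, m + 1):
            sub = match if base == seq[j - 1] else mismatch
            best = row[j - 1] + gap  # seq[j-1] aligned to a gap
            # Unlike pairwise alignment, every predecessor row is a candidate.
            for r in prev_rows:
                best = max(best, r[j - 1] + sub, r[j] + gap)
            row.append(best)
        rows.append(row)

    # A global alignment must end on a node without successors.
    ends = [rows[i][m] for i in range(len(nodes)) if not has_succ[i]]
    return max(ends) if ends else start[m]
```

For a graph with a single bubble (A followed by either C or G, then T), both `ACT` and `AGT` follow a path exactly, while `AT` has to skip one node and pays one gap.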

Generating consensus sequences from partial order multiple sequence alignment graphs

Bioinformatics ◽

10.1093/bioinformatics/btg109 ◽

2003 ◽

Vol 19 (8) ◽

pp. 999-1008 ◽

Cited By ~ 47

Author(s):

C. Lee

Keyword(s):

Partial Order ◽

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Multiple Sequence ◽

Consensus Sequences

POAVIZ: a Partial Order Multiple Sequence Alignment Visualizer

Bioinformatics ◽

10.1093/bioinformatics/btg175 ◽

2003 ◽

Vol 19 (11) ◽

pp. 1446-1448 ◽

Cited By ~ 6

Author(s):

C. Grasso ◽

M. Quist ◽

K. Ke ◽

C. Lee

Keyword(s):

Partial Order ◽

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Multiple Sequence

Multiple sequence alignment using partial order graphs

Bioinformatics ◽

10.1093/bioinformatics/18.3.452 ◽

2002 ◽

Vol 18 (3) ◽

pp. 452-464 ◽

Cited By ~ 495

Author(s):

C. Lee ◽

C. Grasso ◽

M. F. Sharlow

Keyword(s):

Partial Order ◽

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Multiple Sequence

Multiple Sequence Alignment, Profiles and Partial Order Graphs

Introduction to Computational Proteomics ◽

10.1201/9781420010770-4 ◽

2010 ◽

pp. 105-154

Author(s):

Golan Yona

Keyword(s):

Partial Order ◽

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Multiple Sequence

Scalable long read self-correction and assembly polishing with multiple sequence alignment

Scientific Reports ◽

10.1038/s41598-020-80757-5 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Pierre Morisse ◽

Camille Marchet ◽

Antoine Limasset ◽

Thierry Lecroq ◽

Arnaud Lefebvre

Keyword(s):

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Correction Method ◽

Error Rates ◽

Multiple Sequence ◽

De Bruijn Graphs ◽

Sequencing Technologies ◽

Long Reads ◽

Long Read ◽

Human Dataset

Abstract: Third-generation sequencing technologies allow the sequencing of long reads of tens of kbp, which are expected to solve various problems. However, they display high error rates, currently capped around 10%. Self-correction is thus regularly used in long-read analysis projects. We introduce CONSENT, a new self-correction method that relies both on multiple sequence alignment and local de Bruijn graphs. To ensure scalability, multiple sequence alignment computation benefits from a new and efficient segmentation strategy, allowing a massive speedup. CONSENT compares well to the state-of-the-art, and performs better on real Oxford Nanopore data. Specifically, CONSENT is the only method that efficiently scales to ultra-long reads, and can process a full human dataset, containing reads reaching up to 1.5 Mbp, in 10 days. Moreover, our experiments show that error correction with CONSENT improves the quality of Flye assemblies. Additionally, CONSENT implements a polishing feature that allows it to correct raw assemblies. Our experiments show that CONSENT is 2 to 38 times faster than other polishing tools, while providing comparable results. Furthermore, we show that, on a human dataset, assembling the raw data and polishing the assembly is less resource-consuming than correcting and then assembling the reads, while providing better results. CONSENT is available at https://github.com/morispi/CONSENT.

Combining partial order alignment and progressive multiple sequence alignment increases alignment speed and scalability to very large alignment problems

Bioinformatics ◽

10.1093/bioinformatics/bth126 ◽

2004 ◽

Vol 20 (10) ◽

pp. 1546-1556 ◽

Cited By ~ 70

Author(s):

C. Grasso ◽

C. Lee

Keyword(s):

Partial Order ◽

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Multiple Sequence ◽

Progressive Multiple Sequence Alignment

CONSENT: Scalable long read self-correction and assembly polishing with multiple sequence alignment

10.1101/546630 ◽

2019 ◽

Cited By ~ 6

Author(s):

Pierre Morisse ◽

Camille Marchet ◽

Antoine Limasset ◽

Thierry Lecroq ◽

Arnaud Lefebvre

Keyword(s):

Sequence Alignment ◽

Multiple Sequence Alignment ◽

State Of The Art ◽

Error Rates ◽

Multiple Sequence ◽

Sequencing Technologies ◽

Long Reads ◽

Oxford Nanopore ◽

Long Read ◽

Generation Sequencing

Motivation: Third-generation sequencing technologies Pacific Biosciences and Oxford Nanopore allow the sequencing of long reads of tens of kbp, which are expected to solve various problems, such as contig and haplotype assembly, scaffolding, and structural variant calling. However, they also display high error rates that can reach 10 to 30% for basic ONT and non-CCS PacBio reads. As a result, error correction is often the first step of projects dealing with long reads. As the first long-read sequencing experiments produced reads displaying error rates higher than 15% on average, most methods relied on the complementary use of short-read data to perform correction, in a hybrid approach. However, these sequencing technologies evolve fast, and the error rate of the long reads now reaches 10 to 12%. As a result, self-correction is now frequently used as the first step of third-generation sequencing data analysis projects. As of today, efficient tools for self-correcting long reads are available, and recent observations suggest that avoiding the use of second-generation sequencing reads could bypass their inherent bias. Results: We introduce CONSENT, a new method for the self-correction of long reads that combines different strategies from the state-of-the-art. More precisely, we combine a multiple sequence alignment strategy with the use of local de Bruijn graphs. Moreover, the multiple sequence alignment benefits from an efficient segmentation strategy based on k-mer chaining, which allows a considerable speed improvement. Our experiments show that CONSENT compares well to the latest state-of-the-art self-correction methods, and even outperforms them on real Oxford Nanopore datasets. In particular, they show that CONSENT is the only method able to efficiently scale to the correction of Oxford Nanopore ultra-long reads, and is able to process a full human dataset, containing reads reaching lengths up to 1.5 Mbp, in 15 days. Additionally, CONSENT implements an assembly polishing feature, and is thus able to correct errors directly from raw long-read assemblies. Our experiments show that CONSENT outperforms state-of-the-art polishing tools in terms of resource consumption, and provides comparable results. Moreover, we also show that, for a full human dataset, assembling the raw data and polishing the assembly afterwards is less time-consuming than assembling the corrected reads, while providing better quality results. Availability and implementation: CONSENT is implemented in C++, supported on Linux platforms and freely available at https://github.com/morispi/CONSENT.
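
The segmentation idea mentioned above, k-mer chaining, can be illustrated with a small sketch. This is not CONSENT's implementation: the function names `shared_kmer_anchors` and `chain_anchors`, the restriction to k-mers that occur once in each sequence, and the quadratic dynamic program are all assumptions made for illustration. The sketch finds shared k-mers between two sequences and keeps the longest set that appears in the same order, without overlap, in both.

```python
from typing import Dict, List, Tuple

Anchor = Tuple[int, int]  # (position in a, position in b)


def shared_kmer_anchors(a: str, b: str, k: int) -> List[Anchor]:
    """Positions of k-mers that occur exactly once in both sequences."""
    def unique_positions(s: str) -> Dict[str, int]:
        seen: Dict[str, List[int]] = {}
        for i in range(len(s) - k + 1):
            seen.setdefault(s[i:i + k], []).append(i)
        return {kmer: pos[0] for kmer, pos in seen.items() if len(pos) == 1}

    pa, pb = unique_positions(a), unique_positions(b)
    return sorted((pa[kmer], pb[kmer]) for kmer in pa if kmer in pb)


def chain_anchors(anchors: List[Anchor], k: int) -> List[Anchor]:
    """Longest chain of collinear, non-overlapping anchors (O(n^2) DP).

    `anchors` must be sorted by position in the first sequence.
    """
    n = len(anchors)
    if n == 0:
        return []
    best = [1] * n   # best[i]: length of the longest chain ending at anchor i
    back = [-1] * n  # back[i]: previous anchor in that chain, -1 if none
    for i in range(n):
        for j in range(i):
            collinear = (anchors[j][0] + k <= anchors[i][0]
                         and anchors[j][1] + k <= anchors[i][1])
            if collinear and best[j] + 1 > best[i]:
                best[i] = best[j] + 1
                back[i] = j
    # Trace back from the end of the longest chain.
    i = max(range(n), key=best.__getitem__)
    chain: List[Anchor] = []
    while i != -1:
        chain.append(anchors[i])
        i = back[i]
    return chain[::-1]
```

Under these assumptions, the stretches between consecutive chained anchors are short windows that can each be aligned independently, which is one way to read the speed improvement attributed to segmentation.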

GENOME-WIDE DETECTION OF ALTERNATIVE SPLICING IN EXPRESSED SEQUENCES USING PARTIAL ORDER MULTIPLE SEQUENCE ALIGNMENT GRAPHS

Biocomputing 2004 ◽

10.1142/9789812704856_0004 ◽

2003 ◽

Cited By ~ 1

Author(s):

C. GRASSO ◽

B. MODREK ◽

Y. XING ◽

C. LEE

Keyword(s):

Alternative Splicing ◽

Partial Order ◽

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Multiple Sequence ◽

Genome Wide

PacRAT: A program to improve barcode-variant mapping from PacBio long reads using multiple sequence alignment

10.1101/2021.11.06.467314 ◽

2021 ◽

Author(s):

Chiann-Ling Cindy Yeh ◽

Clara J. Amorosi ◽

Soyeon Showman ◽

Maitreya J. Dunham

Keyword(s):

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Genetic Variants ◽

Pacbio Sequencing ◽

Multiple Sequence ◽

Read Alignment ◽

Long Reads ◽

Alignment Tool ◽

Variant Alleles

Motivation: Use of PacBio sequencing for characterizing barcoded libraries of genetic variants is on the rise. PacBio sequencing is useful in linking variant alleles in a library with their associated barcode tag. However, current approaches to resolving PacBio sequencing artifacts can result in a high number of incorrectly identified or unusable reads. Results: We developed a PacBio Read Alignment Tool (PacRAT) that improves the accuracy of barcode-variant mapping through several steps of read alignment and consensus calling. To quantify the performance of our approach, we simulated PacBio reads from eight variant libraries of various lengths and showed that PacRAT improves the accuracy in pairing barcodes and variants across these libraries. Analysis of real (non-simulated) libraries also showed an increase in the number of reads that can be used for downstream analyses when using PacRAT. Availability and Implementation: PacRAT is written in Python and is freely available on GitHub (https://github.com/dunhamlab/PacRAT).
