Burrows-Wheeler transform acceleration based on CUDA

Author(s):  
Chang Sheng ◽  
Fengzhi Dai


Author(s):  
William A Freyman ◽  
Kimberly F McManus ◽  
Suyash S Shringarpure ◽  
Ethan M Jewett ◽  
Katarzyna Bryc ◽  
...  

Estimating the genomic location and length of identical-by-descent (IBD) segments among individuals is a crucial step in many genetic analyses. However, the exponential growth in the size of biobank and direct-to-consumer (DTC) genetic datasets makes accurate IBD inference a significant computational challenge. Here we present the templated positional Burrows-Wheeler transform (TPBWT) to make fast IBD estimates robust to genotype and phasing errors. Using haplotype data simulated over pedigrees with realistic genotyping and phasing errors, we show that the TPBWT outperforms other state-of-the-art IBD inference algorithms in terms of speed and accuracy. For each phase-aware method, we explore the false positive and false negative rates of inferring IBD by segment length and characterize the types of error commonly found. Our results highlight the fragility of most phased IBD inference methods; the accuracy of IBD estimates can be highly sensitive to the quality of haplotype phasing. Additionally, we compare the performance of the TPBWT against a widely used phase-free IBD inference approach that is robust to phasing errors. We introduce both in-sample and out-of-sample TPBWT-based IBD inference algorithms and demonstrate their computational efficiency on massive-scale datasets with millions of samples. Furthermore, we describe a binary file format for TPBWT-compressed haplotypes that enables fast and efficient out-of-sample IBD computation against very large cohort panels. Finally, we demonstrate the utility of the TPBWT in a brief empirical analysis exploring geographic patterns of haplotype sharing within Mexico. Hierarchical clustering of IBD shared across regions within Mexico reveals geographically structured haplotype sharing and a strong signal of isolation by distance. Our software implementation of the TPBWT is freely available for non-commercial use in the code repository https://github.com/23andMe/phasedibd.
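For readers unfamiliar with the underlying data structure, the following minimal Python sketch shows the core ordering update of Durbin's positional Burrows-Wheeler transform (PBWT), on which the TPBWT builds. It is not the authors' templated, error-tolerant algorithm; the function and variable names are illustrative only.

```python
# Minimal sketch of the positional Burrows-Wheeler transform (PBWT)
# ordering update (Durbin 2014). This is NOT the templated,
# error-tolerant TPBWT itself; names are illustrative.

def pbwt_orderings(haplotypes):
    """haplotypes: list of equal-length 0/1 lists, one per haplotype.

    Yields, for each site k, the permutation a_k that sorts haplotypes
    by their reversed prefixes ending just before site k. Haplotypes
    sharing a long segment ending at site k appear adjacent in a_k,
    which is what makes IBD segment detection a linear sweep.
    """
    n_hap = len(haplotypes)
    n_sites = len(haplotypes[0]) if n_hap else 0
    order = list(range(n_hap))          # a_0: original input order
    for k in range(n_sites):
        yield k, order
        zeros, ones = [], []
        for h in order:                  # stable partition by allele at site k
            (zeros if haplotypes[h][k] == 0 else ones).append(h)
        order = zeros + ones             # a_{k+1}


if __name__ == "__main__":
    haps = [[0, 1, 0, 1],
            [0, 1, 0, 0],
            [1, 0, 0, 1],
            [0, 1, 1, 1]]
    for site, ordering in pbwt_orderings(haps):
        print(site, ordering)
```

Because haplotypes sharing a long segment ending at a site end up adjacent in that site's ordering, IBD candidates can be reported in a single linear sweep; the TPBWT builds on this adjacency property while tolerating genotype and phasing errors, as described in the abstract above.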


2021 ◽  
Author(s):  
Kristoffer Sahlin

Short-read genome alignment is a fundamental computational step used in many bioinformatic analyses. It is therefore desirable to align such data as fast as possible. Most alignment algorithms follow a seed-and-extend approach. Several popular programs perform the seeding step based on the Burrows-Wheeler transform with a low memory footprint, but they are relatively slow compared to more recent approaches that use a minimizer-based seeding-and-chaining strategy. Recently, syncmers and strobemers were proposed for sequence comparison. Both protocols were designed for improved conservation of matches between sequences under mutations. Syncmers are a thinning protocol proposed as an alternative to minimizers, while strobemers are a protocol for linking k-mers into gapped seeds, proposed as an alternative to k-mers. The main contribution of this work is a new seeding approach that combines syncmers and strobemers. We use a strobemer protocol (randstrobes) to link together syncmers (i.e., in syncmer space) instead of linking over the original sequence. Our protocol allows us to create longer seeds while preserving mapping accuracy. A longer seed length reduces the number of candidate regions, which allows faster mapping and alignment. We also contribute the insight that, speed-wise, this protocol is particularly effective when the syncmers are canonical. Canonical syncmers can be created for specific parameter combinations and reduce the computational burden of computing non-canonical randstrobes in reverse complement. We implement our idea in a proof-of-concept short-read aligner, strobealign, which aligns short reads 3-4x faster than minimap2 and 15-23x faster than BWA and Bowtie2. Many implementations of, e.g., BWA achieve high speed on specific hardware. Our contribution is algorithmic and requires no particular hardware architecture or system-specific instructions. Strobealign is available at https://github.com/ksahlin/StrobeAlign.
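As a rough illustration of the thinning step involved, the Python sketch below selects open syncmers. The parameters k, s, t and the hash function are placeholders, and the sketch omits the canonical-syncmer and randstrobe-linking steps that strobealign actually performs.

```python
# Illustrative sketch of open-syncmer selection, the thinning idea that
# strobealign's seeding builds on. Not strobealign's implementation.

def open_syncmers(seq, k=20, s=16, t=3):
    """Return positions of k-mers whose minimum s-mer sits at offset t.

    A k-mer contains k - s + 1 s-mers; it is kept (is a syncmer) only if
    the smallest-hashing s-mer starts at offset t inside the k-mer.
    Because the decision depends on local sequence content rather than a
    sliding window, syncmers are better conserved under mutations than
    minimizers. Python's built-in hash() is a stand-in for a proper
    k-mer hash (it is salted per process).
    """
    positions = []
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        smer_hashes = [hash(kmer[j:j + s]) for j in range(k - s + 1)]
        if smer_hashes.index(min(smer_hashes)) == t:
            positions.append(i)
    return positions


if __name__ == "__main__":
    print(open_syncmers("ACGTACGTTGCAACGTTACGTACGGTACCGTTAGC", k=10, s=6, t=2))
```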


Author(s):  
Ahmed Toman Thahab

In modern public communication networks, digital data is transmitted massively over the internet with a high risk of data piracy. Steganography is a technique for transmitting data without arousing suspicion that secret data exists. In this paper, a color image steganography technique in the spatial domain is proposed. The cover image is segmented into non-overlapping blocks that are scattered across an image-sized window using the Burrows-Wheeler transform before embedding. Secret data is embedded in each block according to its sequence in the Burrows-Wheeler transform output. The hiding method is an exclusive-or operation between a virtual bit, generated from the most significant bit, and the least significant bits of the cover pixel. The algorithm's results are analyzed in terms of degradation of the output image and embedding capacity, and are also compared with other existing methods.
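The reversible exclusive-or step can be illustrated with a short Python sketch. The paper's actual virtual-bit derivation, multi-bit embedding, and BWT-based block scattering are not reproduced here; the specific bit choices below are assumptions for illustration only.

```python
# Hedged sketch of the exclusive-or style of hiding described above:
# a "virtual bit" derived from the pixel's most significant bit is XORed
# with a secret bit and the result replaces the least significant bit.
# The exact virtual-bit derivation in the paper is not reproduced; this
# only shows why the XOR step is reversible at extraction time.

def embed_bit(pixel, secret_bit):
    """Embed one secret bit into an 8-bit pixel value (0-255)."""
    virtual_bit = (pixel >> 7) & 1          # assumption: MSB acts as the virtual bit
    new_lsb = virtual_bit ^ secret_bit      # XOR hides the secret
    return (pixel & ~1) | new_lsb           # overwrite only the LSB

def extract_bit(stego_pixel):
    """Recover the secret bit; the MSB is untouched by embedding."""
    virtual_bit = (stego_pixel >> 7) & 1
    return (stego_pixel & 1) ^ virtual_bit


if __name__ == "__main__":
    cover = 182                              # example 8-bit channel value
    stego = embed_bit(cover, 0)
    print(cover, stego, extract_bit(stego))  # cover, stego, recovered bit
```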


2020 ◽  
Vol 34 (04) ◽  
pp. 5444-5453
Author(s):  
Edward Raff ◽  
Charles Nicholas ◽  
Mark McLean

Prior work inspired by compression algorithms has described how the Burrows Wheeler Transform can be used to create a distance measure for bioinformatics problems. We describe issues with this approach that were not widely known, and introduce our new Burrows Wheeler Markov Distance (BWMD) as an alternative. The BWMD avoids the shortcomings of earlier efforts, and allows us to tackle problems in variable length DNA sequence clustering. BWMD is also more adaptable to other domains, which we demonstrate on malware classification tasks. Unlike other compression-based distance metrics known to us, BWMD works by embedding sequences into a fixed-length feature vector. This allows us to provide significantly improved clustering performance on larger malware corpora, a weakness of prior methods.
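To make the idea of a fixed-length BWT-based embedding concrete, here is a toy Python sketch that transforms a sequence with a naive BWT and summarizes its run structure into a fixed-length vector. The exact feature construction and distance used by the BWMD differ, so treat this purely as an illustration of the embedding idea, not the paper's method.

```python
# Toy illustration of a BWT-based embedding: the Burrows-Wheeler transform
# groups symbols with similar context together, so the run structure of the
# transformed string summarizes sequence redundancy. The fixed-length vector
# below (mean run-start indicator per chunk) is an assumption for
# illustration, not the exact BWMD construction from the paper.
import numpy as np

def bwt(s, sentinel="\x00"):
    """Naive O(n^2 log n) BWT via sorted rotations; fine for a demo."""
    s += sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

def embed(s, dim=8):
    """Map a string of any length to a fixed-length vector in [0, 1]^dim."""
    t = bwt(s)
    run_start = np.array([1.0] + [float(t[i] != t[i - 1]) for i in range(1, len(t))])
    chunks = np.array_split(run_start, dim)
    return np.array([c.mean() for c in chunks])

def distance(a, b):
    """Euclidean distance between the fixed-length embeddings."""
    return float(np.linalg.norm(embed(a) - embed(b)))


if __name__ == "__main__":
    print(distance("ACGTACGTACGT", "ACGTACGAACGT"))   # similar sequences
    print(distance("ACGTACGTACGT", "TTTTTTGGGGCC"))   # dissimilar sequences
```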


2018 ◽  
Author(s):  
Felipe A. Louza ◽  
Guilherme P. Telles ◽  
Simon Gog

Strings are prevalent in computer science, and algorithms for their efficient processing are fundamental in many applications. The results introduced in this work contribute theoretical improvements and practical advances to the construction of full-text indexes. Our first contribution is an in-place algorithm that computes the Burrows-Wheeler transform and the longest common prefix (LCP) array. Our second contribution is the construction of the suffix array augmented with the LCP array in optimal time and space for strings from constant-size alphabets. Our third contribution is a set of algorithms to construct full-text indexes for string collections within optimal theoretical bounds. This work is an extended abstract of the Ph.D. thesis of the first author.
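For context, the following naive Python reference constructs the three structures discussed: the suffix array, the LCP array via Kasai's algorithm, and the BWT read off the suffix array. These are simple baselines useful for checking outputs, not the in-place, optimal time-and-space algorithms contributed by the thesis.

```python
# Naive reference construction of the suffix array, LCP array, and BWT.
# Simple O(n^2 log n) / O(n) baselines, not the thesis' optimal algorithms.

def suffix_array(s):
    """Sort suffix start positions lexicographically."""
    return sorted(range(len(s)), key=lambda i: s[i:])

def lcp_array(s, sa):
    """Kasai et al.: LCP[r] = common-prefix length of sa[r] and sa[r-1]."""
    n = len(s)
    rank = [0] * n
    for r, i in enumerate(sa):
        rank[i] = r
    lcp, h = [0] * n, 0
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]
            while i + h < n and j + h < n and s[i + h] == s[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h:
                h -= 1
        else:
            h = 0
    return lcp

def bwt_from_sa(s, sa):
    """BWT is the character preceding each sorted suffix (cyclically)."""
    return "".join(s[i - 1] if i else s[-1] for i in sa)


if __name__ == "__main__":
    text = "banana$"                 # '$' as the usual end-of-string sentinel
    sa = suffix_array(text)
    print(sa)                        # [6, 5, 3, 1, 0, 4, 2]
    print(lcp_array(text, sa))       # [0, 0, 1, 3, 0, 0, 2]
    print(bwt_from_sa(text, sa))     # annb$aa
```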

