burrows wheeler transform Latest Research Papers

Short-read genome alignment is a fundamental computational step used in many bioinformatic analyses. It is therefore desirable to align such data as fast as possible. Most alignment algorithms use a seed-and-extend approach. Several popular programs perform the seeding step based on the Burrows-Wheeler Transform with a low memory footprint, but they are relatively slow compared to more recent approaches that use a minimizer-based seeding-and-chaining strategy. Recently, syncmers and strobemers were proposed for sequence comparison. Both protocols were designed for improved conservation of matches between sequences under mutations. Syncmers are a thinning protocol proposed as an alternative to minimizers, while strobemers are a linking protocol for gapped sequences and were proposed as an alternative to k-mers. The main contribution in this work is a new seeding approach that combines syncmers and strobemers. We use a strobemer protocol (randstrobes) to link together syncmers (i.e., in syncmer-space) instead of over the original sequence. Our protocol allows us to create longer seeds while preserving mapping accuracy. A longer seed length reduces the number of candidate regions, which allows faster mapping and alignment. We also contribute the insight that, speed-wise, this protocol is particularly effective when syncmers are canonical. Canonical syncmers can be created for specific parameter combinations and reduce the computational burden of computing the non-canonical randstrobes in reverse complement. We implement our idea in a proof-of-concept short-read aligner, strobealign, that aligns short reads 3-4x faster than minimap2 and 15-23x faster than BWA and Bowtie2. Many implementations of, e.g., BWA achieve high speed on specific hardware. Our contribution is algorithmic and requires no hardware architecture or system-specific instructions. Strobealign is available at https://github.com/ksahlin/StrobeAlign.
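
To make the thinning step concrete, here is a minimal sketch of open-syncmer selection in Python. It is not strobealign's implementation: the function name `open_syncmers`, the parameters `k`, `s`, and `t`, and the use of plain lexicographic order in place of a hash function are assumptions made for illustration. A k-mer is kept when its smallest s-mer starts at offset `t`; the randstrobe linking step is not shown.

```python
def open_syncmers(seq, k, s, t=0):
    """Return (position, k-mer) pairs whose smallest s-mer starts at offset t.

    Assumption: s-mers are compared lexicographically; real tools
    typically compare hash values instead.
    """
    selected = []
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # All s-mers contained in this k-mer, in left-to-right order.
        smers = [kmer[j:j + s] for j in range(k - s + 1)]
        # list.index returns the leftmost position of the minimal s-mer.
        if smers.index(min(smers)) == t:
            selected.append((i, kmer))
    return selected


if __name__ == "__main__":
    for position, kmer in open_syncmers("ACGTACGT", k=4, s=2):
        print(position, kmer)
```

Because selection depends only on the content of each k-mer, the same k-mer is either always kept or always dropped, wherever it occurs in a sequence.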

Computation of the suffix array, burrows-wheeler transform and FM-index in V-order

Theoretical Computer Science ◽

10.1016/j.tcs.2021.06.004 ◽

2021 ◽

Author(s):

Jacqueline W. Daykin ◽

Neerja Mhaskar ◽

W.F. Smyth

Keyword(s):

Suffix Array ◽

Burrows Wheeler Transform

Simplitigs as an efficient and scalable representation of de Bruijn graphs

Genome Biology ◽

10.1186/s13059-021-02297-z ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Karel Břinda ◽

Michael Baym ◽

Gregory Kucherov

Keyword(s):

Fast Algorithm ◽

Large Scale ◽

Essential Role ◽

Substantial Improvement ◽

Model Organisms ◽

Sequence Length ◽

De Bruijn Graphs ◽

De Bruijn ◽

Burrows Wheeler Transform

Abstract: de Bruijn graphs play an essential role in bioinformatics, yet they lack a universal scalable representation. Here, we introduce simplitigs as a compact, efficient, and scalable representation, and ProphAsm, a fast algorithm for their computation. For the example of assemblies of model organisms and two bacterial pan-genomes, we compare simplitigs to unitigs, the best existing representation, and demonstrate that simplitigs provide a substantial improvement in the cumulative sequence length and their number. When combined with the commonly used Burrows-Wheeler Transform index, simplitigs reduce memory, and index loading and query times, as demonstrated with large-scale examples of GenBank bacterial pan-genomes.
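
As a rough illustration of how a k-mer set can be covered by a few long strings, the following Python sketch greedily extends a seed k-mer in both directions until no unused neighbor remains. It is a simplified toy, not ProphAsm itself: the names `greedy_simplitigs` and `_extend` are hypothetical, reverse complements are ignored, seeds are picked with `min` only to make the output deterministic, and `k >= 2` is assumed.

```python
def kmers_of(seq, k):
    """Return the set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _extend(contig, remaining, k, forward):
    """Greedily grow contig by one base at a time while an unused k-mer fits."""
    while True:
        for c in "ACGT":
            cand = contig[-(k - 1):] + c if forward else c + contig[:k - 1]
            if cand in remaining:
                remaining.remove(cand)
                contig = contig + c if forward else c + contig
                break
        else:
            # No base extended the contig: stop.
            return contig


def greedy_simplitigs(kmers):
    """Cover a k-mer set with strings so that each k-mer occurs exactly once."""
    remaining = set(kmers)
    result = []
    while remaining:
        seed = min(remaining)  # arbitrary choice, fixed here for determinism
        remaining.remove(seed)
        k = len(seed)
        contig = _extend(seed, remaining, k, forward=True)
        contig = _extend(contig, remaining, k, forward=False)
        result.append(contig)
    return result


if __name__ == "__main__":
    print(greedy_simplitigs(kmers_of("ACGTTGCA", 3)))
```

Each output string of length n contributes n - k + 1 k-mers, so fewer and longer strings mean a shorter cumulative sequence length for the same k-mer set.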

d-PBWT: dynamic positional Burrows-Wheeler transform

Bioinformatics ◽

10.1093/bioinformatics/btab117 ◽

2021 ◽

Author(s):

Ahsan Sanaullah ◽

Degui Zhi ◽

Shaojie Zhang

Keyword(s):

Data Structure ◽

Genotype Imputation ◽

Supplementary Information ◽

Worst Case ◽

Average Case ◽

Insertion And Deletion ◽

Static Data ◽

Efficient Retrieval ◽

Dynamic Data Structure ◽

Burrows Wheeler Transform

Abstract. Motivation: Durbin’s positional Burrows-Wheeler transform (PBWT) is a scalable data structure for haplotype matching. It has been successfully applied to identical by descent (IBD) segment identification and genotype imputation. Once the PBWT of a haplotype panel is constructed, it supports efficient retrieval of all shared long segments among all individuals (long matches) and efficient query between an external haplotype and the panel. However, the standard PBWT is an array-based static data structure and does not support dynamic updates of the panel. Results: Here, we generalize the static PBWT to a dynamic data structure, d-PBWT, where the reverse prefix sorting at each position is stored with linked lists. We also developed efficient algorithms for insertion and deletion of individual haplotypes. In addition, we verified that d-PBWT can support all algorithms of PBWT. In doing so, we systematically investigated variations of set maximal match and long match query algorithms: while they all have average case time complexity independent of database size, they have different worst case complexities and dependencies on additional data structures. Availability: The benchmarking code is available at genome.ucf.edu/d-PBWT. Supplementary information: Supplementary Materials are available at Bioinformatics online.
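
The reverse prefix sorting mentioned above can be sketched for the static, array-based case as follows. This is a minimal Python illustration under stated assumptions, not the d-PBWT code: the function name `pbwt_prefix_arrays` is hypothetical, haplotypes are assumed to be equal-length lists of 0/1 alleles, and the linked-list storage and the insertion and deletion algorithms of d-PBWT are not shown.

```python
def pbwt_prefix_arrays(haplotypes):
    """Return the haplotype order at every site, sorted by reversed prefix.

    arrays[k] lists haplotype indices sorted by the reversal of their
    first k alleles; arrays[0] is simply the input order.
    """
    m = len(haplotypes)
    n = len(haplotypes[0]) if m else 0
    order = list(range(m))
    arrays = [order[:]]
    for k in range(n):
        # Stable partition on the allele at site k: haplotypes carrying 0
        # come first, and the previous relative order is kept in each group.
        zeros = [i for i in order if haplotypes[i][k] == 0]
        ones = [i for i in order if haplotypes[i][k] == 1]
        order = zeros + ones
        arrays.append(order[:])
    return arrays


if __name__ == "__main__":
    panel = [[0, 1, 0], [1, 1, 0], [0, 0, 1], [1, 0, 1]]
    for k, order in enumerate(pbwt_prefix_arrays(panel)):
        print(k, order)
```

Each site costs one linear pass over the panel, so haplotypes that share a long match ending at a site end up adjacent in that site's order.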

Burrows-Wheeler transform acceleration based on CUDA

Proceedings of International Conference on Artificial Life and Robotics ◽

10.5954/icarob.2021.os13-2 ◽

2021 ◽

Vol 26 ◽

pp. 596-599

Author(s):

Chang Sheng ◽

Fengzhi Dai

Keyword(s):

Burrows Wheeler Transform

Lossless text compression using GPT-2 language model and Huffman coding

SHS Web of Conferences ◽

10.1051/shsconf/202110204013 ◽

2021 ◽

Vol 102 ◽

pp. 04013

Author(s):

Md. Atiqur Rahman ◽

Mohamed Hamada

Keyword(s):

Data Compression ◽

State Of The Art ◽

Language Model ◽

Huffman Coding ◽

Original Text ◽

Text Compression ◽

Compression Technique ◽

Daily Life Activities ◽

Burrows Wheeler Transform ◽

Compressed Data

With the advancement of telecommunication, modern daily-life activities produce large amounts of information. Storing this information on a digital device or transmitting it over the Internet is challenging, which makes data compression necessary. Research on data compression has therefore become a topic of great interest. Because compressed data is generally smaller than the original, data compression saves storage and increases transmission speed. In this article, we propose a text compression technique using the GPT-2 language model and Huffman coding. In the proposed method, the Burrows-Wheeler transform and a list of keys are used to reduce the original text file’s length. We then apply the GPT-2 language model, followed by Huffman coding, for encoding. The proposed method is compared with state-of-the-art techniques for text compression, and we show that it achieves a gain in compression ratio over them.
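
For readers unfamiliar with the transform itself, here is a minimal Python sketch of the Burrows-Wheeler transform and its inverse. It is not the pipeline from the article above: the function names, the `$` end-of-text sentinel, and the naive sorted-rotations construction are assumptions chosen for clarity rather than speed. `count_runs` reports how many maximal runs of equal characters the output contains, which indicates how well it would suit run-length-style coding.

```python
from itertools import groupby


def bwt(text, sentinel="$"):
    """Burrows-Wheeler transform via sorted rotations (quadratic memory)."""
    if sentinel in text:
        raise ValueError("text must not contain the sentinel character")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The transform is the last column of the sorted rotation table.
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(transformed, sentinel="$"):
    """Rebuild the original text by repeatedly prepending and sorting."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        table = sorted(c + row for c, row in zip(transformed, table))
    # The original rotation is the one that ends with the sentinel.
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]


def count_runs(s):
    """Number of maximal runs of identical consecutive characters."""
    return sum(1 for _ in groupby(s))


if __name__ == "__main__":
    encoded = bwt("banana")
    print(encoded, count_runs(encoded), inverse_bwt(encoded))
```

The transform is a permutation of the input plus one sentinel, so it does not shorten the text by itself; it groups equal characters so that a later coding stage can exploit them.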

Novel Results on the Number of Runs of the Burrows-Wheeler-Transform

SOFSEM 2021: Theory and Practice of Computer Science - Lecture Notes in Computer Science ◽

10.1007/978-3-030-67731-2_18 ◽

2021 ◽

pp. 249-262

Author(s):

Sara Giuliani ◽

Shunsuke Inenaga ◽

Zsuzsanna Lipták ◽

Nicola Prezza ◽

Marinella Sciortino ◽

...

Keyword(s):

Burrows Wheeler Transform

Fast and robust identity-by-descent inference with the templated positional Burrows-Wheeler transform

Molecular Biology and Evolution ◽

10.1093/molbev/msaa328 ◽

2020 ◽

Author(s):

William A Freyman ◽

Kimberly F McManus ◽

Suyash S Shringarpure ◽

Ethan M Jewett ◽

Katarzyna Bryc ◽

...

Keyword(s):

Isolation By Distance ◽

False Negative ◽

Segment Length ◽

Data Sets ◽

Haplotype Sharing ◽

Binary File ◽

Inference Algorithms ◽

Out Of Sample ◽

Massive Scale ◽

Burrows Wheeler Transform

Abstract: Estimating the genomic location and length of identical-by-descent (IBD) segments among individuals is a crucial step in many genetic analyses. However, the exponential growth in the size of biobank and direct-to-consumer (DTC) genetic data sets makes accurate IBD inference a significant computational challenge. Here we present the templated positional Burrows-Wheeler transform (TPBWT) to make fast IBD estimates robust to genotype and phasing errors. Using haplotype data simulated over pedigrees with realistic genotyping and phasing errors, we show that the TPBWT outperforms other state-of-the-art IBD inference algorithms in terms of speed and accuracy. For each phase-aware method, we explore the false positive and false negative rates of inferring IBD by segment length and characterize the types of error commonly found. Our results highlight the fragility of most phased IBD inference methods; the accuracy of IBD estimates can be highly sensitive to the quality of haplotype phasing. Additionally, we compare the performance of the TPBWT against a widely used phase-free IBD inference approach that is robust to phasing errors. We introduce both in-sample and out-of-sample TPBWT-based IBD inference algorithms and demonstrate their computational efficiency on massive-scale datasets with millions of samples. Furthermore, we describe the binary file format for TPBWT-compressed haplotypes that results in fast and efficient out-of-sample IBD computes against very large cohort panels. Finally, we demonstrate the utility of the TPBWT in a brief empirical analysis exploring geographic patterns of haplotype sharing within Mexico. Hierarchical clustering of IBD shared across regions within Mexico reveals geographically structured haplotype sharing and a strong signal of isolation by distance. Our software implementation of the TPBWT is freely available for non-commercial use in the code repository https://github.com/23andMe/phasedibd.

burrows wheeler transform
Recently Published Documents

Integrative Comparison of Burrows-Wheeler Transform-Based Mapping algorithm with de Bruijn graph for Identification of Lung/Liver Cancer-specific Gene

An effective image compression technique based on burrows wheeler transform with set partitioning in hierarchical trees

Faster short-read mapping with strobemer seeds in syncmer space

Computation of the suffix array, burrows-wheeler transform and FM-index in V-order

Simplitigs as an efficient and scalable representation of de Bruijn graphs

d-PBWT: dynamic positional Burrows-Wheeler transform

Burrows-Wheeler transform acceleration based on CUDA

Lossless text compression using GPT-2 language model and Huffman coding

Novel Results on the Number of Runs of the Burrows-Wheeler-Transform

Fast and robust identity-by-descent inference with the templated positional Burrows-Wheeler transform
