Multithread Multistring Burrows–Wheeler Transform and Longest Common Prefix Array

2019 ◽  
Vol 26 (9) ◽  
pp. 948-961 ◽  
Author(s):  
Paola Bonizzoni ◽  
Gianluca Della Vedova ◽  
Yuri Pirola ◽  
Marco Previtali ◽  
Raffaella Rizzi
2020 ◽  
Vol 15 (1) ◽  
Author(s):  
Felipe A. Louza ◽  
Guilherme P. Telles ◽  
Simon Gog ◽  
Nicola Prezza ◽  
Giovanna Rosone

Abstract. Background: The construction of a suffix array for a collection of strings is a fundamental task in bioinformatics and in many other applications that process strings. Related data structures, such as the longest common prefix (LCP) array, the Burrows–Wheeler transform (BWT), and the document array, are often needed alongside the suffix array to solve a wide variety of problems efficiently. While several algorithms have been proposed to construct the suffix array for a single string, less emphasis has been put on algorithms to construct suffix arrays for string collections. Results: In this paper we introduce , an open-source tool for constructing the suffix array and related indexing data structures for a string collection with N symbols in O(N) time. Our tool is written in and is based on the algorithm gSACA-K (Louza et al. in Theor Comput Sci 678:22–39, 2017), the fastest algorithm to construct suffix arrays for string collections. The tool supports large FASTA, FASTQ, and text files with multiple strings as input. Experiments have shown very good performance on different types of strings. Conclusions: is a fast, portable, and lightweight tool for constructing the suffix array and additional data structures for string collections.
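As a rough illustration of the structures named in this abstract, the following Python sketch builds the suffix array, LCP array, BWT, and document array of a small string collection by naive sorting. It is not gSACA-K and does not run in O(N) time. The function name `build_index` and the terminator convention (each string ends with its own terminator, terminators sort below every symbol and are ordered by string index) are assumptions made for this example only.

```python
def build_index(strings):
    """Naive generalized suffix array, LCP, BWT and document array.

    Assumed convention (illustrative only): string k ends with its own
    terminator (0, k); terminators sort below all symbols (1, ch).
    """
    text, doc = [], []
    for k, s in enumerate(strings):
        text.extend((1, ch) for ch in s)   # ordinary symbols
        text.append((0, k))                # unique terminator of string k
        doc.extend([k] * (len(s) + 1))     # owner of each text position
    n = len(text)

    # Suffix array: suffix start positions in lexicographic order.
    # Sorting full suffixes is quadratic or worse; fine for a toy input.
    sa = sorted(range(n), key=lambda i: text[i:])

    # LCP array: lcp[r] is the common prefix length of the suffixes
    # at ranks r-1 and r, with lcp[0] = 0.
    lcp = [0] * n
    for r in range(1, n):
        a, b, h = sa[r - 1], sa[r], 0
        while a + h < n and b + h < n and text[a + h] == text[b + h]:
            h += 1
        lcp[r] = h

    # BWT: the symbol preceding each sorted suffix (position 0 wraps
    # around to the last terminator). Terminators are printed as '$'.
    bwt = "".join(c if kind == 1 else "$"
                  for kind, c in (text[i - 1] for i in sa))

    # Document array: which input string each sorted suffix starts in.
    da = [doc[i] for i in sa]
    return sa, lcp, bwt, da


sa, lcp, bwt, da = build_index(["ab", "a"])
print(sa, lcp, bwt, da)
```

For the collection ["ab", "a"] the concatenated text has five positions, and the sketch returns one entry per position in each of the four arrays.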


2013 ◽  
Vol 18 ◽  
pp. 22-31 ◽  
Author(s):  
Timo Beller ◽  
Simon Gog ◽  
Enno Ohlebusch ◽  
Thomas Schnattinger

2018 ◽  
Author(s):  
Felipe A. Louza ◽  
Guilherme P. Telles ◽  
Simon Gog

Strings are prevalent in Computer Science, and algorithms for their efficient processing are fundamental in various applications. The results introduced in this work contribute theoretical improvements and practical advances in building full-text indexes. Our first contribution is an in-place algorithm that computes the Burrows–Wheeler transform and the longest common prefix (LCP) array. Our second contribution is the construction of the suffix array augmented with the LCP array in optimal time and space for strings from constant-size alphabets. Our third contribution is a set of algorithms to construct full-text indexes for string collections within optimal theoretical bounds. This work is an extended abstract of the Ph.D. thesis of the first author.


Author(s):  
Juha Kärkkäinen ◽  
Giovanni Manzini ◽  
Simon J. Puglisi

2021 ◽  
pp. 143-150
Author(s):  
Tomohiro I ◽  
Robert W. Irving ◽  
Dominik Köppl ◽  
Lorna Love

Author(s):  
W. F. Smyth

Combinatorics on words began more than a century ago with a demonstration that an infinitely long string with no repetitions could be constructed on an alphabet of only three letters. Computing all the repetitions (such as ⋯TTT⋯ or ⋯CGACGA⋯) in a given string x of length n is one of the oldest and most important problems of computational stringology, requiring Θ(n log n) time in the worst case. About a dozen years ago, it was discovered that repetitions can be computed as a by-product of the Θ(n)-time computation of all the maximal periodicities or runs in x. However, even though the computation is linear, it is also brute force: global data structures, such as the suffix array, the longest common prefix array and the Lempel–Ziv factorization, need to be computed in a preprocessing phase. Furthermore, all of this effort is required despite the fact that the expected number of runs in a string is generally a small fraction of the string length. In this paper, I explore the possibility that repetitions (perhaps also other regularities in strings) can be computed in a manner commensurate with the size of the output.
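To make the notion of a run concrete, here is a small quadratic-time Python sketch that lists the maximal periodicities of a string directly from the definition. It is neither the Θ(n)-time method mentioned in the abstract nor the output-sensitive approach the author explores; the function name `runs` and the (start, end, period) output format are assumptions for illustration.

```python
def runs(x):
    """Return all runs of x as sorted (start, end, period) triples.

    A run is a substring x[start..end] (inclusive) with smallest period p,
    of length at least 2p, that cannot be extended left or right without
    breaking period p. Quadratic-time sketch, for illustration only.
    """
    n = len(x)
    found = {}                      # (start, end) -> smallest period
    for p in range(1, n // 2 + 1):  # periods are tried in increasing order
        i = 0
        while i + p < n:
            if x[i] != x[i + p]:
                i += 1
                continue
            # Extend the stretch of positions k with x[k] == x[k + p].
            j = i
            while j + p < n and x[j] == x[j + p]:
                j += 1
            # x[i..j+p-1] has period p and is maximal on both sides.
            if j - i >= p:          # length (j - i + p) is at least 2p
                # If this interval was already stored, it has a smaller
                # period, so keep the first (smallest) one.
                found.setdefault((i, j + p - 1), p)
            i = j
    return sorted((s, e, p) for (s, e), p in found.items())


print(runs("mississippi"))
```

On "mississippi" the sketch reports the run "ississi" with period 3 together with the three period-1 runs "ss", "ss" and "pp".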

