Multithread Multistring Burrows–Wheeler Transform and Longest Common Prefix Array

Abstract

Background: The construction of a suffix array for a collection of strings is a fundamental task in bioinformatics and in many other applications that process strings. Related data structures, such as the longest common prefix (LCP) array, the Burrows–Wheeler transform, and the document array, are often needed alongside the suffix array to solve a wide variety of problems efficiently. While several algorithms have been proposed to construct the suffix array for a single string, less emphasis has been put on algorithms that construct suffix arrays for string collections.

Results: In this paper we introduce an open source tool for constructing the suffix array and related indexing data structures for a string collection with N symbols in O(N) time. The tool is based on the algorithm gSACA-K (Louza et al. in Theor Comput Sci 678:22–39, 2017), the fastest algorithm to construct suffix arrays for string collections. It supports large FASTA, FASTQ, and text files with multiple strings as input. Experiments have shown very good performance on different types of strings.

Conclusions: The tool is fast, portable, and lightweight, and constructs the suffix array and additional data structures for string collections.
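To make the four structures named in the abstract concrete, the following is a minimal Python sketch that builds them for a tiny collection. It is not gSACA-K and does not run in O(N) time: it sorts suffixes naively, which is only practical for toy inputs. The function name `build_index`, the choice of giving string number d the terminator code d (so terminators are distinct and sort before every character), and the use of `$` to display terminators in the BWT are all assumptions made for illustration.

```python
def build_index(strings):
    """Naively build SA, LCP, BWT and DA for a small string collection."""
    k = len(strings)
    text, doc = [], []
    for d, s in enumerate(strings):
        # Shift character codes by k so that terminators 0..k-1 sort first.
        text.extend(ord(c) + k for c in s)
        text.append(d)                  # distinct terminator for string d
        doc.extend([d] * (len(s) + 1))  # owning string of each position
    n = len(text)

    # Suffix array: positions sorted by the suffix that starts there.
    sa = sorted(range(n), key=lambda i: text[i:])

    # LCP array: common prefix length of suffixes adjacent in SA order.
    lcp = [0] * n
    for r in range(1, n):
        i, j = sa[r - 1], sa[r]
        h = 0
        while i + h < n and j + h < n and text[i + h] == text[j + h]:
            h += 1
        lcp[r] = h

    def symbol(code):
        return "$" if code < k else chr(code - k)

    # BWT: symbol preceding each suffix (text[-1] wraps around for i == 0).
    bwt = "".join(symbol(text[i - 1]) for i in sa)
    # Document array: which input string each suffix belongs to.
    da = [doc[i] for i in sa]
    return sa, lcp, bwt, da


sa, lcp, bwt, da = build_index(["ab", "b"])
print(sa, lcp, bwt, da)
```

For the collection `["ab", "b"]` the concatenated text has five symbols, and the two suffixes starting with `b` end up adjacent in the suffix array with an LCP of 1; the document array tells them apart.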

Computing the longest common prefix array based on the Burrows–Wheeler transform

Journal of Discrete Algorithms, 2013, Vol 18, pp. 22-31

DOI: 10.1016/j.jda.2012.07.007

Cited By ~ 42

Author(s): Timo Beller, Simon Gog, Enno Ohlebusch, Thomas Schnattinger

Keyword(s): Common Prefix, Burrows Wheeler Transform, Longest Common Prefix Array

Computing the Longest Common Prefix Array Based on the Burrows-Wheeler Transform

String Processing and Information Retrieval - Lecture Notes in Computer Science, 2011, pp. 197-208

DOI: 10.1007/978-3-642-24583-1_20

Cited By ~ 6

Author(s): Timo Beller, Simon Gog, Enno Ohlebusch, Thomas Schnattinger

Keyword(s): Common Prefix, Burrows Wheeler Transform, Longest Common Prefix Array

Engineering Augmented Suffix Sorting Algorithms

2018

DOI: 10.5753/ctd.2018.3652

Author(s): Felipe A. Louza, Guilherme P. Telles, Simon Gog

Keyword(s): Computer Science, Full Text, Suffix Array, Optimal Time, Time And Space, Sorting Algorithms, Constant Size, Common Prefix, Efficient Processing, Burrows Wheeler Transform

Strings are prevalent in Computer Science, and algorithms for their efficient processing are fundamental in various applications. The results introduced in this work contribute theoretical improvements and practical advances in building full-text indexes. Our first contribution is an in-place algorithm that computes the Burrows-Wheeler transform and the longest common prefix (LCP) array. Our second contribution is the construction of the suffix array augmented with the LCP array in optimal time and space for strings from constant-size alphabets. Our third contribution is a set of algorithms to construct full-text indexes for string collections in optimal theoretical bounds. This work is an extended abstract of the Ph.D. thesis of the first author.
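As background for what a suffix array augmented with the LCP array looks like, here is a short Python sketch. It does not reproduce the algorithms of this work: the suffix array is built by naive sorting, and the LCP array is then derived with the well-known linear-time method of Kasai et al., swapped in here purely for illustration. The function names `suffix_array` and `kasai_lcp` are hypothetical.

```python
def suffix_array(s):
    """Naive suffix array by sorting suffixes; fine for short strings only."""
    return sorted(range(len(s)), key=lambda i: s[i:])


def kasai_lcp(s, sa):
    """LCP array from s and its suffix array in O(n) time (Kasai et al.).

    lcp[r] is the length of the longest common prefix of the suffixes
    at ranks r-1 and r; lcp[0] is 0 by convention.
    """
    n = len(s)
    rank = [0] * n
    for r, i in enumerate(sa):
        rank[i] = r

    lcp = [0] * n
    h = 0  # length of the match carried over from the previous suffix
    for i in range(n):
        if rank[i] == 0:
            h = 0
            continue
        j = sa[rank[i] - 1]  # suffix just before suffix i in sorted order
        while i + h < n and j + h < n and s[i + h] == s[j + h]:
            h += 1
        lcp[rank[i]] = h
        if h > 0:
            h -= 1  # dropping the first character loses at most one match
    return lcp


sa = suffix_array("banana")
print(sa, kasai_lcp("banana", sa))
```

The key observation is that when moving from the suffix at position i to the one at position i + 1, the LCP value can shrink by at most one, so the total number of character comparisons stays linear.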

Permuted Longest-Common-Prefix Array

Combinatorial Pattern Matching - Lecture Notes in Computer Science, 2009, pp. 181-192

DOI: 10.1007/978-3-642-02441-2_17

Cited By ~ 63

Author(s): Juha Kärkkäinen, Giovanni Manzini, Simon J. Puglisi

Keyword(s): Common Prefix, Longest Common Prefix Array

Space-Time Tradeoffs for Longest-Common-Prefix Array Computation

Algorithms and Computation - Lecture Notes in Computer Science, 2008, pp. 124-135

DOI: 10.1007/978-3-540-92182-0_14

Cited By ~ 36

Author(s): Simon J. Puglisi, Andrew Turpin

Keyword(s): Space Time, Common Prefix, Longest Common Prefix Array

Extracting the Sparse Longest Common Prefix Array from the Suffix Binary Search Tree

2021, pp. 143-150

DOI: 10.1007/978-3-030-86692-1_12

Author(s): Tomohiro I, Robert W. Irving, Dominik Köppl, Lorna Love

Keyword(s): Search Tree, Binary Search, Binary Search Tree, Common Prefix, Longest Common Prefix Array

The Colored Longest Common Prefix Array Computed via Sequential Scans

String Processing and Information Retrieval - Lecture Notes in Computer Science, 2018, pp. 153-167

DOI: 10.1007/978-3-030-00479-8_13

Cited By ~ 2

Author(s): Fabio Garofalo, Giovanna Rosone, Marinella Sciortino, Davide Verzotto

Keyword(s): Common Prefix, Longest Common Prefix Array

Large-scale detection of repetitions

Philosophical Transactions of The Royal Society A Mathematical Physical and Engineering Sciences, 2014, Vol 372 (2016), pp. 20130138

DOI: 10.1098/rsta.2013.0138

Cited By ~ 1

Author(s): W. F. Smyth

Keyword(s): Large Scale, Combinatorics On Words, Brute Force, Expected Number, String Length, Worst Case, Common Prefix, Scale Detection, Longest Common Prefix Array, Global Data

Combinatorics on words began more than a century ago with a demonstration that an infinitely long string with no repetitions could be constructed on an alphabet of only three letters. Computing all the repetitions (such as ⋯TTT⋯ or ⋯CGACGA⋯) in a given string x of length n is one of the oldest and most important problems of computational stringology, requiring time in the worst case. About a dozen years ago, it was discovered that repetitions can be computed as a by-product of the Θ(n)-time computation of all the maximal periodicities or runs in x. However, even though the computation is linear, it is also brute force: global data structures, such as the suffix array, the longest common prefix array and the Lempel–Ziv factorization, need to be computed in a preprocessing phase. Furthermore, all of this effort is required despite the fact that the expected number of runs in a string is generally a small fraction of the string length. In this paper, I explore the possibility that repetitions (perhaps also other regularities in strings) can be computed in a manner commensurate with the size of the output.
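To illustrate what a run (maximal periodicity) is, the sketch below enumerates runs by direct scanning. It is emphatically not the Θ(n)-time computation the abstract refers to, and it uses none of the suffix array, LCP array, or Lempel–Ziv machinery: it simply tries every period, so it is suitable only for short strings. The function names `smallest_period` and `runs`, and the choice to report each run as a triple (start, end, period) with inclusive 0-based indices, are assumptions made for this example.

```python
def smallest_period(w):
    """Smallest p such that w[k] == w[k + p] for every valid k."""
    n = len(w)
    for p in range(1, n + 1):
        if all(w[k] == w[k + p] for k in range(n - p)):
            return p
    return n


def runs(x):
    """Brute-force list of runs in x as (start, end, period) triples.

    A run is a substring whose smallest period p fits at least twice
    and that cannot be extended left or right with the same period.
    """
    n = len(x)
    found = []
    for p in range(1, n // 2 + 1):
        k = 0
        while k + p < n:
            if x[k] != x[k + p]:
                k += 1
                continue
            start = k
            # Extend the stretch of positions that match p places ahead.
            while k + p < n and x[k] == x[k + p]:
                k += 1
            end = k + p - 1  # inclusive end of the periodic substring
            long_enough = end - start + 1 >= 2 * p
            if long_enough and smallest_period(x[start:end + 1]) == p:
                found.append((start, end, p))
    return sorted(found)


print(runs("CGACGA"))
print(runs("mississippi"))
```

On the abstract's own examples, `TTT` is a single run with period 1 and `CGACGA` is a single run with period 3; the check against `smallest_period` prevents the same substring from being reported again under a multiple of its true period.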

Sampled Longest Common Prefix Array

Combinatorial Pattern Matching - Lecture Notes in Computer Science, 2010, pp. 227-237

DOI: 10.1007/978-3-642-13509-5_21

Cited By ~ 26

Author(s): Jouni Sirén

Keyword(s): Common Prefix, Longest Common Prefix Array
