Suffix Trees: Latest Research Papers

Abstract

Background: Alignment-free methods are a popular approach for comparing biological sequences, including complete genomes. The methods range from probability distributions of sequence composition to first- and higher-order Markov chains, where a k-th order Markov chain over DNA has $4^k$ formal parameters. To circumvent this exponential growth in parameters, variable-length Markov chains (VLMCs) have gained popularity for applications in molecular biology and other areas. VLMCs adapt the depth depending on sequence context and thus curtail excesses in the number of parameters. The scarcity of fast, let alone parallel, software tools prompted the development of a parallel implementation using lazy suffix trees and a hash-based alternative.

Results: An extensive evaluation was performed on genomes ranging from 12 Mbp to 22 Gbp. Relevant learning parameters were chosen, guided by the Bayesian Information Criterion (BIC), to avoid over-fitting. Our implementation greatly improves upon the state of the art even in serial execution. It exhibits very good parallel scaling, with speed-ups for long sequences close to the optimum indicated by Amdahl’s law (3 for 4 threads and about 6 for 16 threads).

Conclusions: Our parallel implementation, released as open source under the GPLv3 license, provides a practically useful alternative to the state of the art, allowing the construction of VLMCs even for very large genomes significantly faster than previously possible. Additionally, our parameter selection based on BIC gives guidance to end users comparing genomes.
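As a rough consistency check on the scaling figures quoted above, the sketch below applies the textbook form of Amdahl's law, S(n) = 1 / ((1 - p) + p/n), where p is the parallelisable fraction of the work. The abstract does not state p; it is inferred here under the assumption that the bound of 3 at 4 threads is exact, and the function names are illustrative only.

```python
def amdahl_speedup(parallel_fraction: float, threads: int) -> float:
    """Upper bound on speed-up when only `parallel_fraction` of the work scales."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / threads)


def parallel_fraction_from_speedup(speedup: float, threads: int) -> float:
    """Invert Amdahl's law: the parallel fraction implied by a given bound."""
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / threads)


# Assumption: treat "3 for 4 threads" as an exact Amdahl bound.
p = parallel_fraction_from_speedup(3.0, 4)

print(f"implied parallel fraction: {p:.4f}")                    # 8/9
print(f"bound for 16 threads: {amdahl_speedup(p, 16):.2f}")     # 6.00
```

Under that assumption the implied parallel fraction is 8/9, which yields a bound of 6 at 16 threads, in line with the "about 6" quoted in the abstract.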


Complex event forecasting with prediction suffix trees

The VLDB Journal ◽ 10.1007/s00778-021-00698-x ◽ 2021

Author(s): Elias Alevizos ◽ Alexander Artikis ◽ Georgios Paliouras

Keyword(s): Suffix Trees ◽ Event Forecasting


Reversed Lempel–Ziv Factorization with Suffix Trees

Algorithms ◽ 10.3390/a14060161 ◽ 2021 ◽ Vol 14 (6) ◽ pp. 161

Author(s): Dominik Köppl

Keyword(s): Suffix Tree ◽ Linear Time ◽ Suffix Trees ◽ Tree Representations ◽ Linear Time Algorithms

We present linear-time algorithms computing the reversed Lempel–Ziv factorization [Kolpakov and Kucherov, TCS’09] within the space bounds of two different suffix tree representations. We can adapt these algorithms to compute the longest previous non-overlapping reverse factor table [Crochemore et al., JDA’12] within the same space but pay a multiplicative logarithmic time penalty.
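To make the object being computed concrete, here is a naive sketch of the reversed Lempel–Ziv factorization. It assumes the usual definition: each factor is either the first occurrence of a character or the longest prefix of the remaining text that occurs in the reverse of the already factorized prefix. This is a brute-force illustration, not the linear-time suffix tree algorithm of the paper, and the function name is hypothetical.

```python
def reversed_lz_factorization(text: str) -> list[str]:
    """Brute-force reversed LZ factorization (far from linear time)."""
    factors = []
    i, n = 0, len(text)
    while i < n:
        # Reverse of everything factorized so far.
        reversed_prefix = text[:i][::-1]
        length = 0
        # Grow the factor while it still occurs in the reversed prefix.
        while i + length < n and text[i:i + length + 1] in reversed_prefix:
            length += 1
        # No match at all means a fresh character: emit it on its own.
        length = max(length, 1)
        factors.append(text[i:i + length])
        i += length
    return factors


print(reversed_lz_factorization("abbaab"))  # ['a', 'b', 'ba', 'ab']
```

In the example, the third factor "ba" is the longest prefix of "baab" found in the reverse of "ab", and the last factor "ab" is found in the reverse of "abba".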


A Character Frequency based Approach to Search for Substrings of a Circular Pattern and its Conjugates in an Online Text

Computer Science ◽ 10.7494/csci.2021.22.2.3401 ◽ 2021 ◽ Vol 22 (2)

Author(s): Vinod Prasad

Keyword(s): Computational Biology ◽ Data Structures ◽ Fundamental Problem ◽ Text Processing ◽ Suffix Trees ◽ Linear Pattern ◽ Circular Pattern ◽ String Processing ◽ Novel Method ◽ Processing Algorithms

A fundamental problem in computational biology is to deal with circular patterns. The problem consists of finding substrings of at least a certain length of a pattern and its rotations in a database. In this paper, a novel method is presented to deal with circular patterns. The problem is solved in two incremental steps. First, an algorithm is provided that reports all substrings of a given linear pattern in an online text. Next, without losing efficiency, the algorithm is extended to process all circular rotations of the pattern. For a given pattern P of size M and a text T of size N, the algorithm reports all locations in the text where a substring of Pc is found, where Pc is one of the rotations of P. For an alphabet of size σ, using O(M) space, the desired goals are achieved in O(MN/σ) time on average, which is O(N) for all patterns of length M ≤ σ. Traditional string processing algorithms make use of advanced data structures such as suffix trees and automata. We show that basic data structures such as arrays can be used in text processing algorithms without compromising efficiency.
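The problem statement can be illustrated with a deliberately simple baseline. It relies on the observation that every length-L substring of some rotation of P is a circular window of P, that is, a length-L slice of P concatenated with itself. This sketch is not the character-frequency algorithm of the paper: it uses a hash set of windows rather than O(M) space, and the function and parameter names are assumptions made for the example.

```python
def circular_substring_hits(pattern: str, text: str, min_len: int) -> list[int]:
    """Start positions in `text` where a length-`min_len` substring of some
    rotation of `pattern` occurs (naive hash-set baseline)."""
    m = len(pattern)
    if not 1 <= min_len <= m:
        raise ValueError("min_len must be between 1 and len(pattern)")
    doubled = pattern + pattern
    # All circular windows of the pattern with length min_len.
    windows = {doubled[j:j + min_len] for j in range(m)}
    # Slide a window over the text and test membership.
    return [i for i in range(len(text) - min_len + 1)
            if text[i:i + min_len] in windows]


print(circular_substring_hits("abc", "xcabx", 2))  # [1, 2]
```

Any longer match contains a match of length exactly `min_len`, so reporting the fixed-length windows is enough to locate all matches of at least that length.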


Faster repetition-aware compressed suffix trees based on Block Trees

Information and Computation ◽ 10.1016/j.ic.2021.104749 ◽ 2021 ◽ pp. 104749

Author(s): Manuel Cáceres ◽ Gonzalo Navarro

Keyword(s): Suffix Trees


Non-Overlapping LZ77 Factorization and LZ78 Substring Compression Queries with Suffix Trees

Algorithms ◽ 10.3390/a14020044 ◽ 2021 ◽ Vol 14 (2) ◽ pp. 44

Author(s): Dominik Köppl

Keyword(s): Suffix Tree ◽ Linear Time ◽ Suffix Trees ◽ Small Space ◽ Tree Representation ◽ Limited Space ◽ Tree Representations

We present algorithms computing the non-overlapping Lempel–Ziv-77 factorization and the longest previous non-overlapping factor table within small space in linear or near-linear time with the help of modern suffix tree representations fitting into limited space. With similar techniques, we show how to answer substring compression queries for the Lempel–Ziv-78 factorization with a possible logarithmic multiplicative slowdown depending on the suffix tree representation used.
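A brute-force sketch may help to show what "non-overlapping" means here. It assumes the usual definition: each factor is either the first occurrence of a character or the longest prefix of the remaining text that has an occurrence lying entirely inside the already factorized prefix. It is not the small-space suffix tree algorithm described in the abstract, and the function name is hypothetical.

```python
def nonoverlapping_lz77(text: str) -> list[str]:
    """Brute-force non-overlapping LZ77 factorization (far from linear time)."""
    factors = []
    i, n = 0, len(text)
    while i < n:
        length = 0
        # str.find(sub, 0, i) only accepts occurrences that end at or
        # before position i, so the source never overlaps the factor.
        while i + length < n and text.find(text[i:i + length + 1], 0, i) != -1:
            length += 1
        # No earlier occurrence means a fresh character.
        length = max(length, 1)
        factors.append(text[i:i + length])
        i += length
    return factors


print(nonoverlapping_lz77("aaaa"))  # ['a', 'a', 'aa']
print(nonoverlapping_lz77("abab"))  # ['a', 'b', 'ab']
```

On "aaaa", the classic overlapping LZ77 factorization would produce "a" followed by "aaa"; forbidding overlap caps each factor at the length of the prefix processed so far.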


Subpath Queries on Compressed Graphs: A Survey

Algorithms ◽ 10.3390/a14010014 ◽ 2021 ◽ Vol 14 (1) ◽ pp. 14

Author(s): Nicola Prezza

Keyword(s): Regular Languages ◽ Optimal Time ◽ Algorithmic Problem ◽ Suffix Trees ◽ Plain Text ◽ Labeled Graphs ◽ Directed Paths ◽ Finite State ◽ Recent Trends ◽ Year 2000

Text indexing is a classical algorithmic problem that has been studied for over four decades: given a text T, pre-process it off-line so that, later, we can quickly count and locate the occurrences of any string (the query pattern) in T in time proportional to the query’s length. The earliest optimal-time solution to the problem, the suffix tree, dates back to 1973 and requires up to two orders of magnitude more space than the plain text just to be stored. In the year 2000, two breakthrough works showed that efficient queries can be achieved without this space overhead: a fast index can be stored in space proportional to the text’s entropy. These contributions had an enormous impact on bioinformatics: today, virtually any DNA aligner employs compressed indexes. Recent trends have considered more powerful compression schemes (dictionary compressors) and generalizations of the problem to labeled graphs: after all, texts can be viewed as labeled directed paths. In turn, since finite state automata can be considered a particular case of labeled graphs, these findings created a bridge between the fields of compressed indexing and regular language theory, ultimately making it possible to index regular languages and promising to shed new light on problems such as regular expression matching. This survey is a gentle introduction to the main landmarks of the fascinating journey that took us from suffix trees to today’s compressed indexes for labeled graphs and regular languages.
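The count and locate queries described above can be sketched with a plain suffix array, a simpler relative of the suffix tree. This is an illustrative stand-in, not one of the indexes covered by the survey: construction here is naive, a query costs O(m log n) string comparisons rather than time proportional to the pattern length alone, and nothing is compressed. All function names are assumptions made for the example.

```python
def build_suffix_array(text: str) -> list[int]:
    """Start positions of all suffixes in lexicographic order (naive build)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def _bound(text: str, sa: list[int], pattern: str, strict: bool) -> int:
    """First index in `sa` whose suffix prefix is >= pattern (or > if strict)."""
    lo, hi, m = 0, len(sa), len(pattern)
    while lo < hi:
        mid = (lo + hi) // 2
        prefix = text[sa[mid]:sa[mid] + m]
        if prefix < pattern or (strict and prefix == pattern):
            lo = mid + 1
        else:
            hi = mid
    return lo


def locate(text: str, sa: list[int], pattern: str) -> list[int]:
    """All start positions of `pattern` in `text`, in increasing order."""
    lo = _bound(text, sa, pattern, strict=False)
    hi = _bound(text, sa, pattern, strict=True)
    return sorted(sa[lo:hi])


def count(text: str, sa: list[int], pattern: str) -> int:
    """Number of occurrences of `pattern` in `text`."""
    return len(locate(text, sa, pattern))


text = "banana"
sa = build_suffix_array(text)
print(locate(text, sa, "ana"))  # [1, 3]
print(count(text, sa, "nan"))   # 1
```

The index is built once and reused for every query, which is the off-line pre-processing step the abstract refers to. All suffixes starting with the pattern form one contiguous range of the suffix array, so two binary searches are enough to delimit them.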
