Suffix Tree Data Structures for Matrices

Author(s):  
R. Giancarlo ◽  
R. Grossi

We discuss the suffix tree generalization to matrices in this chapter. We extend the suffix tree notion (described in Chapter 3) from text strings to text matrices whose entries are taken from an ordered alphabet with the aim of solving pattern-matching problems. This suffix tree generalization can be efficiently used to implement low-level routines for Computer Vision, Data Compression, Geographic Information Systems and Visual Databases. We examine the submatrices in the form of the text’s contiguous parts that still have a matrix shape. Representing these text submatrices as “suitably formatted” strings stored in a compacted trie is the rationale behind suffix trees for matrices. The choice of the format inevitably influences suffix tree construction time and space complexity. We first deal with square matrices and show that many suffix tree families can be defined for the same input matrix according to the matrix’s string representations. We can store each suffix tree in linear space and give an efficient construction algorithm whose input is both the matrix and the string representation chosen. We then treat rectangular matrices and define their corresponding suffix trees by means of some general rules which we list formally. We show that there is a super-linear lower bound to the space required (in contrast with the linear space required by suffix trees for square matrices). We give a simple example of one of these suffix trees. The last part of the chapter illustrates some technical results regarding suffix trees for square matrices: we show how to achieve an expected linear-time suffix tree construction for a constant-size alphabet under some mild probabilistic assumptions about the input distribution. We begin by defining a wide class of string representations for square matrices. We let Σ denote an ordered alphabet of characters and introduce another alphabet of five special characters, called shapes. A shape is one of the special characters taken from set {IN,SW,NW,SE,NE}. Shape IN encodes the 1x1 matrix generated from the empty matrix by creating a square.

2009 ◽  
Vol 20 (06) ◽  
pp. 1109-1133 ◽  
Author(s):  
JIE LIN ◽  
YUE JIANG ◽  
DON ADJEROH

We introduce the VST (virtual suffix tree), an efficient data structure for suffix trees and suffix arrays. Starting from the suffix array, we construct the suffix tree, from which we derive the virtual suffix tree. Later, we remove the intermediate step of suffix tree construction, and build the VST directly from the suffix array. The VST provides the same functionality as the suffix tree, including suffix links, but at a much smaller space requirement. It has the same linear time construction even for large alphabets, Σ, requires O(n) space to store (n is the string length), and allows searching for a pattern of length m to be performed in O(m log |Σ|) time, the same time needed for a suffix tree. Given the VST, we show an algorithm that computes all the suffix links in linear time, independent of Σ. The VST requires less space than other recently proposed data structures for suffix trees and suffix arrays, such as the enhanced suffix array [1], and the linearized suffix tree [17]. On average, the space requirement (including that for suffix arrays and suffix links) is 13.8n bytes for the regular VST, and 12.05n bytes in its compact form.


1996 ◽  
Vol 06 (01) ◽  
pp. 35-44 ◽  
Author(s):  
DANY BRESLAUER ◽  
RAMESH HARIHARAN

This paper gives optimal parallel algorithms for the construction of the smallest deterministic finite automata recognizing all the suffixes and the factors of a string. The algorithms use recently discovered optimal parallel suffix tree construction algorithms together with data structures for the efficient manipulation of trees, exploiting the well known relation between suffix and factor automata and suffix trees.


Algorithms ◽  
2021 ◽  
Vol 14 (6) ◽  
pp. 161
Author(s):  
Dominik Köppl

We present linear-time algorithms computing the reversed Lempel–Ziv factorization [Kolpakov and Kucherov, TCS’09] within the space bounds of two different suffix tree representations. We can adapt these algorithms to compute the longest previous non-overlapping reverse factor table [Crochemore et al., JDA’12] within the same space but pay a multiplicative logarithmic time penalty.


Algorithms ◽  
2021 ◽  
Vol 14 (2) ◽  
pp. 44
Author(s):  
Dominik Köppl

We present algorithms computing the non-overlapping Lempel–Ziv-77 factorization and the longest previous non-overlapping factor table within small space in linear or near-linear time with the help of modern suffix tree representations fitting into limited space. With similar techniques, we show how to answer substring compression queries for the Lempel–Ziv-78 factorization with a possible logarithmic multiplicative slowdown depending on the used suffix tree representation.


2019 ◽  
Vol 20 (S25) ◽  
Author(s):  
Jin Zhao ◽  
Haodi Feng ◽  
Daming Zhu ◽  
Chi Zhang ◽  
Ying Xu

Abstract Background Alternative splicing allows the pre-mRNAs of a gene to be spliced into various mRNAs, which greatly increases the diversity of proteins. High-throughput sequencing of mRNAs has revolutionized our ability for transcripts reconstruction. However, the massive size of short reads makes de novo transcripts assembly an algorithmic challenge. Results We develop a novel radical framework, called DTA-SiST, for de novo transcriptome assembly based on suffix trees. DTA-SiST first extends contigs by reads that have the longest overlaps with the contigs’ terminuses. These reads can be found in linear time of the lengths of the reads through a well-designed suffix tree structure. Then, DTA-SiST constructs splicing graphs based on contigs for each gene locus. Finally, DTA-SiST proposes two strategies to extract transcript-representing paths: a depth-first enumeration strategy and a hybrid strategy based on length and coverage. We implemented the above two strategies and compared them with the state-of-the-art de novo assemblers on both simulated and real datasets. Experimental results showed that the depth-first enumeration strategy performs always better with recall and also better with precision for smaller datasets while the hybrid strategy leads with precision for big datasets. Conclusions DTA-SiST performs more competitive than the other compared de novo assemblers especially with precision measure, due to the read-based contig extension strategy and the elegant transcripts extraction rules.


1995 ◽  
Vol 2 (16) ◽  
Author(s):  
Dany Breslauer ◽  
Ramesh Hariharan

This paper gives optimal parallel algorithms for the construction of the smallest deterministic finite automata recognizing all the suffixes and the factors of a string. The algorithms use recently discovered optimal parallel suffix tree construction algorithms together with data structures for the efficient manipulation of trees, exploiting the well known relation between suffix and factor automata and suffix trees.


Author(s):  
Wing Kai Hon ◽  
Tsung Han Ku ◽  
R. Shah ◽  
S. V. Thankachan

1995 ◽  
Vol 2 (47) ◽  
Author(s):  
Dany Breslauer

This paper gives a linear-time algorithm for the construction of the<br />suffix tree of a tree. The suffix tree of a tree is used to obtain an efficient<br />algorithm for the minimization of sequential transducers.


Sign in / Sign up

Export Citation Format

Share Document