scholarly journals Non-Overlapping LZ77 Factorization and LZ78 Substring Compression Queries with Suffix Trees

Algorithms ◽  
2021 ◽  
Vol 14 (2) ◽  
pp. 44
Author(s):  
Dominik Köppl

We present algorithms computing the non-overlapping Lempel–Ziv-77 factorization and the longest previous non-overlapping factor table within small space in linear or near-linear time with the help of modern suffix tree representations fitting into limited space. With similar techniques, we show how to answer substring compression queries for the Lempel–Ziv-78 factorization with a possible logarithmic multiplicative slowdown depending on the used suffix tree representation.

Algorithms ◽  
2021 ◽  
Vol 14 (6) ◽  
pp. 161
Author(s):  
Dominik Köppl

We present linear-time algorithms computing the reversed Lempel–Ziv factorization [Kolpakov and Kucherov, TCS’09] within the space bounds of two different suffix tree representations. We can adapt these algorithms to compute the longest previous non-overlapping reverse factor table [Crochemore et al., JDA’12] within the same space but pay a multiplicative logarithmic time penalty.


Author(s):  
R. Giancarlo ◽  
R. Grossi

We discuss the suffix tree generalization to matrices in this chapter. We extend the suffix tree notion (described in Chapter 3) from text strings to text matrices whose entries are taken from an ordered alphabet with the aim of solving pattern-matching problems. This suffix tree generalization can be efficiently used to implement low-level routines for Computer Vision, Data Compression, Geographic Information Systems and Visual Databases. We examine the submatrices in the form of the text’s contiguous parts that still have a matrix shape. Representing these text submatrices as “suitably formatted” strings stored in a compacted trie is the rationale behind suffix trees for matrices. The choice of the format inevitably influences suffix tree construction time and space complexity. We first deal with square matrices and show that many suffix tree families can be defined for the same input matrix according to the matrix’s string representations. We can store each suffix tree in linear space and give an efficient construction algorithm whose input is both the matrix and the string representation chosen. We then treat rectangular matrices and define their corresponding suffix trees by means of some general rules which we list formally. We show that there is a super-linear lower bound to the space required (in contrast with the linear space required by suffix trees for square matrices). We give a simple example of one of these suffix trees. The last part of the chapter illustrates some technical results regarding suffix trees for square matrices: we show how to achieve an expected linear-time suffix tree construction for a constant-size alphabet under some mild probabilistic assumptions about the input distribution. We begin by defining a wide class of string representations for square matrices. We let Σ denote an ordered alphabet of characters and introduce another alphabet of five special characters, called shapes. A shape is one of the special characters taken from set {IN,SW,NW,SE,NE}. Shape IN encodes the 1x1 matrix generated from the empty matrix by creating a square.


2009 ◽  
Vol 20 (06) ◽  
pp. 1109-1133 ◽  
Author(s):  
JIE LIN ◽  
YUE JIANG ◽  
DON ADJEROH

We introduce the VST (virtual suffix tree), an efficient data structure for suffix trees and suffix arrays. Starting from the suffix array, we construct the suffix tree, from which we derive the virtual suffix tree. Later, we remove the intermediate step of suffix tree construction, and build the VST directly from the suffix array. The VST provides the same functionality as the suffix tree, including suffix links, but at a much smaller space requirement. It has the same linear time construction even for large alphabets, Σ, requires O(n) space to store (n is the string length), and allows searching for a pattern of length m to be performed in O(m log |Σ|) time, the same time needed for a suffix tree. Given the VST, we show an algorithm that computes all the suffix links in linear time, independent of Σ. The VST requires less space than other recently proposed data structures for suffix trees and suffix arrays, such as the enhanced suffix array [1], and the linearized suffix tree [17]. On average, the space requirement (including that for suffix arrays and suffix links) is 13.8n bytes for the regular VST, and 12.05n bytes in its compact form.


2019 ◽  
Vol 20 (S25) ◽  
Author(s):  
Jin Zhao ◽  
Haodi Feng ◽  
Daming Zhu ◽  
Chi Zhang ◽  
Ying Xu

Abstract Background Alternative splicing allows the pre-mRNAs of a gene to be spliced into various mRNAs, which greatly increases the diversity of proteins. High-throughput sequencing of mRNAs has revolutionized our ability for transcripts reconstruction. However, the massive size of short reads makes de novo transcripts assembly an algorithmic challenge. Results We develop a novel radical framework, called DTA-SiST, for de novo transcriptome assembly based on suffix trees. DTA-SiST first extends contigs by reads that have the longest overlaps with the contigs’ terminuses. These reads can be found in linear time of the lengths of the reads through a well-designed suffix tree structure. Then, DTA-SiST constructs splicing graphs based on contigs for each gene locus. Finally, DTA-SiST proposes two strategies to extract transcript-representing paths: a depth-first enumeration strategy and a hybrid strategy based on length and coverage. We implemented the above two strategies and compared them with the state-of-the-art de novo assemblers on both simulated and real datasets. Experimental results showed that the depth-first enumeration strategy performs always better with recall and also better with precision for smaller datasets while the hybrid strategy leads with precision for big datasets. Conclusions DTA-SiST performs more competitive than the other compared de novo assemblers especially with precision measure, due to the read-based contig extension strategy and the elegant transcripts extraction rules.


1995 ◽  
Vol 2 (47) ◽  
Author(s):  
Dany Breslauer

This paper gives a linear-time algorithm for the construction of the<br />suffix tree of a tree. The suffix tree of a tree is used to obtain an efficient<br />algorithm for the minimization of sequential transducers.


Web Mining ◽  
2011 ◽  
pp. 322-338 ◽  
Author(s):  
Zhixiang Chen ◽  
Richard H. Fowler ◽  
Ada Wai-Chee Fu ◽  
Chunyue Wang

A maximal forward reference of a Web user is a longest consecutive sequence of Web pages visited by the user in a session without revisiting some previously visited page in the sequence. Efficient mining of frequent traversal path patterns, that is, large reference sequences of maximal forward references, from very large Web logs is a fundamental problem in Web mining. This chapter aims at designing algorithms for this problem with the best possible efficiency. First, two optimal linear time algorithms are designed for finding maximal forward references from Web logs. Second, two algorithms for mining frequent traversal path patterns are devised with the help of a fast construction of shallow generalized suffix trees over a very large alphabet. These two algorithms have respectively provable linear and sublinear time complexity, and their performances are analyzed in comparison with the a priori-like algorithms and the Ukkonen algorithm. It is shown that these two new algorithms are substantially more efficient than the a priori-like algorithms and the Ukkonen algorithm.


Algorithmica ◽  
2009 ◽  
Vol 59 (2) ◽  
pp. 269-297 ◽  
Author(s):  
Dong Kyue Kim ◽  
Joong Chae Na ◽  
Jeong Seop Sim ◽  
Kunsoo Park

Author(s):  
Krisztián Monostori ◽  
Arkady Zaslavsky ◽  
István Vajk

2006 ◽  
Vol 17 (06) ◽  
pp. 1385-1402 ◽  
Author(s):  
Élise Prieur ◽  
Thierry Lecroq

We present a first formal setting for suffix vectors that are space economical alternative data structures to suffix trees. We give two linear algorithms for converting a suffix tree into a suffix vector and conversely. We enrich suffix vectors with formulas for counting the number of occurrences of repeated substrings. We also propose an alternative implementation for suffix vectors that should outperform the existing one.


Sign in / Sign up

Export Citation Format

Share Document