Reversed Lempel–Ziv Factorization with Suffix Trees

We present linear-time algorithms computing the reversed Lempel–Ziv factorization [Kolpakov and Kucherov, TCS’09] within the space bounds of two different suffix tree representations. We can adapt these algorithms to compute the longest previous non-overlapping reverse factor table [Crochemore et al., JDA’12] within the same space but pay a multiplicative logarithmic time penalty.

Download Full-text

Non-Overlapping LZ77 Factorization and LZ78 Substring Compression Queries with Suffix Trees

Algorithms ◽

10.3390/a14020044 ◽

2021 ◽

Vol 14 (2) ◽

pp. 44

Author(s):

Dominik Köppl

Keyword(s):

Suffix Tree ◽

Linear Time ◽

Suffix Trees ◽

Small Space ◽

Tree Representation ◽

Limited Space ◽

Tree Representations

We present algorithms computing the non-overlapping Lempel–Ziv-77 factorization and the longest previous non-overlapping factor table within small space in linear or near-linear time with the help of modern suffix tree representations fitting into limited space. With similar techniques, we show how to answer substring compression queries for the Lempel–Ziv-78 factorization with a possible logarithmic multiplicative slowdown depending on the used suffix tree representation.

Download Full-text

Efficient Web Mining for Traversal Path Patterns

Web Mining ◽

10.4018/978-1-59140-414-9.ch015 ◽

2011 ◽

pp. 322-338 ◽

Cited By ~ 1

Author(s):

Zhixiang Chen ◽

Richard H. Fowler ◽

Ada Wai-Chee Fu ◽

Chunyue Wang

Keyword(s):

Web Mining ◽

Linear Time ◽

Fundamental Problem ◽

A Priori ◽

Web Pages ◽

Suffix Trees ◽

Web Logs ◽

Large Alphabet ◽

Optimal Linear ◽

Linear Time Algorithms

A maximal forward reference of a Web user is a longest consecutive sequence of Web pages visited by the user in a session without revisiting some previously visited page in the sequence. Efficient mining of frequent traversal path patterns, that is, large reference sequences of maximal forward references, from very large Web logs is a fundamental problem in Web mining. This chapter aims at designing algorithms for this problem with the best possible efficiency. First, two optimal linear time algorithms are designed for finding maximal forward references from Web logs. Second, two algorithms for mining frequent traversal path patterns are devised with the help of a fast construction of shallow generalized suffix trees over a very large alphabet. These two algorithms have respectively provable linear and sublinear time complexity, and their performances are analyzed in comparison with the a priori-like algorithms and the Ukkonen algorithm. It is shown that these two new algorithms are substantially more efficient than the a priori-like algorithms and the Ukkonen algorithm.

Download Full-text

Suffix Tree Data Structures for Matrices

Pattern Matching Algorithms ◽

10.1093/oso/9780195113679.003.0013 ◽

1997 ◽

Author(s):

R. Giancarlo ◽

R. Grossi

Keyword(s):

Linear Space ◽

Suffix Tree ◽

Linear Time ◽

Suffix Trees ◽

Construction Time ◽

Matching Problems ◽

Tree Construction ◽

The Matrix ◽

Visual Databases ◽

Efficient Construction

We discuss the suffix tree generalization to matrices in this chapter. We extend the suffix tree notion (described in Chapter 3) from text strings to text matrices whose entries are taken from an ordered alphabet with the aim of solving pattern-matching problems. This suffix tree generalization can be efficiently used to implement low-level routines for Computer Vision, Data Compression, Geographic Information Systems and Visual Databases. We examine the submatrices in the form of the text’s contiguous parts that still have a matrix shape. Representing these text submatrices as “suitably formatted” strings stored in a compacted trie is the rationale behind suffix trees for matrices. The choice of the format inevitably influences suffix tree construction time and space complexity. We first deal with square matrices and show that many suffix tree families can be defined for the same input matrix according to the matrix’s string representations. We can store each suffix tree in linear space and give an efficient construction algorithm whose input is both the matrix and the string representation chosen. We then treat rectangular matrices and define their corresponding suffix trees by means of some general rules which we list formally. We show that there is a super-linear lower bound to the space required (in contrast with the linear space required by suffix trees for square matrices). We give a simple example of one of these suffix trees. The last part of the chapter illustrates some technical results regarding suffix trees for square matrices: we show how to achieve an expected linear-time suffix tree construction for a constant-size alphabet under some mild probabilistic assumptions about the input distribution. We begin by defining a wide class of string representations for square matrices. We let Σ denote an ordered alphabet of characters and introduce another alphabet of five special characters, called shapes. A shape is one of the special characters taken from set {IN,SW,NW,SE,NE}. Shape IN encodes the 1x1 matrix generated from the empty matrix by creating a square.

Download Full-text

THE VIRTUAL SUFFIX TREE

International Journal of Foundations of Computer Science ◽

10.1142/s0129054109007066 ◽

2009 ◽

Vol 20 (06) ◽

pp. 1109-1133 ◽

Cited By ~ 2

Author(s):

JIE LIN ◽

YUE JIANG ◽

DON ADJEROH

Keyword(s):

Suffix Tree ◽

Linear Time ◽

Suffix Array ◽

Intermediate Step ◽

Suffix Trees ◽

String Length ◽

Space Requirement ◽

Suffix Arrays ◽

Tree Construction ◽

Efficient Data

We introduce the VST (virtual suffix tree), an efficient data structure for suffix trees and suffix arrays. Starting from the suffix array, we construct the suffix tree, from which we derive the virtual suffix tree. Later, we remove the intermediate step of suffix tree construction, and build the VST directly from the suffix array. The VST provides the same functionality as the suffix tree, including suffix links, but at a much smaller space requirement. It has the same linear time construction even for large alphabets, Σ, requires O(n) space to store (n is the string length), and allows searching for a pattern of length m to be performed in O(m log |Σ|) time, the same time needed for a suffix tree. Given the VST, we show an algorithm that computes all the suffix links in linear time, independent of Σ. The VST requires less space than other recently proposed data structures for suffix trees and suffix arrays, such as the enhanced suffix array [1], and the linearized suffix tree [17]. On average, the space requirement (including that for suffix arrays and suffix links) is 13.8n bytes for the regular VST, and 12.05n bytes in its compact form.

Download Full-text

DTA-SiST: de novo transcriptome assembly by using simplified suffix trees

BMC Bioinformatics ◽

10.1186/s12859-019-3272-9 ◽

2019 ◽

Vol 20 (S25) ◽

Author(s):

Jin Zhao ◽

Haodi Feng ◽

Daming Zhu ◽

Chi Zhang ◽

Ying Xu

Keyword(s):

Suffix Tree ◽

High Throughput Sequencing ◽

De Novo ◽

State Of The Art ◽

Linear Time ◽

Transcriptome Assembly ◽

De Novo Transcriptome Assembly ◽

Suffix Trees ◽

De Novo Transcriptome ◽

Hybrid Strategy

Abstract Background Alternative splicing allows the pre-mRNAs of a gene to be spliced into various mRNAs, which greatly increases the diversity of proteins. High-throughput sequencing of mRNAs has revolutionized our ability for transcripts reconstruction. However, the massive size of short reads makes de novo transcripts assembly an algorithmic challenge. Results We develop a novel radical framework, called DTA-SiST, for de novo transcriptome assembly based on suffix trees. DTA-SiST first extends contigs by reads that have the longest overlaps with the contigs’ terminuses. These reads can be found in linear time of the lengths of the reads through a well-designed suffix tree structure. Then, DTA-SiST constructs splicing graphs based on contigs for each gene locus. Finally, DTA-SiST proposes two strategies to extract transcript-representing paths: a depth-first enumeration strategy and a hybrid strategy based on length and coverage. We implemented the above two strategies and compared them with the state-of-the-art de novo assemblers on both simulated and real datasets. Experimental results showed that the depth-first enumeration strategy performs always better with recall and also better with precision for smaller datasets while the hybrid strategy leads with precision for big datasets. Conclusions DTA-SiST performs more competitive than the other compared de novo assemblers especially with precision measure, due to the read-based contig extension strategy and the elegant transcripts extraction rules.

Download Full-text

CROCHEMORE'S REPETITIONS ALGORITHM REVISITED: COMPUTING RUNS

International Journal of Foundations of Computer Science ◽

10.1142/s0129054112400199 ◽

2012 ◽

Vol 23 (02) ◽

pp. 389-401

Author(s):

FRANTISEK FRANEK ◽

MEI JIANG

Keyword(s):

Suffix Tree ◽

Linear Time ◽

Parallel Implementation ◽

Worst Case ◽

Original Algorithm ◽

The Third ◽

Similar Strategy ◽

Linear Time Algorithms

Crochemore's repetitions algorithm introduced in 1981 was the first O(n log n) algorithm for computing repetitions. Since then, several linear-time worst-case algorithms for computing runs have been introduced. They all follow a similar strategy: first compute the suffix tree or array, then use the suffix tree or array to compute the Lempel-Ziv factorization, then using the Lempel-Ziv factorization compute all the runs. It is conceivable that in practice an extension of Crochemore's repetitions algorithm may outperform the linear-time algorithms, or at least for certain classes of strings. The nature of Crochemore's algorithm lends itself naturally to parallelization, while the linear-time algorithms are not easily conducive to parallelization. For all these reasons it is interesting to explore ways to extend the original Crochemore's repetitions algorithm to compute runs. We present three variants of extending the repetitions algorithm to compute runs: two with a worsen complexity of O(n ( log n)2), and one with the same complexity as the original algorithm. The three variants are tested for speed of performance and their memory requirements are analyzed. The third variant is tested and analyzed for various memory-saving alterations. The purpose of this research is to identify the best extension of Crochemore's algorithm for further study, comparison with other algorithms, and parallel implementation.

Download Full-text