Linear-Time Construction of Suffix Arrays

I describe several new efficient algorithms for querying large annotated corpora. The search algorithms implemented in several popular corpus search engines are suboptimal in two respects: regular expression string matching in the lexicon is done in linear time, and regular expressions over corpus positions are evaluated starting from the corpus positions that match the constraints on the initial edges of the corresponding network. To address these shortcomings, I have developed an algorithm for regular expression matching on suffix arrays that allows fast lexicon lookup, and a technique for running finite state automata starting from the edges with the lowest corpus counts. Implementing the lexicon as a suffix array also lends itself to an elegant and efficient treatment of multi-valued and set-valued attributes. The described techniques have been implemented in a fully functional corpus management system and are also used in a treebank query system.
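The abstract does not spell out its regular expression algorithm, so the following is only a minimal sketch of the basic operation a suffix-array lexicon rests on: finding every occurrence of a fixed string with two binary searches. The function names are hypothetical, and the suffix array is built by naive sorting rather than by any of the linear-time algorithms listed on this page.

```python
def build_suffix_array(text: str) -> list[int]:
    """Naive construction by sorting all suffixes (illustration only, not linear time)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text: str, sa: list[int], pattern: str) -> list[int]:
    """Return the sorted start positions of `pattern` in `text`.

    Suffixes that begin with `pattern` form one contiguous block of the
    suffix array, so two binary searches locate the block boundaries.
    """
    m = len(pattern)
    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])
```

Each lookup costs O(m log n) character comparisons for a pattern of length m over a text of length n, compared with a linear scan of the whole lexicon.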


Linear-Time Longest-Common-Prefix Computation in Suffix Arrays and Its Applications

Combinatorial Pattern Matching - Lecture Notes in Computer Science ◽

10.1007/3-540-48194-x_17 ◽

2001 ◽

pp. 181-192 ◽

Cited By ~ 197

Author(s):

Toru Kasai ◽

Gunho Lee ◽

Hiroki Arimura ◽

Setsuo Arikawa ◽

Kunsoo Park

Keyword(s):

Linear Time ◽

Suffix Arrays ◽

Common Prefix


Alphabet-independent linear-time construction of compressed suffix arrays using o(n log n)-bit working space

Theoretical Computer Science ◽

10.1016/j.tcs.2007.05.030 ◽

2007 ◽

Vol 385 (1-3) ◽

pp. 127-136 ◽

Cited By ~ 11

Author(s):

Joong Chae Na ◽

Kunsoo Park

Keyword(s):

Linear Time ◽

Suffix Arrays ◽

Working Space


THE VIRTUAL SUFFIX TREE

International Journal of Foundations of Computer Science ◽

10.1142/s0129054109007066 ◽

2009 ◽

Vol 20 (06) ◽

pp. 1109-1133 ◽

Cited By ~ 2

Author(s):

JIE LIN ◽

YUE JIANG ◽

DON ADJEROH

Keyword(s):

Suffix Tree ◽

Linear Time ◽

Suffix Array ◽

Intermediate Step ◽

Suffix Trees ◽

String Length ◽

Space Requirement ◽

Suffix Arrays ◽

Tree Construction ◽

Efficient Data

We introduce the VST (virtual suffix tree), an efficient data structure for suffix trees and suffix arrays. Starting from the suffix array, we construct the suffix tree, from which we derive the virtual suffix tree. Later, we remove the intermediate step of suffix tree construction and build the VST directly from the suffix array. The VST provides the same functionality as the suffix tree, including suffix links, but with a much smaller space requirement. It retains linear-time construction even for large alphabets Σ, requires O(n) space (where n is the string length), and allows a pattern of length m to be searched for in O(m log |Σ|) time, the same time needed with a suffix tree. Given the VST, we show an algorithm that computes all the suffix links in linear time, independent of Σ. The VST requires less space than other recently proposed data structures for suffix trees and suffix arrays, such as the enhanced suffix array [1] and the linearized suffix tree [17]. On average, the space requirement (including that for suffix arrays and suffix links) is 13.8n bytes for the regular VST and 12.05n bytes in its compact form.
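The abstract does not say which auxiliary data the VST construction uses. A standard companion when deriving tree structure from a suffix array is the longest-common-prefix (LCP) array, which the Kasai et al. paper listed above computes in linear time. The sketch below follows that well-known scheme; the function name and the convention that `lcp[0]` is 0 are assumptions made for illustration.

```python
def lcp_array(text: str, sa: list[int]) -> list[int]:
    """Compute lcp[i] = length of the longest common prefix of the suffixes
    starting at sa[i-1] and sa[i], with lcp[0] = 0 by convention.

    Suffixes are visited in text order, so the current match length h
    drops by at most 1 per step, which gives O(n) total work.
    """
    n = len(text)
    rank = [0] * n              # rank[p] = index of suffix p within sa
    for i, start in enumerate(sa):
        rank[start] = i

    lcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] == 0:        # lexicographically smallest suffix has no predecessor
            h = 0
            continue
        j = sa[rank[i] - 1]     # suffix that precedes suffix i in sorted order
        while i + h < n and j + h < n and text[i + h] == text[j + h]:
            h += 1
        lcp[rank[i]] = h
        if h > 0:
            h -= 1              # the next suffix shares at least h - 1 characters
    return lcp
```

For the string "banana" with suffix array [5, 3, 1, 0, 4, 2], the function returns [0, 1, 3, 0, 0, 2]. Adjacent suffix-array entries with a large LCP value share a deep common ancestor in the corresponding suffix tree.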


Constructing suffix arrays in linear time

Journal of Discrete Algorithms ◽

10.1016/j.jda.2004.08.019 ◽

2005 ◽

Vol 3 (2-4) ◽

pp. 126-142 ◽

Cited By ~ 64

Author(s):

Dong Kyue Kim ◽

Jeong Seop Sim ◽

Heejin Park ◽

Kunsoo Park

Keyword(s):

Linear Time ◽

Suffix Arrays


Linear-Time Construction of Compressed Suffix Arrays Using o(n log n)-Bit Working Space for Large Alphabets

Combinatorial Pattern Matching - Lecture Notes in Computer Science ◽

10.1007/11496656_6 ◽

2005 ◽

pp. 57-67 ◽

Cited By ~ 6

Author(s):

Joong Chae Na

Keyword(s):

Linear Time ◽

Suffix Arrays ◽

Working Space


Space efficient linear time construction of suffix arrays

Journal of Discrete Algorithms ◽

10.1016/j.jda.2004.08.002 ◽

2005 ◽

Vol 3 (2-4) ◽

pp. 143-156 ◽

Cited By ~ 107

Author(s):

Pang Ko ◽

Srinivas Aluru

Keyword(s):

Linear Time ◽

Suffix Arrays


Accelerated preprocessing in task of searching substrings in a string

Vestnik of Don State Technical University ◽

10.23947/1992-5980-2019-19-3-290-300 ◽

2019 ◽

Vol 19 (3) ◽

pp. 290-300

Author(s):

A. V. Mazurenko ◽

N. V. Boldyrikhin

Keyword(s):

Database Management ◽

Linear Time ◽

Rapid Development ◽

Suffix Array ◽

Database Management Systems ◽

Management Systems ◽

Research Results ◽

Error Functions ◽

Suffix Arrays ◽

Associative Search

Introduction. The rapid development of systems such as Yandex and Google has made the task of searching for substrings in a string highly relevant, and approaches to its solution are being actively investigated. This task arises in building database management systems that support associative search. It is also applicable to information security problems and to the creation of antivirus programs, where substring search algorithms are used for signature-based detection.

Materials and Methods. The solution is based on the Aho-Corasick algorithm, a typical technique for searching for substrings in a string. At the same time, a new approach to preprocessing is employed.

Research Results. It is shown that the transition function and suffix links can be constructed through suffix arrays and special mappings. The relationship between the prefix tree and suffix arrays was investigated, which led to a fundamentally new method of constructing the transition and error functions. The results obtained make it possible to substantially shorten the time spent on preprocessing a set of pattern strings when an integer alphabet is used. The paper lists eight algorithms, evaluates them, and compares the results with previously known ones. Two theorems and eight lemmas are proved. Two examples illustrating the practical application of the developed preprocessing procedure are given.

Discussion and Conclusions. The preprocessing procedure proposed in this paper is based on the relationship between the suffix array built from a set of pattern strings and the construction of the transition and error functions at the initial stages of the Aho-Corasick algorithm. This approach differs from the traditional one and requires algorithms that build a suffix array in linear time. Thus, the paper describes algorithms that significantly reduce the preprocessing time for a set of pattern strings, for a certain type of alphabet, compared with the known approach of the Aho-Corasick algorithm. The research results can be used in antivirus programs that search for signatures of malicious data objects in the memory of a computer system. In addition, this approach to substring search can significantly speed up database management systems that use associative search.
