Designing efficient algorithms for querying large corpora

2021 ◽  
Vol 11 (2) ◽  
pp. 283-302
Author(s):  
Paul Meurer

I describe several new efficient algorithms for querying large annotated corpora. The search algorithms as implemented in several popular corpus search engines are suboptimal in two respects: regular expression string matching in the lexicon is done in linear time, and regular expressions over corpus positions are evaluated starting in those corpus positions that match the constraints of the initial edges of the corresponding network. To address these shortcomings, I have developed an algorithm for regular expression matching on suffix arrays that allows fast lexicon lookup, and a technique for running finite state automata from the edges with the lowest corpus counts. The implementation of the lexicon as a suffix array also lends itself to an elegant and efficient treatment of multi-valued and set-valued attributes. The described techniques have been implemented in a fully functional corpus management system and are also used in a treebank query system.
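
The suffix-array lookup at the heart of this approach can be sketched in Python. This is a minimal illustration of literal-prefix search over a lexicon string, the building block that the paper generalizes to full regular expressions; all names are illustrative, not the author's implementation.

```python
def build_suffix_array(text):
    """Naive O(n^2 log n) construction; adequate for a small lexicon demo."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def prefix_range(text, sa, pat):
    """Binary-search the suffix array for all suffixes starting with pat."""
    def pref(i):
        return text[sa[i]:sa[i] + len(pat)]
    lo, hi = 0, len(sa)
    while lo < hi:                      # lower bound
        mid = (lo + hi) // 2
        if pref(mid) < pat:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:                      # upper bound
        mid = (lo + hi) // 2
        if pref(mid) <= pat:
            lo = mid + 1
        else:
            hi = mid
    return [sa[i] for i in range(start, lo)]

text = "banana$"
sa = build_suffix_array(text)
print(sorted(prefix_range(text, sa, "an")))  # text positions where "an" begins
```

Because the matching suffixes form one contiguous run in the sorted array, each lookup costs O(m log n) character comparisons rather than a linear scan of the lexicon.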

2009 ◽  
Vol 2009 ◽  
pp. 1-10 ◽  
Author(s):  
Yi-Hua E. Yang ◽  
Viktor K. Prasanna

We present a software toolchain for constructing large-scale regular expression matching (REM) on FPGA. The software automates the conversion of regular expressions into compact and high-performance nondeterministic finite automata (RE-NFA). Each RE-NFA is described as an RTL regular expression matching engine (REME) in VHDL for FPGA implementation. Assuming a fixed number of fan-out transitions per state, an n-state, m-bytes-per-cycle RE-NFA can be constructed in O(n×m) time and O(n×m) memory by our software. A large number of RE-NFAs are placed onto a two-dimensional staged pipeline, allowing scalability to thousands of RE-NFAs with linear area increase and little clock rate penalty due to scaling. On a PC with a 2 GHz Athlon64 processor and 2 GB memory, our prototype software constructs hundreds of RE-NFAs used by Snort in less than 10 seconds. We also designed a benchmark generator which can produce RE-NFAs with configurable pattern complexity parameters, including state count, state fan-in, loop-back and feed-forward distances. Several regular expressions with various complexities are used to test the performance of our RE-NFA construction software.
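
The per-byte, one-bit-per-state update that each RTL matching engine performs has a classic software analogue: the bit-parallel Shift-And NFA simulation, where every NFA state is a bit and one word operation advances all states at once. A minimal sketch for a literal pattern (the hardware generalizes this to full regular expressions):

```python
def shift_and_match(pattern, text):
    """Report start offsets of pattern in text via bit-parallel NFA states."""
    masks = {}                               # per-character transition masks
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)
    accept = 1 << (len(pattern) - 1)         # bit of the accepting state
    state, hits = 0, []
    for pos, ch in enumerate(text):
        # inject the start state, shift all active states, filter by input byte
        state = ((state << 1) | 1) & masks.get(ch, 0)
        if state & accept:
            hits.append(pos - len(pattern) + 1)
    return hits

print(shift_and_match("aba", "ababa"))  # overlapping matches at 0 and 2
```

Each loop iteration corresponds to one clock cycle of a one-byte-per-cycle engine; the m-bytes-per-cycle variants in the paper unroll this update in hardware.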


1996 ◽  
Vol 2 (4) ◽  
pp. 305-328 ◽  
Author(s):  
L. KARTTUNEN ◽  
J-P. CHANOD ◽  
G. GREFENSTETTE ◽  
A. SCHILLER

Many of the processing steps in natural language engineering can be performed using finite state transducers. An optimal way to create such transducers is to compile them from regular expressions. This paper is an introduction to the regular expression calculus, extended with certain operators that have proved very useful in natural language applications ranging from tokenization to light parsing. The examples in the paper illustrate in concrete detail some of these applications.
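
Python's re engine lacks the transducer operators the paper introduces, but a simple tokenizer built from regular-expression alternation illustrates the kind of application described, from tokenization upward. The token classes below are illustrative choices, not the paper's calculus.

```python
import re

# Alternation of token classes; earlier alternatives take priority at a position.
TOKEN = re.compile(r"\d+(?:\.\d+)?"      # numbers, with optional decimal part
                   r"|\w+(?:'\w+)*"      # words, allowing internal apostrophes
                   r"|[^\w\s]")          # any other single non-space symbol

print(TOKEN.findall("Dogs cost $3.50, don't they?"))
```

A compiled transducer as in the paper would perform the same segmentation in a single deterministic pass, and could additionally rewrite the tokens it recognizes.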


2021 ◽  
Vol 31 ◽  
Author(s):  
ANDRZEJ FILINSKI

We show how to systematically derive an efficient regular expression (regex) matcher using a variety of program transformation techniques, but very little specialized formal language and automata theory. Starting from the standard specification of the set-theoretic semantics of regular expressions, we proceed via a continuation-based backtracking matcher, to a classical, table-driven state machine. All steps of the development are supported by self-contained (and machine-verified) equational correctness proofs.
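
A matcher close to the set-theoretic starting point can be written directly with Brzozowski derivatives: repeatedly differentiate the regular expression with respect to each input character and test the result for the empty string. This is a plain, unverified Python sketch of the textbook construction, not the paper's derived matcher.

```python
# tuple-encoded regexes: ('nul',) empty language, ('eps',) empty string,
# ('chr', c), ('alt', r, s), ('seq', r, s), ('star', r)
def nullable(r):
    t = r[0]
    if t in ('eps', 'star'):
        return True
    if t in ('chr', 'nul'):
        return False
    if t == 'alt':
        return nullable(r[1]) or nullable(r[2])
    return nullable(r[1]) and nullable(r[2])        # 'seq'

def deriv(r, c):
    t = r[0]
    if t in ('eps', 'nul'):
        return ('nul',)
    if t == 'chr':
        return ('eps',) if r[1] == c else ('nul',)
    if t == 'alt':
        return ('alt', deriv(r[1], c), deriv(r[2], c))
    if t == 'star':
        return ('seq', deriv(r[1], c), r)
    d = ('seq', deriv(r[1], c), r[2])               # 'seq'
    return ('alt', d, deriv(r[2], c)) if nullable(r[1]) else d

def matches(r, s):
    for c in s:
        r = deriv(r, c)
    return nullable(r)

AB = ('seq', ('star', ('chr', 'a')), ('chr', 'b'))  # a*b
print(matches(AB, 'aaab'), matches(AB, 'aba'))
```

Memoizing the derivatives of this matcher is one route to the table-driven state machine the paper arrives at.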


2011 ◽  
Vol 60 (10) ◽  
pp. 1471-1484 ◽  
Author(s):  
Ge Nong ◽  
Sen Zhang ◽  
Wai Hong Chan

2009 ◽  
Vol 20 (06) ◽  
pp. 1069-1086
Author(s):  
WIKUS COETSER ◽  
DERRICK G. KOURIE ◽  
BRUCE W. WATSON

The consequences of regular expression hashing as a means of finite state automaton reduction are explored, based on variations of Brzozowski's algorithm. In this approach, each hash collision results in the merging of the automaton's states, and it is subsequently shown that a super-automaton will always be constructed, regardless of the hash function used. Since direct adaptation of the classical Brzozowski algorithm leads to a non-deterministic super-automaton, a new algorithm is put forward for constructing a deterministic FA. Approaches are proposed for measuring the quality of a hash function. These ideas are empirically tested on a large sample of relatively small regular expressions and their associated automata, as well as on a small sample of relatively large regular expressions. Differences in the quality of tested hash functions are observed. Possible reasons for this are mentioned, but future empirical work is required to investigate the matter.
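
The baseline construction can be sketched as the standard derivative-based automaton build, parameterized by a state-identification function `key`: with an exact canonical form as the key you get the usual Brzozowski automaton, and substituting a lossy hash for `key` is what merges states into a super-automaton as studied in the paper. The sketch below uses an exact key and basic similarity rules, and is illustrative only.

```python
def simp(r):
    """Basic similarity rules so the set of derivatives stays finite."""
    t = r[0]
    if t == 'seq':
        a, b = simp(r[1]), simp(r[2])
        if a == ('nul',) or b == ('nul',):
            return ('nul',)
        if a == ('eps',):
            return b
        return a if b == ('eps',) else ('seq', a, b)
    if t == 'alt':
        a, b = simp(r[1]), simp(r[2])
        if a == ('nul',):
            return b
        return a if b == ('nul',) or a == b else ('alt', a, b)
    return ('star', simp(r[1])) if t == 'star' else r

def nullable(r):
    t = r[0]
    if t in ('eps', 'star'):
        return True
    if t in ('chr', 'nul'):
        return False
    if t == 'alt':
        return nullable(r[1]) or nullable(r[2])
    return nullable(r[1]) and nullable(r[2])

def deriv(r, c):
    t = r[0]
    if t in ('eps', 'nul'):
        return ('nul',)
    if t == 'chr':
        return ('eps',) if r[1] == c else ('nul',)
    if t == 'alt':
        return ('alt', deriv(r[1], c), deriv(r[2], c))
    if t == 'star':
        return ('seq', deriv(r[1], c), r)
    d = ('seq', deriv(r[1], c), r[2])
    return ('alt', d, deriv(r[2], c)) if nullable(r[1]) else d

def build_dfa(r, alphabet, key=str):
    """States are identified by key(derivative); a lossy key merges states."""
    start = simp(r)
    states, trans, work = {key(start): start}, {}, [start]
    while work:
        s = work.pop()
        for c in alphabet:
            d = simp(deriv(s, c))
            k = key(d)
            if k not in states:
                states[k] = d
                work.append(d)
            trans[(key(s), c)] = k
    return states, trans

AB = ('seq', ('star', ('chr', 'a')), ('chr', 'b'))   # a*b
states, trans = build_dfa(AB, 'ab')
print(len(states))  # 3 states: a*b, the empty string, the empty language
```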


2009 ◽  
Vol 20 (06) ◽  
pp. 1109-1133 ◽  
Author(s):  
JIE LIN ◽  
YUE JIANG ◽  
DON ADJEROH

We introduce the VST (virtual suffix tree), an efficient data structure for suffix trees and suffix arrays. Starting from the suffix array, we construct the suffix tree, from which we derive the virtual suffix tree. Later, we remove the intermediate step of suffix tree construction, and build the VST directly from the suffix array. The VST provides the same functionality as the suffix tree, including suffix links, but at a much smaller space requirement. It has the same linear time construction even for large alphabets, Σ, requires O(n) space to store (n is the string length), and allows searching for a pattern of length m to be performed in O(m log |Σ|) time, the same time needed for a suffix tree. Given the VST, we show an algorithm that computes all the suffix links in linear time, independent of Σ. The VST requires less space than other recently proposed data structures for suffix trees and suffix arrays, such as the enhanced suffix array [1], and the linearized suffix tree [17]. On average, the space requirement (including that for suffix arrays and suffix links) is 13.8n bytes for the regular VST, and 12.05n bytes in its compact form.
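
For contrast, the query functionality a VST provides (suffix links omitted here) can be illustrated with a naive suffix trie, which answers substring queries the same way but at worst-case quadratic space, against the 12–14n bytes reported for the VST. This toy structure is purely illustrative.

```python
def build_suffix_trie(s):
    """One trie node per distinct substring; O(n^2) nodes in the worst case."""
    root = {}
    for i in range(len(s)):
        node = root
        for ch in s[i:]:
            node = node.setdefault(ch, {})
    return root

def contains(root, pattern):
    """Every substring of s labels a path from the root; walk the pattern."""
    node = root
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True

trie = build_suffix_trie("banana")
print(contains(trie, "nan"), contains(trie, "nab"))
```

The VST (like the enhanced suffix array) recovers this O(m log |Σ|)-time navigation from the suffix array itself, avoiding the explicit node-per-substring blowup.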


2009 ◽  
Vol 19 (2) ◽  
pp. 173-190 ◽  
Author(s):  
SCOTT OWENS ◽  
JOHN REPPY ◽  
AARON TURON

Regular-expression derivatives are an old, but elegant, technique for compiling regular expressions to deterministic finite-state machines. It easily supports extending the regular-expression operators with boolean operations, such as intersection and complement. Unfortunately, this technique has been lost in the sands of time and few computer scientists are aware of it. In this paper, we reexamine regular-expression derivatives and report on our experiences in the context of two different functional-language implementations. The basic implementation is simple and we show how to extend it to handle large character sets (e.g., Unicode). We also show that the derivatives approach leads to smaller state machines than the traditional algorithm given by McNaughton and Yamada.
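
The boolean extensions are what make derivatives attractive: intersection and complement get derivative rules of exactly the same shape as the basic operators. A small hedged sketch with tuple-encoded regexes (illustrative only, and without the character-set and state-machine machinery of the paper):

```python
def nullable(r):
    t = r[0]
    if t in ('eps', 'star'):
        return True
    if t in ('chr', 'nul'):
        return False
    if t == 'alt':
        return nullable(r[1]) or nullable(r[2])
    if t in ('and', 'seq'):
        return nullable(r[1]) and nullable(r[2])
    return not nullable(r[1])                        # 'not'

def deriv(r, c):
    t = r[0]
    if t in ('eps', 'nul'):
        return ('nul',)
    if t == 'chr':
        return ('eps',) if r[1] == c else ('nul',)
    if t == 'alt':
        return ('alt', deriv(r[1], c), deriv(r[2], c))
    if t == 'and':                                   # same shape as 'alt'
        return ('and', deriv(r[1], c), deriv(r[2], c))
    if t == 'not':                                   # complement commutes
        return ('not', deriv(r[1], c))
    if t == 'star':
        return ('seq', deriv(r[1], c), r)
    d = ('seq', deriv(r[1], c), r[2])                # 'seq'
    return ('alt', d, deriv(r[2], c)) if nullable(r[1]) else d

def matches(r, s):
    for ch in s:
        r = deriv(r, ch)
    return nullable(r)

SIGMA = ('alt', ('chr', 'a'), ('chr', 'b'))
ANY = ('star', SIGMA)
AA = ('seq', ANY, ('seq', ('chr', 'a'), ('seq', ('chr', 'a'), ANY)))
NO_AA = ('and', ANY, ('not', AA))  # strings over {a,b} with no "aa"
print(matches(NO_AA, 'abab'), matches(NO_AA, 'abaab'))
```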


2007 ◽  
Vol 17 (01) ◽  
pp. 141-154 ◽  
Author(s):  
J.-M. CHAMPARNAUD ◽  
F. OUARDI ◽  
D. ZIADI

There exist two well-known quotients of the position automaton of a regular expression. The first one, called the equation automaton, was first introduced by Mirkin from the notion of prebase and has been redefined by Antimirov from the notion of partial derivative. The second one, due to Ilie and Yu and called the follow automaton, can be obtained by eliminating ε-transitions in an ε-NFA that is always smaller than the classical ε-NFAs (Thompson, Sippu and Soisalon–Soininen). Ilie and Yu discussed the difficulty of achieving a theoretical comparison between the size of the follow automaton and the size of the equation automaton and concluded that experimental studies would very likely be necessary. In this paper we settle the theoretical question: we first define a set of regular expressions, called normalized expressions, such that every regular expression can be normalized in linear time, and then prove that the equation automaton of a normalized expression is always smaller than its follow automaton.
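
Antimirov's partial derivatives, which generate the equation automaton, can be sketched in a few lines: unlike a Brzozowski derivative, a partial derivative is a *set* of expressions, and the reachable expressions are the automaton's states. Illustrative tuple encoding; for (a|b)*a the construction reaches just two expressions, against four states in the position automaton.

```python
def nullable(r):
    t = r[0]
    if t in ('eps', 'star'):
        return True
    if t in ('chr', 'nul'):
        return False
    if t == 'alt':
        return nullable(r[1]) or nullable(r[2])
    return nullable(r[1]) and nullable(r[2])        # 'seq'

def cat(a, b):
    """Smart concatenation used when rebuilding partial derivatives."""
    return b if a == ('eps',) else ('seq', a, b)

def pd(r, c):
    """Antimirov partial derivative: a set of regexes."""
    t = r[0]
    if t in ('eps', 'nul'):
        return set()
    if t == 'chr':
        return {('eps',)} if r[1] == c else set()
    if t == 'alt':
        return pd(r[1], c) | pd(r[2], c)
    if t == 'star':
        return {cat(x, r) for x in pd(r[1], c)}
    out = {cat(x, r[2]) for x in pd(r[1], c)}       # 'seq'
    return out | pd(r[2], c) if nullable(r[1]) else out

def equation_states(r, alphabet):
    states, work = {r}, [r]
    while work:
        s = work.pop()
        for c in alphabet:
            for d in pd(s, c):
                if d not in states:
                    states.add(d)
                    work.append(d)
    return states

R = ('seq', ('star', ('alt', ('chr', 'a'), ('chr', 'b'))), ('chr', 'a'))  # (a|b)*a
print(len(equation_states(R, 'ab')))  # 2: (a|b)*a itself and the empty string
```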


2013 ◽  
Vol 7 (1) ◽  
pp. 46-50
Author(s):  
Linhai Cui ◽  
Yusen Qin ◽  
Fanyang Kong ◽  
Kaihong Yu

This paper presents an efficient method for Regular Expression Matching (REM) that reuses Intellectual Property (IP) cores in a new Network on Chip (NoC) architecture. The method is to design a reusable IP core consisting of many engine cells for REM and to implement each engine cell on a Field Programmable Gate Array (FPGA) as a prototype. To keep each Finite State Machine (FSM) simple, a new approach is proposed for partitioning a regular expression into several smaller parts. Each part of a regular expression is matched by one engine cell, and the engine cells communicate with one another through routers in the NoC topology. The proposed NoC architecture is a general-purpose design suitable for different rule libraries in deep packet inspection (DPI). It also addresses the problem of missed matches caused by character self-depletion. A way to use both the logic cells and the RAM available on FPGA devices is described, which makes it easier to change the rule library of regular expressions stored in RAM. Finally, implementation of the NoC architecture as an application-specific integrated circuit (ASIC) is discussed.
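
The partition-and-chain idea can be sketched in software: split the pattern into segments, give each segment to its own "engine", and forward match end positions downstream, as the routers would on the NoC. The segments and chaining policy here are hypothetical, with Python's re standing in for the hardware FSMs.

```python
import re

def chained_match(text, segments):
    """Each segment acts as one engine cell; end offsets flow downstream."""
    active = {0}                           # offsets where the first engine starts
    for seg in segments:
        engine = re.compile(seg)
        downstream = set()
        for start in active:
            m = engine.match(text, start)  # anchored match at this offset
            if m:
                downstream.add(m.end())    # activate the next engine here
        active = downstream
    return bool(active)

SEGMENTS = ["foo", "[0-9]+", "bar"]        # partition of foo[0-9]+bar
print(chained_match("foo123bar", SEGMENTS), chained_match("foobar", SEGMENTS))
```

Note that this sketch only forwards the single greedy end position per start offset; a faithful engine would forward every possible end position so that later segments cannot be starved by an earlier greedy one.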

