Designing efficient algorithms for querying large corpora

2021 ◽  
Vol 11 (2) ◽  
pp. 283-302
Author(s):  
Paul Meurer

I describe several new efficient algorithms for querying large annotated corpora. The search algorithms as implemented in several popular corpus search engines are suboptimal in two respects: regular expression string matching in the lexicon is done in linear time, and regular expressions over corpus positions are evaluated starting in those corpus positions that match the constraints of the initial edges of the corresponding network. To address these shortcomings, I have developed an algorithm for regular expression matching on suffix arrays that allows fast lexicon lookup, and a technique for running finite state automata from the edges with the lowest corpus counts. The implementation of the lexicon as a suffix array also lends itself to an elegant and efficient treatment of multi-valued and set-valued attributes. The described techniques have been implemented in a fully functional corpus management system and are also used in a treebank query system.
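
The suffix-array lookup at the heart of this approach can be sketched in Python. This is a minimal illustration of literal-prefix search over a lexicon string, the building block that the paper generalizes to full regular expressions; all names are illustrative, not the author's implementation.

```python
def build_suffix_array(text):
    """Naive O(n^2 log n) construction; adequate for a small lexicon demo."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def prefix_range(text, sa, pat):
    """Binary-search the suffix array for all suffixes starting with pat."""
    def pref(i):
        return text[sa[i]:sa[i] + len(pat)]
    lo, hi = 0, len(sa)
    while lo < hi:                      # lower bound
        mid = (lo + hi) // 2
        if pref(mid) < pat:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:                      # upper bound
        mid = (lo + hi) // 2
        if pref(mid) <= pat:
            lo = mid + 1
        else:
            hi = mid
    return [sa[i] for i in range(start, lo)]

text = "banana$"
sa = build_suffix_array(text)
print(sorted(prefix_range(text, sa, "an")))  # text positions where "an" begins
```

Because the matching suffixes form one contiguous run in the sorted array, each lookup costs O(m log n) character comparisons rather than a linear scan of the lexicon.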

2009 ◽  
Vol 2009 ◽  
pp. 1-10 ◽  
Author(s):  
Yi-Hua E. Yang ◽  
Viktor K. Prasanna

We present a software toolchain for constructing large-scale regular expression matching (REM) on FPGA. The software automates the conversion of regular expressions into compact and high-performance nondeterministic finite automata (RE-NFA). Each RE-NFA is described as an RTL regular expression matching engine (REME) in VHDL for FPGA implementation. Assuming a fixed number of fan-out transitions per state, an n-state, m-bytes-per-cycle RE-NFA can be constructed in O(n×m) time and O(n×m) memory by our software. A large number of RE-NFAs are placed onto a two-dimensional staged pipeline, allowing scalability to thousands of RE-NFAs with linear area increase and little clock rate penalty due to scaling. On a PC with a 2 GHz Athlon64 processor and 2 GB memory, our prototype software constructs hundreds of RE-NFAs used by Snort in less than 10 seconds. We also designed a benchmark generator which can produce RE-NFAs with configurable pattern complexity parameters, including state count, state fan-in, loop-back and feed-forward distances. Several regular expressions with various complexities are used to test the performance of our RE-NFA construction software.
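
The per-byte, one-bit-per-state update that each RTL matching engine performs has a classic software analogue: the bit-parallel Shift-And NFA simulation, where every NFA state is a bit and one word operation advances all states at once. A minimal sketch for a literal pattern (the hardware generalizes this to full regular expressions):

```python
def shift_and_match(pattern, text):
    """Report start offsets of pattern in text via bit-parallel NFA states."""
    masks = {}                               # per-character transition masks
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)
    accept = 1 << (len(pattern) - 1)         # bit of the accepting state
    state, hits = 0, []
    for pos, ch in enumerate(text):
        # inject the start state, shift all active states, filter by input byte
        state = ((state << 1) | 1) & masks.get(ch, 0)
        if state & accept:
            hits.append(pos - len(pattern) + 1)
    return hits

print(shift_and_match("aba", "ababa"))  # overlapping matches at 0 and 2
```

Each loop iteration corresponds to one clock cycle of a one-byte-per-cycle engine; the m-bytes-per-cycle variants in the paper unroll this update in hardware.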


1996 ◽  
Vol 2 (4) ◽  
pp. 305-328 ◽  
Author(s):  
L. KARTTUNEN ◽  
J-P. CHANOD ◽  
G. GREFENSTETTE ◽  
A. SCHILLER

Many of the processing steps in natural language engineering can be performed using finite state transducers. An optimal way to create such transducers is to compile them from regular expressions. This paper is an introduction to the regular expression calculus, extended with certain operators that have proved very useful in natural language applications ranging from tokenization to light parsing. The examples in the paper illustrate in concrete detail some of these applications.
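
Python's re engine lacks the transducer operators the paper introduces, but a simple tokenizer built from regular-expression alternation illustrates the kind of application described, from tokenization upward. The token classes below are illustrative choices, not the paper's calculus.

```python
import re

# Alternation of token classes; earlier alternatives take priority at a position.
TOKEN = re.compile(r"\d+(?:\.\d+)?"      # numbers, with optional decimal part
                   r"|\w+(?:'\w+)*"      # words, allowing internal apostrophes
                   r"|[^\w\s]")          # any other single non-space symbol

print(TOKEN.findall("Dogs cost $3.50, don't they?"))
```

A compiled transducer as in the paper would perform the same segmentation in a single deterministic pass, and could additionally rewrite the tokens it recognizes.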


2021 ◽  
Vol 31 ◽  
Author(s):  
ANDRZEJ FILINSKI

We show how to systematically derive an efficient regular expression (regex) matcher using a variety of program transformation techniques, but very little specialized formal language and automata theory. Starting from the standard specification of the set-theoretic semantics of regular expressions, we proceed via a continuation-based backtracking matcher, to a classical, table-driven state machine. All steps of the development are supported by self-contained (and machine-verified) equational correctness proofs.
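
A matcher close to the set-theoretic starting point can be written directly with Brzozowski derivatives: repeatedly differentiate the regular expression with respect to each input character and test the result for the empty string. This is a plain, unverified Python sketch of the textbook construction, not the paper's derived matcher.

```python
# tuple-encoded regexes: ('nul',) empty language, ('eps',) empty string,
# ('chr', c), ('alt', r, s), ('seq', r, s), ('star', r)
def nullable(r):
    t = r[0]
    if t in ('eps', 'star'):
        return True
    if t in ('chr', 'nul'):
        return False
    if t == 'alt':
        return nullable(r[1]) or nullable(r[2])
    return nullable(r[1]) and nullable(r[2])        # 'seq'

def deriv(r, c):
    t = r[0]
    if t in ('eps', 'nul'):
        return ('nul',)
    if t == 'chr':
        return ('eps',) if r[1] == c else ('nul',)
    if t == 'alt':
        return ('alt', deriv(r[1], c), deriv(r[2], c))
    if t == 'star':
        return ('seq', deriv(r[1], c), r)
    d = ('seq', deriv(r[1], c), r[2])               # 'seq'
    return ('alt', d, deriv(r[2], c)) if nullable(r[1]) else d

def matches(r, s):
    for c in s:
        r = deriv(r, c)
    return nullable(r)

AB = ('seq', ('star', ('chr', 'a')), ('chr', 'b'))  # a*b
print(matches(AB, 'aaab'), matches(AB, 'aba'))
```

Memoizing the derivatives of this matcher is one route to the table-driven state machine the paper arrives at.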


2011 ◽  
Vol 60 (10) ◽  
pp. 1471-1484 ◽  
Author(s):  
Ge Nong ◽  
Sen Zhang ◽  
Wai Hong Chan

2009 ◽  
Vol 20 (06) ◽  
pp. 1069-1086
Author(s):  
WIKUS COETSER ◽  
DERRICK G. KOURIE ◽  
BRUCE W. WATSON

The consequences of regular expression hashing as a means of finite state automaton reduction are explored, based on variations of Brzozowski's algorithm. In this approach, each hash collision results in the merging of the automaton's states, and it is subsequently shown that a super-automaton will always be constructed, regardless of the hash function used. Since direct adaptation of the classical Brzozowski algorithm leads to a non-deterministic super-automaton, a new algorithm is put forward for constructing a deterministic FA. Approaches are proposed for measuring the quality of a hash function. These ideas are empirically tested on a large sample of relatively small regular expressions and their associated automata, as well as on a small sample of relatively large regular expressions. Differences in the quality of tested hash functions are observed. Possible reasons for this are mentioned, but future empirical work is required to investigate the matter.
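
The baseline construction can be sketched as the standard derivative-based automaton build, parameterized by a state-identification function `key`: with an exact canonical form as the key you get the usual Brzozowski automaton, and substituting a lossy hash for `key` is what merges states into a super-automaton as studied in the paper. The sketch below uses an exact key and basic similarity rules, and is illustrative only.

```python
def simp(r):
    """Basic similarity rules so the set of derivatives stays finite."""
    t = r[0]
    if t == 'seq':
        a, b = simp(r[1]), simp(r[2])
        if a == ('nul',) or b == ('nul',):
            return ('nul',)
        if a == ('eps',):
            return b
        return a if b == ('eps',) else ('seq', a, b)
    if t == 'alt':
        a, b = simp(r[1]), simp(r[2])
        if a == ('nul',):
            return b
        return a if b == ('nul',) or a == b else ('alt', a, b)
    return ('star', simp(r[1])) if t == 'star' else r

def nullable(r):
    t = r[0]
    if t in ('eps', 'star'):
        return True
    if t in ('chr', 'nul'):
        return False
    if t == 'alt':
        return nullable(r[1]) or nullable(r[2])
    return nullable(r[1]) and nullable(r[2])

def deriv(r, c):
    t = r[0]
    if t in ('eps', 'nul'):
        return ('nul',)
    if t == 'chr':
        return ('eps',) if r[1] == c else ('nul',)
    if t == 'alt':
        return ('alt', deriv(r[1], c), deriv(r[2], c))
    if t == 'star':
        return ('seq', deriv(r[1], c), r)
    d = ('seq', deriv(r[1], c), r[2])
    return ('alt', d, deriv(r[2], c)) if nullable(r[1]) else d

def build_dfa(r, alphabet, key=str):
    """States are identified by key(derivative); a lossy key merges states."""
    start = simp(r)
    states, trans, work = {key(start): start}, {}, [start]
    while work:
        s = work.pop()
        for c in alphabet:
            d = simp(deriv(s, c))
            k = key(d)
            if k not in states:
                states[k] = d
                work.append(d)
            trans[(key(s), c)] = k
    return states, trans

AB = ('seq', ('star', ('chr', 'a')), ('chr', 'b'))   # a*b
states, trans = build_dfa(AB, 'ab')
print(len(states))  # 3 states: a*b, the empty string, the empty language
```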


2009 ◽  
Vol 20 (06) ◽  
pp. 1109-1133 ◽  
Author(s):  
JIE LIN ◽  
YUE JIANG ◽  
DON ADJEROH

We introduce the VST (virtual suffix tree), an efficient data structure for suffix trees and suffix arrays. Starting from the suffix array, we construct the suffix tree, from which we derive the virtual suffix tree. Later, we remove the intermediate step of suffix tree construction, and build the VST directly from the suffix array. The VST provides the same functionality as the suffix tree, including suffix links, but at a much smaller space requirement. It has the same linear time construction even for large alphabets, Σ, requires O(n) space to store (n is the string length), and allows searching for a pattern of length m to be performed in O(m log |Σ|) time, the same time needed for a suffix tree. Given the VST, we show an algorithm that computes all the suffix links in linear time, independent of Σ. The VST requires less space than other recently proposed data structures for suffix trees and suffix arrays, such as the enhanced suffix array [1], and the linearized suffix tree [17]. On average, the space requirement (including that for suffix arrays and suffix links) is 13.8n bytes for the regular VST, and 12.05n bytes in its compact form.
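
For contrast, the query functionality a VST provides (suffix links omitted here) can be illustrated with a naive suffix trie, which answers substring queries the same way but at worst-case quadratic space, against the 12–14n bytes reported for the VST. This toy structure is purely illustrative.

```python
def build_suffix_trie(s):
    """One trie node per distinct substring; O(n^2) nodes in the worst case."""
    root = {}
    for i in range(len(s)):
        node = root
        for ch in s[i:]:
            node = node.setdefault(ch, {})
    return root

def contains(root, pattern):
    """Every substring of s labels a path from the root; walk the pattern."""
    node = root
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True

trie = build_suffix_trie("banana")
print(contains(trie, "nan"), contains(trie, "nab"))
```

The VST (like the enhanced suffix array) recovers this O(m log |Σ|)-time navigation from the suffix array itself, avoiding the explicit node-per-substring blowup.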


2009 ◽  
Vol 19 (2) ◽  
pp. 173-190 ◽  
Author(s):  
SCOTT OWENS ◽  
JOHN REPPY ◽  
AARON TURON

Regular-expression derivatives are an old, but elegant, technique for compiling regular expressions to deterministic finite-state machines. It easily supports extending the regular-expression operators with boolean operations, such as intersection and complement. Unfortunately, this technique has been lost in the sands of time and few computer scientists are aware of it. In this paper, we reexamine regular-expression derivatives and report on our experiences in the context of two different functional-language implementations. The basic implementation is simple and we show how to extend it to handle large character sets (e.g., Unicode). We also show that the derivatives approach leads to smaller state machines than the traditional algorithm given by McNaughton and Yamada.
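
The boolean extensions are what make derivatives attractive: intersection and complement get derivative rules of exactly the same shape as the basic operators. A small hedged sketch with tuple-encoded regexes (illustrative only, and without the character-set and state-machine machinery of the paper):

```python
def nullable(r):
    t = r[0]
    if t in ('eps', 'star'):
        return True
    if t in ('chr', 'nul'):
        return False
    if t == 'alt':
        return nullable(r[1]) or nullable(r[2])
    if t in ('and', 'seq'):
        return nullable(r[1]) and nullable(r[2])
    return not nullable(r[1])                        # 'not'

def deriv(r, c):
    t = r[0]
    if t in ('eps', 'nul'):
        return ('nul',)
    if t == 'chr':
        return ('eps',) if r[1] == c else ('nul',)
    if t == 'alt':
        return ('alt', deriv(r[1], c), deriv(r[2], c))
    if t == 'and':                                   # same shape as 'alt'
        return ('and', deriv(r[1], c), deriv(r[2], c))
    if t == 'not':                                   # complement commutes
        return ('not', deriv(r[1], c))
    if t == 'star':
        return ('seq', deriv(r[1], c), r)
    d = ('seq', deriv(r[1], c), r[2])                # 'seq'
    return ('alt', d, deriv(r[2], c)) if nullable(r[1]) else d

def matches(r, s):
    for ch in s:
        r = deriv(r, ch)
    return nullable(r)

SIGMA = ('alt', ('chr', 'a'), ('chr', 'b'))
ANY = ('star', SIGMA)
AA = ('seq', ANY, ('seq', ('chr', 'a'), ('seq', ('chr', 'a'), ANY)))
NO_AA = ('and', ANY, ('not', AA))  # strings over {a,b} with no "aa"
print(matches(NO_AA, 'abab'), matches(NO_AA, 'abaab'))
```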


2007 ◽  
Vol 17 (01) ◽  
pp. 141-154 ◽  
Author(s):  
J.-M. CHAMPARNAUD ◽  
F. OUARDI ◽  
D. ZIADI

There exist two well-known quotients of the position automaton of a regular expression. The first one, called the equation automaton, was first introduced by Mirkin from the notion of prebase and has been redefined by Antimirov from the notion of partial derivative. The second one, due to Ilie and Yu and called the follow automaton, can be obtained by eliminating ε-transitions in an ε-NFA that is always smaller than the classical ε-NFAs (Thompson, Sippu and Soisalon–Soininen). Ilie and Yu discussed the difficulty of achieving a theoretical comparison between the size of the follow automaton and the size of the equation automaton and concluded that experimental studies would very likely be necessary. In this paper we settle the theoretical question: we first define a set of regular expressions, called normalized expressions, such that every regular expression can be normalized in linear time, and then prove that the equation automaton of a normalized expression is always smaller than its follow automaton.
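
Antimirov's partial derivatives, which generate the equation automaton, can be sketched in a few lines: unlike a Brzozowski derivative, a partial derivative is a *set* of expressions, and the reachable expressions are the automaton's states. Illustrative tuple encoding; for (a|b)*a the construction reaches just two expressions, against four states in the position automaton.

```python
def nullable(r):
    t = r[0]
    if t in ('eps', 'star'):
        return True
    if t in ('chr', 'nul'):
        return False
    if t == 'alt':
        return nullable(r[1]) or nullable(r[2])
    return nullable(r[1]) and nullable(r[2])        # 'seq'

def cat(a, b):
    """Smart concatenation used when rebuilding partial derivatives."""
    return b if a == ('eps',) else ('seq', a, b)

def pd(r, c):
    """Antimirov partial derivative: a set of regexes."""
    t = r[0]
    if t in ('eps', 'nul'):
        return set()
    if t == 'chr':
        return {('eps',)} if r[1] == c else set()
    if t == 'alt':
        return pd(r[1], c) | pd(r[2], c)
    if t == 'star':
        return {cat(x, r) for x in pd(r[1], c)}
    out = {cat(x, r[2]) for x in pd(r[1], c)}       # 'seq'
    return out | pd(r[2], c) if nullable(r[1]) else out

def equation_states(r, alphabet):
    states, work = {r}, [r]
    while work:
        s = work.pop()
        for c in alphabet:
            for d in pd(s, c):
                if d not in states:
                    states.add(d)
                    work.append(d)
    return states

R = ('seq', ('star', ('alt', ('chr', 'a'), ('chr', 'b'))), ('chr', 'a'))  # (a|b)*a
print(len(equation_states(R, 'ab')))  # 2: (a|b)*a itself and the empty string
```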


2013 ◽  
Vol 7 (1) ◽  
pp. 46-50
Author(s):  
Linhai Cui ◽  
Yusen Qin ◽  
Fanyang Kong ◽  
Kaihong Yu

This paper presents an efficient method for Regular Expression Matching (REM) that reuses Intellectual Property (IP) cores in a new Network on Chip (NoC) architecture. The method is to design a reusable IP core consisting of many engine cells for REM and to implement each engine cell on a Field Programmable Gate Array (FPGA) as a prototype. To keep each Finite State Machine (FSM) simple, a new approach is proposed for partitioning a regular expression into several smaller parts. Each part of a regular expression is matched by one engine cell, and the engine cells communicate with one another through routers in the NoC topology. The proposed NoC architecture is a general-purpose design suitable for different rule libraries in deep packet inspection (DPI). It also addresses the problem of missed matches caused by character self-depletion. A way to use both the logic cells and the RAM available on FPGA devices is described, which makes it easier to change the rule library of regular expressions stored in RAM. Finally, implementation of the NoC architecture as an application-specific integrated circuit (ASIC) is discussed.
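
The partition-and-chain idea can be sketched in software: split the pattern into segments, give each segment to its own "engine", and forward match end positions downstream, as the routers would on the NoC. The segments and chaining policy here are hypothetical, with Python's re standing in for the hardware FSMs.

```python
import re

def chained_match(text, segments):
    """Each segment acts as one engine cell; end offsets flow downstream."""
    active = {0}                           # offsets where the first engine starts
    for seg in segments:
        engine = re.compile(seg)
        downstream = set()
        for start in active:
            m = engine.match(text, start)  # anchored match at this offset
            if m:
                downstream.add(m.end())    # activate the next engine here
        active = downstream
    return bool(active)

SEGMENTS = ["foo", "[0-9]+", "bar"]        # partition of foo[0-9]+bar
print(chained_match("foo123bar", SEGMENTS), chained_match("foobar", SEGMENTS))
```

Note that this sketch only forwards the single greedy end position per start offset; a faithful engine would forward every possible end position so that later segments cannot be starved by an earlier greedy one.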

