Regular expressions for language engineering

Many of the processing steps in natural language engineering can be performed using finite state transducers. An optimal way to create such transducers is to compile them from regular expressions. This paper is an introduction to the regular expression calculus, extended with certain operators that have proved very useful in natural language applications ranging from tokenization to light parsing. The examples in the paper illustrate in concrete detail some of these applications.

Finite-State Technology

10.1093/oxfordhb/9780199276349.013.0018 ◽

2012 ◽

Author(s):

Lauri Karttunen

Keyword(s):

Language Processing ◽

Regular Language ◽

Morphological Analysis ◽

Finite State Automata ◽

Regular Expressions ◽

Finite State Transducers ◽

Finite State ◽

Shallow Parsing ◽

Basic Concepts ◽

State Language

The article introduces the basic concepts of finite-state language processing: regular languages and relations, finite-state automata, and regular expressions. Many basic steps in language processing, ranging from tokenization to phonological and morphological analysis, disambiguation, spelling correction, and shallow parsing, can be performed efficiently by means of finite-state transducers. The article discusses examples of finite-state languages and relations. Finite-state networks can represent only a subset of all possible languages and relations; that is, only some languages are finite-state languages. Furthermore, the article introduces two types of complex regular expressions that have many linguistic applications: restriction and replacement. Finally, the article discusses the properties of finite-state automata. Three important properties of networks are that they are epsilon-free, deterministic, and minimal. If a network encodes a regular language and is epsilon-free, deterministic, and minimal, the network is guaranteed to be the best encoding for that language.
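To make the "minimal" property concrete, here is a small sketch of state merging by partition refinement (Moore's algorithm). The abstract does not say which minimization procedure the article uses, so this is an illustrative choice. The function name `minimize` and the dictionary encoding of the transition table are assumptions, and the input is assumed to be a complete, deterministic, epsilon-free automaton in which every state is reachable.

```python
def minimize(states, alphabet, delta, accepting):
    """Group equivalent states of a complete DFA (Moore's partition refinement).

    delta maps (state, symbol) -> state. Returns a dict state -> block id;
    states sharing a block id can be merged into one state of the minimal DFA.
    """
    # Initial partition: accepting vs. non-accepting states.
    block = {s: int(s in accepting) for s in states}
    while True:
        # A state's signature is its own block plus the blocks it moves to.
        sig = {
            s: (block[s],) + tuple(block[delta[s, a]] for a in alphabet)
            for s in states
        }
        ids = {}
        new_block = {s: ids.setdefault(sig[s], len(ids)) for s in states}
        # The partition only ever gets finer; stop once no block was split.
        if len(ids) == len(set(block.values())):
            return new_block
        block = new_block


# Hypothetical example: strings over {a, b} that end in "a".
# States 1 and 2 behave identically, so a minimal network needs two states.
states = [0, 1, 2]
alphabet = ["a", "b"]
delta = {
    (0, "a"): 1, (0, "b"): 0,
    (1, "a"): 2, (1, "b"): 0,
    (2, "a"): 1, (2, "b"): 0,
}
blocks = minimize(states, alphabet, delta, accepting={1, 2})
```

In this example `blocks` assigns states 1 and 2 to the same block, which is the sense in which the three-state network is not yet minimal.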

Regular relations for temporal propositions

Natural Language Engineering ◽

10.1017/s135132491100009x ◽

2011 ◽

Vol 17 (2) ◽

pp. 163-184 ◽

Cited By ~ 1

Author(s):

TIM FERNANDO

Keyword(s):

Natural Language ◽

Information Content ◽

Truth Conditions ◽

Finite State Transducers ◽

Finite State

Relations computed by finite-state transducers are applied to interpret temporal propositions in terms of strings representing finite contexts or situations. Carnap–Montague intensions mapping indices to extensions are reformulated as relations between strings that can serve as indices and extensions alike. Strings are related according to information content, temporal span and granularity, the bounds on which reflect the partiality of natural language statements. That partiality shapes not only strings-as-extensions (indicating what statements are about) but also strings-as-indices (underlying truth conditions).

Designing efficient algorithms for querying large corpora

Oslo Studies in Language ◽

10.5617/osla.8504 ◽

2021 ◽

Vol 11 (2) ◽

pp. 283-302

Author(s):

Paul Meurer

Keyword(s):

Regular Expression ◽

Linear Time ◽

Suffix Array ◽

Efficient Algorithms ◽

Regular Expressions ◽

Efficient Treatment ◽

Suffix Arrays ◽

Regular Expression Matching ◽

Finite State ◽

Query System

I describe several new efficient algorithms for querying large annotated corpora. The search algorithms as they are implemented in several popular corpus search engines are less than optimal in two respects: regular expression string matching in the lexicon is done in linear time, and regular expressions over corpus positions are evaluated starting in those corpus positions that match the constraints of the initial edges of the corresponding network. To address these shortcomings, I have developed an algorithm for regular expression matching on suffix arrays that allows fast lexicon lookup, and a technique for running finite state automata from edges with lowest corpus counts. The implementation of the lexicon as suffix array also lends itself to an elegant and efficient treatment of multi-valued and set-valued attributes. The described techniques have been implemented in a fully functional corpus management system and are also used in a treebank query system.
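As a rough illustration of why a suffix array avoids a linear scan of the lexicon, the sketch below handles only the simplest case, a literal substring query, using binary search. It is not the regular expression matching algorithm the abstract describes. The function names, the newline separator, and the naive construction are all assumptions made for the example.

```python
def build_suffix_array(words, sep="\n"):
    """Concatenate the lexicon and sort all suffix start positions.

    Naive construction for clarity; real systems use faster algorithms.
    Assumes no word contains the separator.
    """
    text = sep + sep.join(words) + sep
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    return text, sa


def _bound(text, sa, pattern, upper):
    """Binary search for the first suffix whose prefix is >= pattern
    (or > pattern when upper is True)."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        prefix = text[sa[mid]:sa[mid] + m]
        if prefix < pattern or (upper and prefix == pattern):
            lo = mid + 1
        else:
            hi = mid
    return lo


def words_containing(text, sa, pattern, sep="\n"):
    """Return the lexicon words that contain pattern as a substring."""
    start = _bound(text, sa, pattern, upper=False)
    end = _bound(text, sa, pattern, upper=True)
    found = set()
    for pos in sa[start:end]:
        left = text.rfind(sep, 0, pos) + 1   # start of the enclosing word
        right = text.find(sep, pos)          # end of the enclosing word
        found.add(text[left:right])
    return sorted(found)


# Hypothetical four-word lexicon.
text, sa = build_suffix_array(["banana", "bandana", "cabana", "apple"])
hits = words_containing(text, sa, "ana")
```

All suffixes that begin with the query sit in one contiguous run of the sorted array, so two binary searches locate every match without visiting the other entries.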

An Extendible Regular Expression Compiler for Finite-State Approaches in Natural Language Processing

Lecture Notes in Computer Science - Automata Implementation ◽

10.1007/3-540-45526-4_12 ◽

2001 ◽

pp. 122-139 ◽

Cited By ~ 8

Author(s):

Gertjan van Noord ◽

Dale Gerdemann

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Regular Expression ◽

Finite State

Applications of Finite-State Transducers in Natural Language Processing

Implementation and Application of Automata - Lecture Notes in Computer Science ◽

10.1007/3-540-44674-5_2 ◽

2001 ◽

pp. 34-46 ◽

Cited By ~ 6

Author(s):

Lauri Karttunen

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Finite State Transducers ◽

Finite State

ON REGULAR EXPRESSION HASHING TO REDUCE FA SIZE

International Journal of Foundations of Computer Science ◽

10.1142/s0129054109007042 ◽

2009 ◽

Vol 20 (06) ◽

pp. 1069-1086

Author(s):

WIKUS COETSER ◽

DERRICK G. KOURIE ◽

BRUCE W. WATSON

Keyword(s):

Hash Function ◽

Regular Expression ◽

Empirical Work ◽

Hash Functions ◽

Small Sample ◽

Finite State Automaton ◽

Regular Expressions ◽

Large Sample ◽

Finite State

The consequences of regular expression hashing as a means of finite state automaton reduction are explored, based on variations of Brzozowski's algorithm. In this approach, each hash collision results in the merging of the automaton's states, and it is subsequently shown that a super-automaton will always be constructed, regardless of the hash function used. Since direct adaptation of the classical Brzozowski algorithm leads to a non-deterministic super-automaton, a new algorithm is put forward for constructing a deterministic FA. Approaches are proposed for measuring the quality of a hash function. These ideas are empirically tested on a large sample of relatively small regular expressions and their associated automata, as well as on a small sample of relatively large regular expressions. Differences in the quality of tested hash functions are observed. Possible reasons for this are mentioned, but future empirical work is required to investigate the matter.

Sketch-Driven Regular Expression Generation from Natural Language and Examples

Transactions of the Association for Computational Linguistics ◽

10.1162/tacl_a_00339 ◽

2020 ◽

Vol 8 ◽

pp. 679-694

Author(s):

Xi Ye ◽

Qiaochu Chen ◽

Xinyu Wang ◽

Isil Dillig ◽

Greg Durrett

Keyword(s):

Natural Language ◽

Real World ◽

Regular Expression ◽

State Of The Art ◽

Neural Systems ◽

Regular Expressions ◽

Weak Supervision ◽

Stack Overflow ◽

Neural Maps ◽

The Given

Recent systems for converting natural language descriptions into regular expressions (regexes) have achieved some success, but typically deal with short, formulaic text and can only produce simple regexes. Real-world regexes are complex, hard to describe with brief sentences, and sometimes require examples to fully convey the user’s intent. We present a framework for regex synthesis in this setting where both natural language (NL) and examples are available. First, a semantic parser (either grammar-based or neural) maps the natural language description into an intermediate sketch, which is an incomplete regex containing holes to denote missing components. Then a program synthesizer searches over the regex space defined by the sketch and finds a regex that is consistent with the given string examples. Our semantic parser can be trained purely from weak supervision based on correctness of the synthesized regex, or it can leverage heuristically derived sketches. We evaluate on two prior datasets (Kushman and Barzilay 2013; Locascio et al. 2016) and a real-world dataset from Stack Overflow. Our system achieves state-of-the-art performance on the prior datasets and solves 57% of the real-world dataset, which existing neural systems completely fail on.
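The second stage described above, filling the holes of a sketch so that the result agrees with the examples, can be sketched as a brute-force search. This is only a toy illustration under stated assumptions: the `<HOLE>` marker, the `synthesize` function, and the fixed candidate list are all hypothetical, and the paper's synthesizer searches a far richer regex space than simple enumeration.

```python
import re
from itertools import product

HOLE = "<HOLE>"  # hypothetical marker for a missing component in the sketch


def synthesize(sketch, candidates, positives, negatives):
    """Try every way of filling the holes with candidate fragments and return
    the first regex that accepts all positives and rejects all negatives."""
    parts = sketch.split(HOLE)
    n_holes = len(parts) - 1
    for fill in product(candidates, repeat=n_holes):
        # Interleave the fixed parts of the sketch with the chosen fragments.
        pieces = [parts[0]]
        for fragment, fixed in zip(fill, parts[1:]):
            pieces.extend([fragment, fixed])
        regex = "".join(pieces)
        accepts_all = all(re.fullmatch(regex, s) for s in positives)
        rejects_all = not any(re.fullmatch(regex, s) for s in negatives)
        if accepts_all and rejects_all:
            return regex
    return None  # no completion of the sketch fits the examples


# Hypothetical sketch: "three of something, a dash, then one or more of something".
result = synthesize(
    sketch=HOLE + "{3}-" + HOLE + "+",
    candidates=["[a-z]", "[0-9]", "[A-Z]"],
    positives=["ABC-12", "XYZ-7"],
    negatives=["abc-12", "ABC-x"],
)
```

The sketch narrows the search to a handful of completions, and the examples then select among them, which mirrors the division of labour between the parser and the synthesizer in the abstract.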

Finite state methods in natural language processing

Natural Language Engineering ◽

10.1017/s1351324903003139 ◽

2003 ◽

Vol 9 (1) ◽

pp. 1-3 ◽

Cited By ~ 1

Author(s):

LAURI KARTTUNEN ◽

KIMMO KOSKENNIEMI ◽

GERTJAN VAN NOORD

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Computational Linguistics ◽

Language Processing ◽

Summer School ◽

Special Issue ◽

Language Engineering ◽

Finite State ◽

State Models ◽

Finite State Models

Finite state methods have been in common use in various areas of natural language processing (NLP) for many years. A series of specialized workshops in this area illustrates this. In 1996, András Kornai organized a very successful workshop entitled Extended Finite State Models of Language. One of the results of that workshop was a special issue of Natural Language Engineering (Volume 2, Number 4). In 1998, Kemal Oflazer organized a workshop called Finite State Methods in Natural Language Processing. A selection of submissions for this workshop were later included in a special issue of Computational Linguistics (Volume 26, Number 1). Inspired by these events, Lauri Karttunen, Kimmo Koskenniemi and Gertjan van Noord took the initiative for a workshop on finite state methods in NLP in Helsinki, as part of the European Summer School in Language, Logic and Information. As a related special event, the 20th anniversary of two-level morphology was celebrated. The appreciation of these events led us to believe that once again it should be possible, with some additional submissions, to compose an interesting special issue of this journal.

Regular-expression derivatives re-examined

Journal of Functional Programming ◽

10.1017/s0956796808007090 ◽

2009 ◽

Vol 19 (2) ◽

pp. 173-190 ◽

Cited By ~ 51

Author(s):

SCOTT OWENS ◽

JOHN REPPY ◽

AARON TURON

Keyword(s):

Regular Expression ◽

Finite State Machines ◽

Regular Expressions ◽

Functional Language ◽

State Machines ◽

Boolean Operations ◽

Traditional Algorithm ◽

Computer Scientists ◽

Finite State

Regular-expression derivatives are an old, but elegant, technique for compiling regular expressions to deterministic finite-state machines. It easily supports extending the regular-expression operators with boolean operations, such as intersection and complement. Unfortunately, this technique has been lost in the sands of time and few computer scientists are aware of it. In this paper, we reexamine regular-expression derivatives and report on our experiences in the context of two different functional-language implementations. The basic implementation is simple and we show how to extend it to handle large character sets (e.g., Unicode). We also show that the derivatives approach leads to smaller state machines than the traditional algorithm given by McNaughton and Yamada.
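A minimal sketch of the derivative idea follows, written in Python rather than the functional languages the paper uses. It matches a string by taking one derivative per character and does not build a state machine or simplify intermediate terms, both of which the compilation technique described above would need. The class and function names are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Empty:          # matches no string at all
    pass

@dataclass(frozen=True)
class Eps:            # matches only the empty string
    pass

@dataclass(frozen=True)
class Chr:            # matches the single character c
    c: str

@dataclass(frozen=True)
class Cat:            # concatenation
    a: object
    b: object

@dataclass(frozen=True)
class Alt:            # union
    a: object
    b: object

@dataclass(frozen=True)
class And:            # intersection
    a: object
    b: object

@dataclass(frozen=True)
class Not:            # complement
    a: object

@dataclass(frozen=True)
class Star:           # Kleene star
    a: object


def nullable(r):
    """True if r accepts the empty string."""
    if isinstance(r, (Eps, Star)):
        return True
    if isinstance(r, (Empty, Chr)):
        return False
    if isinstance(r, (Cat, And)):
        return nullable(r.a) and nullable(r.b)
    if isinstance(r, Alt):
        return nullable(r.a) or nullable(r.b)
    if isinstance(r, Not):
        return not nullable(r.a)
    raise TypeError(r)


def deriv(r, ch):
    """Regex for the strings s such that ch + s is matched by r."""
    if isinstance(r, (Empty, Eps)):
        return Empty()
    if isinstance(r, Chr):
        return Eps() if r.c == ch else Empty()
    if isinstance(r, Cat):
        left = Cat(deriv(r.a, ch), r.b)
        return Alt(left, deriv(r.b, ch)) if nullable(r.a) else left
    if isinstance(r, Alt):
        return Alt(deriv(r.a, ch), deriv(r.b, ch))
    if isinstance(r, And):
        return And(deriv(r.a, ch), deriv(r.b, ch))
    if isinstance(r, Not):
        return Not(deriv(r.a, ch))
    if isinstance(r, Star):
        return Cat(deriv(r.a, ch), r)
    raise TypeError(r)


def matches(r, s):
    """Take one derivative per input character, then test for nullability."""
    for ch in s:
        r = deriv(r, ch)
    return nullable(r)


ab_star = Star(Cat(Chr("a"), Chr("b")))   # (ab)*
```

Note that intersection and complement each take a single line in `deriv`, which is the ease of extension the abstract points to.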

Neural finite-state transducers: a bottom-up approach to natural language processing

IJCNN'99. International Joint Conference on Neural Networks. Proceedings (Cat. No.99CH36339) ◽

10.1109/ijcnn.1999.830873 ◽

2003 ◽

Author(s):

R. Pozarlik

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Bottom Up ◽

Finite State Transducers ◽

Finite State
