Application of Brzozowski Derivatives to JSON Schema Validation

Author(s):  
Mary Holstege

Brzozowski derivatives are a technique for deciding whether a string of symbols belongs to the language defined by an extended regular expression. They have been applied to content model validation in XML Schema, following the observation that a content model is an extended regular expression over symbols in the vocabulary described by the schema. This paper explores extending Brzozowski derivatives to the problem of model validation for JSON Schema. This application turns out to require "type-tagged" regular expressions, which also provide an interesting way of understanding certain matching problems beyond JSON Schema validation.
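
To make the technique concrete, here is a minimal sketch of Brzozowski-derivative matching for plain regular expressions (empty, epsilon, symbols, union, concatenation, Kleene star) in Python; the class and function names are illustrative, and the type-tagged extension discussed in the paper is not modeled.

```python
# Minimal Brzozowski-derivative matcher (illustrative sketch, not the paper's code).
from dataclasses import dataclass

class Re: pass

@dataclass(frozen=True)
class Empty(Re): pass          # matches nothing

@dataclass(frozen=True)
class Eps(Re): pass            # matches only the empty string

@dataclass(frozen=True)
class Sym(Re):
    c: str

@dataclass(frozen=True)
class Alt(Re):
    l: Re
    r: Re

@dataclass(frozen=True)
class Cat(Re):
    l: Re
    r: Re

@dataclass(frozen=True)
class Star(Re):
    r: Re

def nullable(r: Re) -> bool:
    """Does r accept the empty string?"""
    if isinstance(r, (Eps, Star)): return True
    if isinstance(r, (Empty, Sym)): return False
    if isinstance(r, Alt): return nullable(r.l) or nullable(r.r)
    if isinstance(r, Cat): return nullable(r.l) and nullable(r.r)
    raise TypeError(r)

def deriv(r: Re, a: str) -> Re:
    """Brzozowski derivative of r with respect to the symbol a."""
    if isinstance(r, (Empty, Eps)): return Empty()
    if isinstance(r, Sym): return Eps() if r.c == a else Empty()
    if isinstance(r, Alt): return Alt(deriv(r.l, a), deriv(r.r, a))
    if isinstance(r, Cat):
        d = Cat(deriv(r.l, a), r.r)
        return Alt(d, deriv(r.r, a)) if nullable(r.l) else d
    if isinstance(r, Star): return Cat(deriv(r.r, a), r)
    raise TypeError(r)

def matches(r: Re, s: str) -> bool:
    """Take the derivative symbol by symbol; accept if the residue is nullable."""
    for a in s:
        r = deriv(r, a)
    return nullable(r)

# (ab)*a matches "aba" but not "ab"
r = Cat(Star(Cat(Sym("a"), Sym("b"))), Sym("a"))
assert matches(r, "aba") and not matches(r, "ab")
```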

2018, Vol. 62 (9), pp. 1322-1341
Author(s):  
Zhiwu Xu ◽  
Ping Lu ◽  
Haiming Chen

Deterministic regular expressions are a core part of XML Schema and are used in other applications as well. Unlike ordinary regular expressions, however, they do not have a simple syntactic definition; instead, they are defined semantically. Moreover, not every regular expression can be rewritten into an equivalent deterministic regular expression. These properties place a burden on users who develop XML Schema definitions and work with deterministic regular expressions. In this paper, we propose a syntax for deterministic standard regular expressions (DREGs) and prove that this syntax is context-free. Based on the context-free grammars for DREGs, we design a generator that can produce DREGs at random and can be used in applications involving DREGs, e.g. benchmarking a validator for DTD or XML Schema, and inclusion checking of DTD and XML Schema. Experimental results demonstrate the efficiency and usefulness of the generator.
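
For readers unfamiliar with the semantic definition, the sketch below checks one-unambiguity (the usual notion of determinism) by computing Glushkov first and follow position sets and rejecting any expression in which two distinct positions carrying the same symbol compete; the AST and helper names are illustrative, and this is not the syntax or the generator proposed in the paper.

```python
# Illustrative check of regular expression determinism via Glushkov positions.
from dataclasses import dataclass
from itertools import count

@dataclass(eq=False)
class Sym:
    c: str
    pos: int = -1
@dataclass(eq=False)
class Alt:
    l: object
    r: object
@dataclass(eq=False)
class Cat:
    l: object
    r: object
@dataclass(eq=False)
class Star:
    r: object

def mark(r, ctr=None):
    """Assign a distinct position number to every symbol occurrence."""
    ctr = ctr or count()
    if isinstance(r, Sym): r.pos = next(ctr)
    elif isinstance(r, Star): mark(r.r, ctr)
    else: mark(r.l, ctr); mark(r.r, ctr)
    return r

def nullable(r):
    if isinstance(r, Sym): return False
    if isinstance(r, Star): return True
    if isinstance(r, Alt): return nullable(r.l) or nullable(r.r)
    return nullable(r.l) and nullable(r.r)

def first(r):
    if isinstance(r, Sym): return {r}
    if isinstance(r, Star): return first(r.r)
    if isinstance(r, Alt): return first(r.l) | first(r.r)
    return first(r.l) | (first(r.r) if nullable(r.l) else set())

def last(r):
    if isinstance(r, Sym): return {r}
    if isinstance(r, Star): return last(r.r)
    if isinstance(r, Alt): return last(r.l) | last(r.r)
    return last(r.r) | (last(r.l) if nullable(r.r) else set())

def follow(r, fol):
    """Populate fol[pos] = set of positions that may follow pos."""
    if isinstance(r, Cat):
        follow(r.l, fol); follow(r.r, fol)
        for p in last(r.l): fol.setdefault(p.pos, set()).update(first(r.r))
    elif isinstance(r, Star):
        follow(r.r, fol)
        for p in last(r.r): fol.setdefault(p.pos, set()).update(first(r.r))
    elif isinstance(r, Alt):
        follow(r.l, fol); follow(r.r, fol)
    return fol

def competing(positions):
    """True if two distinct positions in the set carry the same symbol."""
    seen = {}
    for p in positions:
        if p.c in seen and seen[p.c] != p.pos: return True
        seen[p.c] = p.pos
    return False

def is_deterministic(r):
    r = mark(r)
    fol = follow(r, {})
    return not competing(first(r)) and not any(competing(s) for s in fol.values())

# (a|b)*a is not deterministic; a(a|b)* is.
assert not is_deterministic(Cat(Star(Alt(Sym("a"), Sym("b"))), Sym("a")))
assert is_deterministic(Cat(Sym("a"), Star(Alt(Sym("a"), Sym("b")))))
```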


Author(s):  
Cyril Nicaud ◽  
Pablo Rotondo

In this article, we study some properties of random regular expressions of size [Formula: see text] when the cardinality of the alphabet also depends on [Formula: see text]. For this, we revisit and improve the classical Transfer Theorem from the field of analytic combinatorics. This provides precise estimates for the number of regular expressions, the probability of recognizing the empty word, and the expected number of Kleene stars in a random expression. For all these statistics, we show that there is a threshold as the size of the alphabet approaches [Formula: see text], at which point the leading term of the asymptotics starts oscillating.
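
A rough feel for such statistics can be obtained experimentally. The sketch below samples regular expressions of a fixed size uniformly at random using the standard counting method, then estimates the mean number of Kleene stars and the probability of accepting the empty word; the size measure (AST node count), the alphabet size and the grammar are assumptions for the demonstration, not the model analyzed in the article.

```python
# Uniform random generation of regular expressions by the counting method (sketch).
import random
from functools import lru_cache

K = 8           # alphabet size (assumption for the demo)
MAX = 60        # expression size = number of AST nodes

@lru_cache(maxsize=None)
def count(n: int) -> int:
    """Number of expressions with exactly n nodes: letters, *, union, concat."""
    if n <= 0: return 0
    if n == 1: return K
    total = count(n - 1)                       # star node over one subtree
    for i in range(1, n - 1):                  # binary nodes: union, concat
        total += 2 * count(i) * count(n - 1 - i)
    return total

def sample(n: int):
    """Draw an expression of size n uniformly at random, as a nested tuple."""
    if n == 1:
        return ("sym", random.randrange(K))
    r = random.randrange(count(n))
    if r < count(n - 1):
        return ("star", sample(n - 1))
    r -= count(n - 1)
    for op in ("alt", "cat"):
        for i in range(1, n - 1):
            w = count(i) * count(n - 1 - i)
            if r < w:
                return (op, sample(i), sample(n - 1 - i))
            r -= w
    raise AssertionError("unreachable")

def stars(e):
    """Number of Kleene stars in the expression."""
    return (e[0] == "star") + sum(stars(c) for c in e[1:] if isinstance(c, tuple))

def nullable(e):
    """Does the expression accept the empty word?"""
    if e[0] == "sym": return False
    if e[0] == "star": return True
    if e[0] == "alt": return nullable(e[1]) or nullable(e[2])
    return nullable(e[1]) and nullable(e[2])

samples = [sample(MAX) for _ in range(2000)]
print("mean #stars:", sum(map(stars, samples)) / len(samples))
print("P(empty word):", sum(map(nullable, samples)) / len(samples))
```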


2016, Vol. 27 (2), pp. 27-48
Author(s):  
András Benczúr ◽  
Gyula I. Szabó

This paper introduces a generalized database concept that unites the relational and semi-structured data models. As an important theoretical result, we give a quadratic decision algorithm for the implication problem of functional and join dependencies defined on the united data model. As a practical contribution, we present a normal form for the new data model as a tool for database design. With our novel representation of regular expressions, a more effective searching method can be developed.

XML elements are described by XML schema languages such as a DTD or an XML Schema definition. The instances of these elements are semi-structured tuples: ordered lists of (attribute: value) pairs. A semi-structured tuple may be viewed as a sentence of a formal language in which the values are the terminal symbols and the attribute names are the non-terminal symbols. In our earlier work (Szabó and Benczúr, 2015) we introduced the notion of the extended tuple as a sentence of a regular language generated by a grammar whose non-terminal symbols are the attribute names of the tuple; sets of extended tuples form extended relations. We then introduced the dual language, which generates the tuple types allowed to occur in extended relations, and defined functional dependencies over extended relations (regular FDs, or RFDs).

In this paper we rephrase the RFD concept by using regular expressions over attribute names directly to define extended tuples. With the help of a special vertex-labeled graph associated with a regular expression, the substring selection needed for the projection operation can be specified. Normalization for regular schemas is more complex than in the relational model, because the schema of an extended relation can contain an infinite number of tuple types. However, selection, projection and join operations can also be defined on extended relations, so a lossless-join decomposition can be performed. We extend our previous model to handle XML schema indicators, e.g. numerical constraints, and add line and set constructors in order to support more general projection and selection operators. The model establishes a query language with table-join functionality for collected XML element data.
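
As a toy illustration of extended tuples, the sketch below encodes attribute names as single letters so that a tuple conforms to a schema exactly when its attribute-name sequence matches an ordinary regular expression; this is only an informal analogue of the formal model, and the schema and projection helper are invented for the example.

```python
# Extended tuples as sentences of a regular language over attribute names (toy sketch).
import re

# Toy schema: a tuple is one 'a' (author), one or more 'b' (book), optional 'c' (city)
SCHEMA = re.compile(r"ab+c?")

def conforms(tuple_):
    """The tuple conforms iff its attribute-name sequence is in the schema language."""
    names = "".join(attr for attr, _ in tuple_)
    return SCHEMA.fullmatch(names) is not None

def project(tuple_, attrs):
    """Projection: keep only the (attribute, value) pairs for the given attributes."""
    return [(a, v) for a, v in tuple_ if a in attrs]

t = [("a", "Knuth"), ("b", "TAOCP vol 1"), ("b", "TAOCP vol 2"), ("c", "Stanford")]
print(conforms(t))                 # True: the name sequence "abbc" matches ab+c?
print(project(t, {"b"}))           # [('b', 'TAOCP vol 1'), ('b', 'TAOCP vol 2')]
```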


2009, Vol. 2009, pp. 1-10
Author(s):  
Yi-Hua E. Yang ◽  
Viktor K. Prasanna

We present a software toolchain for constructing large-scale regular expression matching (REM) on FPGA. The software automates the conversion of regular expressions into compact and high-performance nondeterministic finite automata (RE-NFA). Each RE-NFA is described as an RTL regular expression matching engine (REME) in VHDL for FPGA implementation. Assuming a fixed number of fan-out transitions per state, an n-state, m-bytes-per-cycle RE-NFA can be constructed in O(n×m) time and O(n×m) memory by our software. A large number of RE-NFAs are placed onto a two-dimensional staged pipeline, allowing scalability to thousands of RE-NFAs with linear area increase and little clock rate penalty due to scaling. On a PC with a 2 GHz Athlon64 processor and 2 GB memory, our prototype software constructs hundreds of RE-NFAs used by Snort in less than 10 seconds. We also designed a benchmark generator which can produce RE-NFAs with configurable pattern complexity parameters, including state count, state fan-in, and loop-back and feed-forward distances. Several regular expressions with various complexities are used to test the performance of our RE-NFA construction software.
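
In software terms, the per-state logic of such an engine resembles a bit-parallel NFA simulation in which every state bit is updated from its (bounded fan-in) predecessors on each input byte. The sketch below is a rough, purely illustrative analogue for one hand-built NFA processing one byte per step; it is not the authors' toolchain or hardware design.

```python
# Bit-vector NFA simulation: one bit per state, updated "in parallel" per input byte.
# States for the pattern a(b|c)*d: 0 = start, 1 = after 'a' (and after each b/c), 2 = accept.
step = {
    0: {"a": 1 << 1},
    1: {"b": 1 << 1, "c": 1 << 1, "d": 1 << 2},
    2: {},
}
ACCEPT = 1 << 2

def match(data: str) -> bool:
    """Full-string match: advance the active-state bit vector one byte at a time."""
    active = 1 << 0                       # only the start state is active
    for ch in data:
        nxt = 0
        for s in range(3):                # in hardware, all state bits update in parallel
            if active & (1 << s):
                nxt |= step[s].get(ch, 0)
        active = nxt
    return bool(active & ACCEPT)

print(match("abcbd"))   # True
print(match("abca"))    # False
```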


2012, Vol. 23 (05), pp. 969-984
Author(s):  
SABINE BRODA ◽  
ANTÓNIO MACHIAVELO ◽  
NELMA MOREIRA ◽  
ROGÉRIO REIS

In this paper, the relation between the Glushkov automaton and the partial derivative automaton of a given regular expression is studied in terms of transition complexity. The average transition complexity of the Glushkov automaton was proved by Nicaud to be linear in the size of the corresponding expression. This result was obtained using an upper bound on the number of transitions of the Glushkov automaton. Here we present a new quadratic construction of the Glushkov automaton that leads to a more elegant and straightforward implementation, and that allows the exact counting of the number of transitions. Based on that, a better estimate of the average size is presented. Asymptotically, and as the alphabet size grows, the number of transitions per state is on average 2. Broda et al. computed an upper bound for the ratio of the number of states of the partial derivative automaton to the number of states of the Glushkov automaton, which is about ½ for large alphabet sizes. Here we show how to obtain an upper bound for the number of transitions in the partial derivative automaton, which we then use to get an average-case approximation. In conclusion, asymptotically and for large alphabets, the size of the partial derivative automaton is half the size of the Glushkov automaton. This is corroborated by experiments, even for small alphabets and small regular expressions.
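
For readers who have not met them, the sketch below computes Antimirov partial derivatives, whose reachable expressions form the states of the partial derivative automaton; the expression encoding and the example are illustrative only, not the constructions analyzed in the paper.

```python
# Antimirov partial derivatives and the partial derivative automaton (sketch).
def nullable(r):
    t = r[0]
    if t in ("eps", "star"): return True
    if t == "sym": return False
    if t == "alt": return nullable(r[1]) or nullable(r[2])
    return nullable(r[1]) and nullable(r[2])     # cat

def cat(l, r):
    """Smart concatenation: drop an epsilon left factor."""
    return r if l == ("eps",) else ("cat", l, r)

def pd(r, a):
    """Partial derivative of r w.r.t. symbol a: a *set* of expressions."""
    t = r[0]
    if t == "eps": return frozenset()
    if t == "sym": return frozenset({("eps",)}) if r[1] == a else frozenset()
    if t == "alt": return pd(r[1], a) | pd(r[2], a)
    if t == "star": return frozenset(cat(p, r) for p in pd(r[1], a))
    left = frozenset(cat(p, r[2]) for p in pd(r[1], a))              # cat
    return left | pd(r[2], a) if nullable(r[1]) else left

def pd_automaton(r, alphabet):
    """Collect the states (expressions) and transitions of the PD automaton."""
    states, trans, todo = {r}, {}, [r]
    while todo:
        q = todo.pop()
        for a in alphabet:
            for p in pd(q, a):
                trans.setdefault((q, a), set()).add(p)
                if p not in states:
                    states.add(p); todo.append(p)
    return states, trans

# (a|b)*a: the PD automaton has 2 states, while the Glushkov automaton has 4.
r = ("cat", ("star", ("alt", ("sym", "a"), ("sym", "b"))), ("sym", "a"))
states, trans = pd_automaton(r, "ab")
print(len(states), sum(len(v) for v in trans.values()))   # 2 states, 3 transitions
```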


2021, Vol. 24 (3)
Author(s):  
Elton Cardoso ◽  
Maycon Amaro ◽  
Samuel Feitosa ◽  
Leonardo Reis ◽  
André Du Bois ◽  
...  

We describe the formalization of Brzozowski- and Antimirov-derivative-based algorithms for regular expression parsing in the dependently typed language Agda. The formalization produces a proof that either an input string matches a given regular expression or that no match exists. A tool for regular expression based search in the style of the well-known GNU grep has been developed with the certified algorithms, and practical experiments conducted with this tool are reported.


2002, Vol. 13 (01), pp. 99-113
Author(s):  
JEAN-MARC CHAMPARNAUD

The aim of this paper is to compare three efficient representations of the position automaton of a regular expression: the Thompson ε-automaton, the [Formula: see text]-structure, and the ℱ-structure, an optimization of the [Formula: see text]-structure. These representations are linear with respect to the size s of the expression, since their construction takes O(s) space and time, as does the computation of the set δ(X,a) of targets of the transitions on a from any subset X of states. The comparison is based on counting the edges of the underlying graphs that are respectively created by the construction step or visited by the computation of a set δ(X,a).
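
The measured operation can be sketched as follows for the Thompson representation: with explicit ε-edges and labeled edges (encoded here as plain dictionaries, an assumption for the demo), δ(X,a) is computed by taking the ε-closure of X, stepping on the symbol a, and closing again.

```python
# Computing delta(X, a) on an epsilon-NFA such as the Thompson automaton (sketch).
def eps_closure(states, eps_edges):
    """All states reachable from `states` through epsilon edges."""
    stack, seen = list(states), set(states)
    while stack:
        q = stack.pop()
        for p in eps_edges.get(q, ()):
            if p not in seen:
                seen.add(p); stack.append(p)
    return seen

def delta(X, a, edges, eps_edges):
    """Targets of the transitions on symbol a from any state of the subset X."""
    reach = eps_closure(X, eps_edges)
    targets = {p for q in reach for (sym, p) in edges.get(q, ()) if sym == a}
    return eps_closure(targets, eps_edges)

# Thompson-style automaton for a(b|c):
# 0 -a-> 1, 1 -eps-> 2,4, 2 -b-> 3, 4 -c-> 5, 3 -eps-> 6, 5 -eps-> 6
edges = {0: [("a", 1)], 2: [("b", 3)], 4: [("c", 5)]}
eps_edges = {1: [2, 4], 3: [6], 5: [6]}
print(delta({0}, "a", edges, eps_edges))                              # {1, 2, 4}
print(delta(delta({0}, "a", edges, eps_edges), "b", edges, eps_edges))  # {3, 6}
```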


1996, Vol. 2 (4), pp. 305-328
Author(s):  
L. KARTTUNEN ◽  
J-P. CHANOD ◽  
G. GREFENSTETTE ◽  
A. SCHILLER

Many of the processing steps in natural language engineering can be performed using finite state transducers. An optimal way to create such transducers is to compile them from regular expressions. This paper is an introduction to the regular expression calculus, extended with certain operators that have proved very useful in natural language applications ranging from tokenization to light parsing. The examples in the paper illustrate in concrete detail some of these applications.
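
As a small taste of this style of processing, the sketch below compiles a tokenizer from regular expressions using Python's re module; the replacement and composition operators of the transducer calculus described in the paper go well beyond what re offers, and the token classes here are invented for the example.

```python
# A toy regex-compiled tokenizer; illustrative only, not the paper's FST calculus.
import re

TOKEN = re.compile(r"""
      (?P<NUMBER>\d+(?:\.\d+)?)     # integers and decimals
    | (?P<WORD>\w+(?:-\w+)*)        # words, possibly hyphenated
    | (?P<PUNCT>[.,;:!?])           # sentence punctuation
    | (?P<SPACE>\s+)                # whitespace, skipped below
""", re.VERBOSE)

def tokenize(text):
    """Yield (token_class, surface_form) pairs, dropping whitespace."""
    for m in TOKEN.finditer(text):
        if m.lastgroup != "SPACE":
            yield (m.lastgroup, m.group())

print(list(tokenize("Finite-state tools cost 3.50 euros, roughly.")))
```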


2006, Vol. 16 (6), pp. 711-750
Author(s):  
HARUO HOSOYA

XML data are described by types involving regular expressions. This raises the question of what language feature is convenient for manipulating such data. Previously, we gave an answer to this question by proposing regular expression pattern matching. However, since that construct is derived from ML pattern matching, it has no iteration functionality of its own, which makes it cumbersome to process data typed by Kleene stars. In this paper, we propose a novel programming feature, regular expression filters. This construct extends the previous proposal by permitting pattern clauses to be closed under arbitrary regular expression operators. This yields many convenient programming idioms, such as non-uniform processing of sequences and almost-copying of trees. We further develop a type inference mechanism that obtains (1) types for pattern variables that are locally precise with respect to the type of input values and (2) a type for the result of the whole filter expression that is also locally precise with respect to the types of the body expressions. We discuss how our construct is useful in the practice of XML processing and, in particular, how our type inference is crucial for avoiding changes to programs when the types of the data to be processed evolve frequently.
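
A loose analogue of the non-uniform-processing idiom, without any of the typing guarantees, is clause-per-shape dispatch over the elements of a sequence; the sketch below (plain Python structural matching, with made-up element shapes) only hints at the programming style that the paper's regular expression filters support declaratively and with static types.

```python
# Non-uniform processing of a star-typed sequence, loosely sketched in Python.
def summarize(entries):
    """entries has the shape (('name', str) | ('email', str) | ('tel', str))*"""
    out = []
    for entry in entries:
        match entry:
            case ("name", n):
                out.append(f"person {n}")
            case ("email", e):
                out.append(f"  mail: {e}")
            case ("tel", t):
                out.append(f"  tel:  {t}")
            case _:
                raise ValueError(f"unexpected element {entry!r}")
    return "\n".join(out)

print(summarize([("name", "Ada"), ("email", "ada@example.org"),
                 ("name", "Alan"), ("tel", "555-0100")]))
```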


1985, Vol. 8 (3-4), pp. 379-396
Author(s):  
Jerzy Wojciechowski

In this paper the notion of regular expression for finite automata on transfinite sequences (TF-automata) is introduced. A characterization theorem for TF-automata is proved. From this theorem we conclude the decidability of the emptiness problem for TF-automata and a characterization theorem for finite automata on transfinite sequences of bounded length.

