Towards an Effective Syntax and a Generator for Deterministic Standard Regular Expressions

Deterministic regular expressions are a core part of XML Schema and are used in other applications. Unlike general regular expressions, however, deterministic regular expressions have no simple syntax; they are defined semantically. Moreover, not every regular expression can be rewritten as an equivalent deterministic regular expression. These properties make it harder for users to develop XML Schema Definitions and to use deterministic regular expressions. In this paper, we propose a syntax for deterministic standard regular expressions (DREGs) and prove that the syntax of DREGs is context-free. Based on the context-free grammars for DREGs, we further design a generator that produces DREGs randomly and can be used in applications associated with DREGs, e.g. benchmarking a validator for DTD or XML Schema, and inclusion checking of DTD and XML Schema. Experimental results demonstrate the efficiency and usefulness of the generator.
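To make the "semantic" definition concrete, the following is a minimal sketch of the classical determinism test for standard regular expressions: an expression is deterministic exactly when its Glushkov (position) automaton is deterministic, that is, when no first or follow set contains two positions carrying the same symbol. This is not the grammar-based syntax proposed in the paper; the tuple encoding and the names `analyse` and `is_deterministic` are assumptions made for illustration only.

```python
# Expressions are nested tuples (a hypothetical encoding, not from the paper):
#   ("sym", "a"), ("alt", r, s), ("cat", r, s), ("star", r)

def analyse(node, symbols, follow):
    """Return (nullable, first, last) for node and fill in the follow sets.

    Every symbol occurrence gets its own position: symbols[pos] is its label
    and follow[pos] is the set of positions that may come directly after it.
    """
    kind = node[0]
    if kind == "sym":
        pos = len(symbols)
        symbols.append(node[1])
        follow.append(set())
        return False, {pos}, {pos}
    if kind == "alt":
        n1, f1, l1 = analyse(node[1], symbols, follow)
        n2, f2, l2 = analyse(node[2], symbols, follow)
        return n1 or n2, f1 | f2, l1 | l2
    if kind == "cat":
        n1, f1, l1 = analyse(node[1], symbols, follow)
        n2, f2, l2 = analyse(node[2], symbols, follow)
        for p in l1:                      # end of left part flows into right part
            follow[p] |= f2
        first = (f1 | f2) if n1 else f1
        last = (l1 | l2) if n2 else l2
        return n1 and n2, first, last
    if kind == "star":
        _, f, l = analyse(node[1], symbols, follow)
        for p in l:                       # loop back to the start of the body
            follow[p] |= f
        return True, f, l
    raise ValueError(f"unknown node kind: {kind!r}")


def is_deterministic(expr):
    """True if no first/follow set holds two positions with the same symbol."""
    symbols, follow = [], []
    _, first, _ = analyse(expr, symbols, follow)
    for positions in [first, *follow]:
        labels = [symbols[p] for p in positions]
        if len(labels) != len(set(labels)):
            return False
    return True


a, b = ("sym", "a"), ("sym", "b")
ambiguous = ("cat", ("star", ("alt", a, b)), a)        # (a|b)*a
b_star_a = ("cat", ("star", b), a)                     # b*a
rewritten = ("cat", b_star_a, ("star", b_star_a))      # b*a(b*a)*
print(is_deterministic(ambiguous))   # False
print(is_deterministic(rewritten))   # True
```

The two expressions denote the same language (all strings ending in `a`), but only the second is deterministic, which illustrates why determinism is a property of the expression rather than of the language.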

Regular expressions and context-free grammars for picture languages

Lecture Notes in Computer Science - STACS 97, 1997, pp. 283-294
DOI: 10.1007/bfb0023466
Cited By ~ 24

Author(s): Oliver Matz

Keyword(s): Regular Expressions, Picture Languages, Context Free, Context Free Grammars

Context-Free Grammars for Deterministic Regular Expressions with Interleaving

Theoretical Aspects of Computing – ICTAC 2019 - Lecture Notes in Computer Science, 2019, pp. 235-252
DOI: 10.1007/978-3-030-32505-3_14

Author(s): Xiaoying Mou, Haiming Chen, Yeting Li

Keyword(s): Regular Expressions, Context Free, Context Free Grammars

Application of Brzozowski Derivatives to JSON Schema Validation

Proceedings of Balisage: The Markup Conference 2019, 2019
DOI: 10.4242/balisagevol23.holstege01
Cited By ~ 1

Author(s): Mary Holstege

Keyword(s): Model Validation, Regular Expression, XML Schema, Regular Expressions, Matching Problems, Content Model

Brzozowski derivatives are a technique for computing whether a string of symbols is in the language defined by an extended regular expression. They have been applied to content model validation in XML Schema, following the observation that a content model is an extended regular expression over symbols in the vocabulary described by the schema. This paper explores using an extension of Brzozowski derivatives to the problem of model validation for JSON Schema. It turns out that this application requires extending to "type-tagged" regular expressions, which provide an interesting way of understanding certain matching problems outside of the problem of JSON Schema validation.
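As an illustration of the basic technique only, here is a minimal sketch of a Brzozowski-derivative matcher for extended regular expressions (with intersection and complement) over element names. It does not implement the paper's "type-tagged" extension or any JSON Schema semantics; the tuple encoding and the names `nullable`, `derive`, and `matches` are assumptions chosen for this example, and no simplification of intermediate expressions is attempted.

```python
# Expressions are nested tuples (a hypothetical encoding):
#   EMPTY (matches nothing), EPS (matches the empty sequence),
#   ("sym", name), ("alt", r, s), ("cat", r, s), ("star", r),
#   ("and", r, s) for intersection, ("not", r) for complement.
EMPTY = ("empty",)
EPS = ("eps",)


def nullable(r):
    """True if r accepts the empty sequence."""
    kind = r[0]
    if kind in ("empty", "sym"):
        return False
    if kind in ("eps", "star"):
        return True
    if kind == "alt":
        return nullable(r[1]) or nullable(r[2])
    if kind in ("cat", "and"):
        return nullable(r[1]) and nullable(r[2])
    if kind == "not":
        return not nullable(r[1])
    raise ValueError(f"unknown node kind: {kind!r}")


def derive(r, c):
    """Derivative of r with respect to symbol c: what may follow after c."""
    kind = r[0]
    if kind in ("empty", "eps"):
        return EMPTY
    if kind == "sym":
        return EPS if r[1] == c else EMPTY
    if kind in ("alt", "and"):
        return (kind, derive(r[1], c), derive(r[2], c))
    if kind == "not":
        return ("not", derive(r[1], c))
    if kind == "cat":
        left = ("cat", derive(r[1], c), r[2])
        # If the left part can be empty, c may also start the right part.
        return ("alt", left, derive(r[2], c)) if nullable(r[1]) else left
    if kind == "star":
        return ("cat", derive(r[1], c), r)
    raise ValueError(f"unknown node kind: {kind!r}")


def matches(r, symbols):
    """Derive by each symbol in turn, then check whether empty is accepted."""
    for c in symbols:
        r = derive(r, c)
    return nullable(r)


# A content model in the XML Schema spirit: title, (para | note)*
model = ("cat", ("sym", "title"),
         ("star", ("alt", ("sym", "para"), ("sym", "note"))))
print(matches(model, ["title", "para", "note"]))  # True
print(matches(model, ["para"]))                   # False
```

Because the matcher works on arbitrary hashable symbols rather than characters, the same code applies to sequences of element names, which is the view of a content model that the abstract describes.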

FOREST: An Interactive Multi-tree Synthesizer for Regular Expressions

Tools and Algorithms for the Construction and Analysis of Systems - Lecture Notes in Computer Science, 2021, pp. 152-169
DOI: 10.1007/978-3-030-72016-2_9

Author(s): Margarida Ferreira, Miguel Terra-Neves, Miguel Ventura, Inês Lynce, Ruben Martins

Keyword(s): Real World, Regular Expression, User Interaction, Search Space, Satisfiability Modulo Theories, Experimental Results, Divide And Conquer, Regular Expressions, SMT Solver, Synthesis Procedure

Form validators based on regular expressions are often used on digital forms to prevent users from inserting data in the wrong format. However, writing these validators can pose a challenge to some users. We present Forest, a regular expression synthesizer for digital form validations. Forest produces a regular expression that matches the desired pattern for the input values and a set of conditions over capturing groups that ensure the validity of integer values in the input. Our synthesis procedure is based on enumerative search and uses a Satisfiability Modulo Theories (SMT) solver to explore and prune the search space. We propose a novel representation for regular expression synthesis, multi-tree, which induces patterns in the examples and uses them to split the problem through a divide-and-conquer approach. We also present a new SMT encoding to synthesize capture conditions for a given regular expression. To increase confidence in the synthesized regular expression, we implement user interaction based on distinguishing inputs. We evaluated Forest on real-world form-validation instances using regular expressions. Experimental results show that Forest successfully returns the desired regular expression in 70% of the instances and outperforms Regel, a state-of-the-art regular expression synthesizer.

Regular Expression Learning from Positive Examples Based on Integer Programming

International Journal of Software Engineering and Knowledge Engineering, 2020, Vol 30 (10), pp. 1443-1479
DOI: 10.1142/s0218194020400203

Author(s): Juntao Gao, Yingqian Zhang

Keyword(s): Integer Programming, Regular Expression, Real Life, Optimal Solution, Linear Program, Experimental Results, Integer Linear Program, Regular Expressions, Novel Method, Learning From Positive Examples

This paper presents a novel method to infer regular expressions from positive examples. The method consists of a candidate construction phase and an optimization phase. We first propose multiscaling sample augmentation to capture the cycle patterns from single examples during the candidate construction phase. We then use common substrings to build regular expressions that capture patterns across multiple examples, and we show that this algorithm is more general than those based on common prefixes or suffixes. Furthermore, we propose a pruning mechanism to improve the efficiency of mining useful common substrings, which is an important part of the common-substring-based expression-building algorithm. Finally, in the optimization phase, we model the problem of choosing a set of regular expressions with the lowest cost as an integer linear program, which can be solved to obtain the optimal solution. The experimental results on synthetic and real-life samples demonstrate the effectiveness of our approach in inferring concise and semantically meaningful regular expressions for string datasets.
