Towards an Effective Syntax and a Generator for Deterministic Standard Regular Expressions

2018 ◽  
Vol 62 (9) ◽  
pp. 1322-1341
Author(s):  
Zhiwu Xu ◽  
Ping Lu ◽  
Haiming Chen

Abstract: Deterministic regular expressions are a core part of XML Schema and are used in other applications. Unlike ordinary regular expressions, however, deterministic regular expressions do not have a simple syntax; instead, they are defined semantically. Moreover, not every regular expression can be rewritten to an equivalent deterministic regular expression. These properties place a burden on users who develop XML Schema Definitions or otherwise work with deterministic regular expressions. In this paper, we propose a syntax for deterministic standard regular expressions (DREGs) and prove that the syntax of DREGs is context-free. Based on the context-free grammars for DREGs, we further design a generator that produces DREGs randomly and can be used in applications associated with DREGs, e.g. benchmarking a validator for DTD or XML Schema, and inclusion checking of DTD and XML Schema. Experimental results demonstrate the efficiency and usefulness of the generator.
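To make the semantic definition concrete, the following is a minimal sketch of the usual position-based (Glushkov-style) determinism test, not the grammar-based syntax the paper proposes. It assumes a hypothetical tuple encoding of expressions, `("sym", a)`, `("cat", r, s)`, `("alt", r, s)` and `("star", r)`, and reports an expression as deterministic when no two distinct positions with the same symbol compete in the first set or in any follow set.

```python
def is_deterministic(regex):
    """Position-based determinism test on a tuple-encoded regex (illustrative encoding)."""
    symbols = []   # symbols[i] is the alphabet symbol at position i
    follow = {}    # position -> set of positions that may come directly after it

    def walk(node):
        """Return (nullable, first, last) for node and fill in follow."""
        kind = node[0]
        if kind == "sym":
            pos = len(symbols)
            symbols.append(node[1])
            follow[pos] = set()
            return False, {pos}, {pos}
        if kind == "star":
            _, first, last = walk(node[1])
            for p in last:
                follow[p] |= first      # the loop may restart after any last position
            return True, first, last
        n1, f1, l1 = walk(node[1])
        n2, f2, l2 = walk(node[2])
        if kind == "alt":
            return n1 or n2, f1 | f2, l1 | l2
        if kind == "cat":
            for p in l1:
                follow[p] |= f2         # the right operand starts after the left one ends
            return (n1 and n2,
                    f1 | f2 if n1 else f1,
                    l1 | l2 if n2 else l2)
        raise ValueError(f"unknown node kind: {kind!r}")

    def no_clash(positions):
        # Deterministic only if each symbol labels at most one competing position.
        seen = [symbols[p] for p in positions]
        return len(seen) == len(set(seen))

    _, first, _ = walk(regex)
    return no_clash(first) and all(no_clash(f) for f in follow.values())


a, b = ("sym", "a"), ("sym", "b")
# (a|b)*a : after reading an "a" it is unclear which "a" position was matched.
nondeterministic = ("cat", ("star", ("alt", a, b)), a)
# b*a(b*a)* : an equivalent expression in which every next symbol is unambiguous.
deterministic = ("cat", ("cat", ("star", b), a), ("star", ("cat", ("star", b), a)))
```

The two example expressions describe the same language, which illustrates why the property belongs to the expression rather than to the language. The test checks a given expression only; it does not decide whether an equivalent deterministic expression exists.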

Author(s):  
Mary Holstege

Brzozowski derivatives are a technique for computing whether a string of symbols is in the language defined by an extended regular expression. They have been applied to content model validation in XML Schema, following the observation that a content model is an extended regular expression over symbols in the vocabulary described by the schema. This paper explores applying an extension of Brzozowski derivatives to the problem of model validation for JSON Schema. It turns out that this application requires extending the technique to "type-tagged" regular expressions, which provide an interesting way of understanding certain matching problems beyond JSON Schema validation.
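As an illustration of the basic technique, here is a minimal sketch of derivative-based matching. It covers only the core operators (empty set, empty string, symbol, concatenation, alternation, star), not the extended operators or the type-tagged expressions discussed in the paper, and the class names are illustrative. No simplification of intermediate expressions is attempted, so derivatives grow with the input length.

```python
from dataclasses import dataclass


class Re:
    """Base class for regular expression nodes."""


@dataclass(frozen=True)
class Empty(Re):      # matches no string at all
    pass


@dataclass(frozen=True)
class Eps(Re):        # matches only the empty string
    pass


@dataclass(frozen=True)
class Sym(Re):        # matches exactly one symbol
    c: str


@dataclass(frozen=True)
class Cat(Re):        # concatenation
    l: Re
    r: Re


@dataclass(frozen=True)
class Alt(Re):        # alternation
    l: Re
    r: Re


@dataclass(frozen=True)
class Star(Re):       # Kleene star
    r: Re


def nullable(r):
    """True if r accepts the empty string."""
    if isinstance(r, (Eps, Star)):
        return True
    if isinstance(r, (Empty, Sym)):
        return False
    if isinstance(r, Cat):
        return nullable(r.l) and nullable(r.r)
    if isinstance(r, Alt):
        return nullable(r.l) or nullable(r.r)
    raise TypeError(r)


def deriv(r, a):
    """Derivative of r with respect to symbol a: what may still follow after reading a."""
    if isinstance(r, (Empty, Eps)):
        return Empty()
    if isinstance(r, Sym):
        return Eps() if r.c == a else Empty()
    if isinstance(r, Alt):
        return Alt(deriv(r.l, a), deriv(r.r, a))
    if isinstance(r, Star):
        return Cat(deriv(r.r, a), r)
    if isinstance(r, Cat):
        left = Cat(deriv(r.l, a), r.r)
        # If the left part can be empty, a may also be consumed by the right part.
        return Alt(left, deriv(r.r, a)) if nullable(r.l) else left
    raise TypeError(r)


def matches(r, symbols):
    """Take one derivative per input symbol, then test for nullability."""
    for a in symbols:
        r = deriv(r, a)
    return nullable(r)


# A content-model-like expression: (a b*) | c
model = Alt(Cat(Sym("a"), Star(Sym("b"))), Sym("c"))
```

Because `matches` accepts any iterable of symbols, the same sketch works for a list of element names such as `["title", "para", "para"]` as well as for a string of characters.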


Author(s):  
Margarida Ferreira ◽  
Miguel Terra-Neves ◽  
Miguel Ventura ◽  
Inês Lynce ◽  
Ruben Martins

Abstract: Form validators based on regular expressions are often used on digital forms to prevent users from inserting data in the wrong format. However, writing these validators can pose a challenge to some users. We present Forest, a regular expression synthesizer for digital form validations. Forest produces a regular expression that matches the desired pattern for the input values and a set of conditions over capturing groups that ensure the validity of integer values in the input. Our synthesis procedure is based on enumerative search and uses a Satisfiability Modulo Theories (SMT) solver to explore and prune the search space. We propose a novel representation for regular expression synthesis, multi-tree, which induces patterns in the examples and uses them to split the problem through a divide-and-conquer approach. We also present a new SMT encoding to synthesize capture conditions for a given regular expression. To increase confidence in the synthesized regular expression, we implement user interaction based on distinguishing inputs. We evaluated Forest on real-world form-validation instances using regular expressions. Experimental results show that Forest successfully returns the desired regular expression in 70% of the instances and outperforms Regel, a state-of-the-art regular expression synthesizer.
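The kind of validator described above, a regular expression plus integer conditions over its capturing groups, can be sketched as follows. This only illustrates the shape of such an output for a hypothetical date field; the pattern, the bounds and the names `DATE`, `CONDITIONS` and `is_valid` are assumptions made for the example and are not produced by Forest, and the synthesis procedure itself is not shown.

```python
import re

# Hypothetical validator for a dd/mm/yyyy form field.
DATE = re.compile(r"([0-9]{2})/([0-9]{2})/([0-9]{4})")

# Hypothetical capture conditions: group index -> (lower bound, upper bound).
CONDITIONS = {1: (1, 31), 2: (1, 12)}


def is_valid(text):
    """Accept text only if it matches the pattern and every capture condition holds."""
    m = DATE.fullmatch(text)
    if m is None:
        return False
    return all(lo <= int(m.group(g)) <= hi for g, (lo, hi) in CONDITIONS.items())
```

The regular expression alone would accept `32/08/1996`; the capture conditions are what reject it, which is why the two parts are kept separate.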


Author(s):  
Juntao Gao ◽  
Yingqian Zhang

This paper presents a novel method to infer regular expressions from positive examples. The method consists of a candidate construction phase and an optimization phase. In the candidate construction phase, we first propose multiscaling sample augmentation to capture the cycle patterns within single examples. We then use common substrings to build regular expressions that capture patterns across multiple examples, and we show that this algorithm is more general than those based on common prefixes or suffixes. Furthermore, we propose a pruning mechanism to improve the efficiency of mining useful common substrings, which is an important part of the common-substring-based expression-building algorithm. Finally, in the optimization phase, we model the problem of choosing a set of regular expressions with the lowest cost as an integer linear program, which can be solved to obtain the optimal solution. The experimental results on synthetic and real-life samples demonstrate the effectiveness of our approach in inferring concise and semantically meaningful regular expressions for string datasets.
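The advantage of common substrings over common prefixes or suffixes can be seen in a much-simplified sketch. The code below is not the paper's algorithm: it finds a single longest common substring by brute force and generalizes everything around it to `.*`, with no sample augmentation, pruning or cost optimization. The function names and the example strings are illustrative assumptions.

```python
import re


def longest_common_substring(examples):
    """Brute-force search for the longest string contained in every example."""
    shortest = min(examples, key=len)
    for size in range(len(shortest), 0, -1):
        for start in range(len(shortest) - size + 1):
            candidate = shortest[start:start + size]
            if all(candidate in e for e in examples):
                return candidate
    return ""


def infer_pattern(examples):
    """Anchor on the shared core and generalize the rest (a deliberately crude choice)."""
    core = longest_common_substring(examples)
    return ".*" + re.escape(core) + ".*"


# These examples share neither a prefix nor a suffix, but they do share a middle part.
examples = ["id-42.log", "x-42.log.bak", "tmp-42.log"]
pattern = infer_pattern(examples)
```

Here a prefix-based or suffix-based method finds nothing to anchor on, while the substring-based sketch recovers the shared `-42.log` core.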

