Sketch-Driven Regular Expression Generation from Natural Language and Examples

Recent systems for converting natural language descriptions into regular expressions (regexes) have achieved some success, but typically deal with short, formulaic text and can only produce simple regexes. Real-world regexes are complex, hard to describe with brief sentences, and sometimes require examples to fully convey the user’s intent. We present a framework for regex synthesis in this setting where both natural language (NL) and examples are available. First, a semantic parser (either grammar-based or neural) maps the natural language description into an intermediate sketch, which is an incomplete regex containing holes to denote missing components. Then a program synthesizer searches over the regex space defined by the sketch and finds a regex that is consistent with the given string examples. Our semantic parser can be trained purely from weak supervision based on correctness of the synthesized regex, or it can leverage heuristically derived sketches. We evaluate on two prior datasets (Kushman and Barzilay 2013; Locascio et al. 2016) and a real-world dataset from Stack Overflow. Our system achieves state-of-the-art performance on the prior datasets and solves 57% of the real-world dataset, which existing neural systems completely fail on.
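
The sketch-then-search idea in this abstract can be illustrated with a minimal Python sketch. It is not the paper's synthesizer: the hole marker `<HOLE>`, the component pool `COMPONENTS`, and the brute-force enumeration are all assumptions made for illustration, and the semantic parser that would produce the sketch is left out entirely.

```python
import itertools
import re

# Hypothetical pool of regex fragments used to fill holes (an assumption,
# not the component set used in the paper).
COMPONENTS = ["[0-9]", "[a-z]", "[A-Z]", "[0-9]+", "[a-z]+", "[A-Z]+", "-", "\\."]

HOLE = "<HOLE>"  # hypothetical marker for a missing component in a sketch


def consistent(regex, positives, negatives):
    """True if regex fully matches every positive and no negative example."""
    pattern = re.compile(regex)
    return (all(pattern.fullmatch(s) for s in positives)
            and not any(pattern.fullmatch(s) for s in negatives))


def synthesize(sketch, positives, negatives):
    """Enumerate hole fillings; return the first consistent regex, or None."""
    parts = sketch.split(HOLE)
    n_holes = len(parts) - 1
    for filling in itertools.product(COMPONENTS, repeat=n_holes):
        # Interleave the fixed parts of the sketch with the chosen fragments.
        candidate = parts[0] + "".join(f + p for f, p in zip(filling, parts[1:]))
        if consistent(candidate, positives, negatives):
            return candidate
    return None


result = synthesize("<HOLE>-<HOLE>",
                    positives=["abc-123", "x-7"],
                    negatives=["abc123", "ABC-1", "abc-"])
print(result)
```

Here the sketch fixes the literal hyphen and the examples decide what goes on either side of it. A real synthesizer would search a much larger space and prune it instead of enumerating every filling.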

Large-scale Semantic Parsing without Question-Answer Pairs

Transactions of the Association for Computational Linguistics ◽

10.1162/tacl_a_00190 ◽

2014 ◽

Vol 2 ◽

pp. 377-392 ◽

Cited By ~ 40

Author(s):

Siva Reddy ◽

Mirella Lapata ◽

Mark Steedman

Keyword(s):

Natural Language ◽

Large Scale ◽

Graph Matching ◽

State Of The Art ◽

The State ◽

Semantic Parsing ◽

Matching Problem ◽

Weak Supervision ◽

Benchmark Datasets

In this paper we introduce a novel semantic parsing approach to query Freebase in natural language without requiring manual annotations or question-answer pairs. Our key insight is to represent natural language via semantic graphs whose topology shares many commonalities with Freebase. Given this representation, we conceptualize semantic parsing as a graph matching problem. Our model converts sentences to semantic graphs using CCG and subsequently grounds them to Freebase guided by denotations as a form of weak supervision. Evaluation experiments on a subset of the Free917 and WebQuestions benchmark datasets show our semantic parser improves over the state of the art.

Regular expressions for language engineering

Natural Language Engineering ◽

10.1017/s1351324997001563 ◽

1996 ◽

Vol 2 (4) ◽

pp. 305-328 ◽

Cited By ~ 46

Author(s):

L. KARTTUNEN ◽

J-P. CHANOD ◽

G. GREFENSTETTE ◽

A. SCHILLE

Keyword(s):

Natural Language ◽

Regular Expression ◽

Regular Expressions ◽

Language Engineering ◽

Finite State Transducers ◽

Finite State ◽

Processing Steps

Many of the processing steps in natural language engineering can be performed using finite state transducers. An optimal way to create such transducers is to compile them from regular expressions. This paper is an introduction to the regular expression calculus, extended with certain operators that have proved very useful in natural language applications ranging from tokenization to light parsing. The examples in the paper illustrate in concrete detail some of these applications.
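
As a small illustration of the tokenization use case mentioned above, the following Python sketch tokenizes text with a single regular expression. It uses Python's `re` module, which is a backtracking matcher and does not compile to finite state transducers, and the token pattern is an assumption chosen for the example, not one taken from the paper.

```python
import re

# Hypothetical token pattern: words with an optional clitic ("Don't"),
# numbers with an optional decimal part, or any single punctuation mark.
TOKEN = re.compile(r"[A-Za-z]+(?:'[a-z]+)?|[0-9]+(?:\.[0-9]+)?|[^\sA-Za-z0-9]")


def tokenize(text):
    """Split text into word, number, and punctuation tokens."""
    return TOKEN.findall(text)


print(tokenize("Don't pay $3.50, ok?"))
```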

FOREST: An Interactive Multi-tree Synthesizer for Regular Expressions

Tools and Algorithms for the Construction and Analysis of Systems - Lecture Notes in Computer Science ◽

10.1007/978-3-030-72016-2_9 ◽

2021 ◽

pp. 152-169

Author(s):

Margarida Ferreira ◽

Miguel Terra-Neves ◽

Miguel Ventura ◽

Inês Lynce ◽

Ruben Martins

Keyword(s):

Real World ◽

Regular Expression ◽

User Interaction ◽

Search Space ◽

Satisfiability Modulo Theories ◽

Experimental Results ◽

Divide And Conquer ◽

Regular Expressions ◽

Smt Solver ◽

Synthesis Procedure

Form validators based on regular expressions are often used on digital forms to prevent users from inserting data in the wrong format. However, writing these validators can pose a challenge to some users. We present Forest, a regular expression synthesizer for digital form validations. Forest produces a regular expression that matches the desired pattern for the input values and a set of conditions over capturing groups that ensure the validity of integer values in the input. Our synthesis procedure is based on enumerative search and uses a Satisfiability Modulo Theories (SMT) solver to explore and prune the search space. We propose a novel representation for regular expression synthesis, multi-tree, which induces patterns in the examples and uses them to split the problem through a divide-and-conquer approach. We also present a new SMT encoding to synthesize capture conditions for a given regular expression. To increase confidence in the synthesized regular expression, we implement user interaction based on distinguishing inputs. We evaluated Forest on real-world form-validation instances using regular expressions. Experimental results show that Forest successfully returns the desired regular expression in 70% of the instances and outperforms Regel, a state-of-the-art regular expression synthesizer.
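
To make the notion of a regular expression paired with capture conditions concrete, here is a minimal Python sketch of what such a validator could look like once it exists. It does not show how Forest synthesizes anything. The date format, the pattern, and the integer bounds are hypothetical values chosen for illustration.

```python
import re

# Hypothetical validator for DD/MM/YYYY dates: a regex plus integer bounds
# on its capturing groups. Pattern and bounds are illustrative assumptions.
PATTERN = re.compile(r"([0-9]{2})/([0-9]{2})/([0-9]{4})")
CONDITIONS = [(1, 1, 31), (2, 1, 12)]  # (group index, minimum, maximum)


def is_valid(text):
    """Accept text only if it matches PATTERN and every capture condition holds."""
    match = PATTERN.fullmatch(text)
    if match is None:
        return False
    return all(lo <= int(match.group(g)) <= hi for g, lo, hi in CONDITIONS)


for sample in ["19/08/1996", "32/08/1996", "19-08-1996"]:
    print(sample, is_valid(sample))
```

The regex alone would accept "32/08/1996". The capture conditions are what reject it, which is why the two are produced together.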

Multi-modal program inference: a marriage of pre-trained language models and component-based synthesis

Proceedings of the ACM on Programming Languages ◽

10.1145/3485535 ◽

2021 ◽

Vol 5 (OOPSLA) ◽

pp. 1-29

Author(s):

Kia Rahmani ◽

Mohammad Raza ◽

Sumit Gulwani ◽

Vu Le ◽

Daniel Morris ◽

...

Keyword(s):

Natural Language ◽

State Of The Art ◽

Program Synthesis ◽

Language Models ◽

Regular Expressions ◽

Natural Languages ◽

Modal Synthesis ◽

Combination Approach ◽

Specialized System ◽

Correct Code

Multi-modal program synthesis refers to the task of synthesizing programs (code) from their specification given in different forms, such as a combination of natural language and examples. Examples provide a precise but incomplete specification, and natural language provides an ambiguous but more "complete" task description. Machine-learned pre-trained models (PTMs) are adept at handling ambiguous natural language, but struggle with generating syntactically and semantically precise code. Program synthesis techniques can generate correct code, often even from incomplete but precise specifications, such as examples, but they are unable to work with the ambiguity of natural languages. We present an approach that combines PTMs with component-based synthesis (CBS): PTMs are used to generate candidate programs from the natural language description of the task, which are then used to guide the CBS procedure to find the program that matches the precise examples-based specification. We use our combination approach to instantiate multi-modal synthesis systems for two programming domains: the domain of regular expressions and the domain of CSS selectors. Our evaluation demonstrates the effectiveness of our domain-agnostic approach in comparison to a state-of-the-art specialized system, and the generality of our approach in providing multi-modal program synthesis from natural language and examples in different programming domains.

Rewriting Minimizations for Efficient Query Answering over Ontologies

International Journal of Artificial Intelligence Tools ◽

10.1142/s0218213017600247 ◽

2017 ◽

Vol 26 (05) ◽

pp. 1760024 ◽

Cited By ~ 1

Author(s):

Tassos Venetis ◽

Giorgos Stoilos ◽

Vasilis Vassalos

Keyword(s):

Real World ◽

Experimental Evaluation ◽

State Of The Art ◽

Query Answering ◽

Current Paper ◽

Query Rewriting ◽

Speed Up ◽

Using Data ◽

The Given ◽

Data Constraints

Computing a Union of Conjunctive Queries (UCQ) rewriting ℛ for an input query and ontology and evaluating it over the given dataset is a prominent approach to query answering over ontologies. However, ℛ can be large and complex in structure, so additional techniques, like query subsumption and data constraints, need to be employed in order to minimize ℛ and lead to an efficient evaluation. Although these techniques are sound in theory, implementing many of them efficiently and effectively in practice can be challenging. For example, many systems do not implement query subsumption. In the current paper we present several practical techniques for UCQ rewriting minimization. First, we present an optimized algorithm for eliminating redundant (w.r.t. subsumption) queries as well as a novel framework for rewriting minimization using data constraints. Second, we show how these techniques can also be used to speed up the computation of ℛ in the first place. Third, we integrated all our techniques in our query rewriting system IQAROS and conducted an extensive experimental evaluation using many artificial as well as challenging real-world ontologies, obtaining encouraging results: in the vast majority of cases, our system is more efficient than the two most popular state-of-the-art systems.
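
Eliminating queries that are redundant with respect to subsumption can be sketched in a few lines of Python. This is the textbook homomorphism check applied pairwise, not the optimized algorithm of the paper or anything from IQAROS. The query encoding (a head tuple plus a tuple of atoms, with variables written as strings starting with `?`) and all function names are assumptions made for the example.

```python
def is_var(term):
    """Variables are strings starting with '?'; everything else is a constant."""
    return term.startswith("?")


def bind(args, target_args, mapping):
    """Extend mapping so args map onto target_args, or return None on a clash."""
    new = dict(mapping)
    for a, t in zip(args, target_args):
        if is_var(a):
            if new.setdefault(a, t) != t:
                return None
        elif a != t:
            return None
    return new


def extend(atoms, targets, mapping):
    """Backtracking search for a homomorphism covering all remaining atoms."""
    if not atoms:
        return True
    (pred, args), rest = atoms[0], atoms[1:]
    for t_pred, t_args in targets:
        if t_pred == pred and len(t_args) == len(args):
            new = bind(args, t_args, mapping)
            if new is not None and extend(rest, targets, new):
                return True
    return False


def subsumes(general, specific):
    """True if every answer to `specific` is also an answer to `general`."""
    (g_head, g_body), (s_head, s_body) = general, specific
    if len(g_head) != len(s_head):
        return False
    mapping = bind(g_head, s_head, {})
    return mapping is not None and extend(list(g_body), set(s_body), mapping)


def minimize(ucq):
    """Drop subsumed queries; among equivalent queries keep the earliest."""
    kept = []
    for i, q in enumerate(ucq):
        redundant = any(
            subsumes(other, q) and (not subsumes(q, other) or j < i)
            for j, other in enumerate(ucq) if j != i
        )
        if not redundant:
            kept.append(q)
    return kept


# Q(?x) :- Teacher(?x)
q1 = (("?x",), (("Teacher", ("?x",)),))
# Q(?x) :- Teacher(?x), teaches(?x, ?y)   (subsumed by q1)
q2 = (("?x",), (("Teacher", ("?x",)), ("teaches", ("?x", "?y"))))
# Q(?x) :- Professor(?x)
q3 = (("?x",), (("Professor", ("?x",)),))

print(minimize([q1, q2, q3]) == [q1, q3])
```

The pairwise check is quadratic in the number of queries and each check is itself a search, which suggests why the abstract treats an efficient implementation as a contribution in its own right.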

Weakly Supervised Learning of Semantic Parsers for Mapping Instructions to Actions

Transactions of the Association for Computational Linguistics ◽

10.1162/tacl_a_00209 ◽

2013 ◽

Vol 1 ◽

pp. 49-62 ◽

Cited By ~ 51

Author(s):

Yoav Artzi ◽

Luke Zettlemoyer

Keyword(s):

Natural Language ◽

Supervised Learning ◽

State Of The Art ◽

Semantic Parsing ◽

Weak Supervision ◽

Instruction Sets ◽

Strong Signal ◽

Previous State ◽

Strong Performance ◽

Weakly Supervised

The context in which language is used provides a strong signal for learning to recover its meaning. In this paper, we show it can be used within a grounded CCG semantic parsing approach that learns a joint model of meaning and context for interpreting and executing natural language instructions, using various types of weak supervision. The joint nature provides crucial benefits by allowing situated cues, such as the set of visible objects, to directly influence learning. It also enables algorithms that learn while executing instructions, for example by trying to replicate human actions. Experiments on a benchmark navigational dataset demonstrate strong performance under differing forms of supervision, including correctly executing 60% more instruction sets relative to the previous state of the art.

Localizing Natural Language in Videos

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v33i01.33018175 ◽

2019 ◽

Vol 33 ◽

pp. 8175-8182 ◽

Cited By ~ 13

Author(s):

Jingyuan Chen ◽

Lin Ma ◽

Xinpeng Chen ◽

Zequn Jie ◽

Jiebo Luo

Keyword(s):

Natural Language ◽

Video Sequence ◽

State Of The Art ◽

Recurrent Networks ◽

The Public ◽

Fine Grained ◽

Proposed Model ◽

Boundary Model ◽

The Given ◽

Language Description

In this paper, we consider the task of natural language video localization (NLVL): given an untrimmed video and a natural language description, the goal is to localize a segment in the video which semantically corresponds to the given natural language description. We propose a localizing network (LNet), working in an end-to-end fashion, to tackle the NLVL task. We first match the natural sentence and video sequence by cross-gated attended recurrent networks to exploit their fine-grained interactions and generate a sentence-aware video representation. A self interactor is proposed to perform cross-frame matching, which dynamically encodes and aggregates the matching evidence. Finally, a boundary model is proposed to locate the positions of video segments corresponding to the natural sentence description by predicting the starting and ending points of the segment. Extensive experiments conducted on the public TACoS and DiDeMo datasets demonstrate that our proposed model performs effectively and efficiently against the state-of-the-art approaches.

1243-P: Novel Use of Natural Language Processing to Identify Reasons for Insulin Discontinuation in Patients with T2DM: A Real-World Evidence Study

Diabetes ◽

10.2337/db19-1243-p ◽

2019 ◽

Vol 68 (Supplement 1) ◽

pp. 1243-P

Author(s):

JIANMIN WU ◽

FRITHA J. MORRISON ◽

ZHENXIANG ZHAO ◽

XUANYAO HE ◽

MARIA SHUBINA ◽

...

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Real World ◽

Real World Evidence

A novel optimal multi-pattern matching method with wildcards for DNA sequence

Technology and Health Care ◽

10.3233/thc-218012 ◽

2021 ◽

Vol 29 ◽

pp. 115-124

Author(s):

Xinlu Wang ◽

Ahmed A.F. Saif ◽

Dayou Liu ◽

Yungang Zhu ◽

Jon Atli Benediktsson

Keyword(s):

Dna Sequence ◽

Pattern Matching ◽

Health Informatics ◽

State Of The Art ◽

Machine Language ◽

Data Sets ◽

Fundamental Issue ◽

Matching Method ◽

Dna Sequence Alignment ◽

The Given

BACKGROUND: DNA sequence alignment is one of the most fundamental and important operations for identifying which gene family may contain a given sequence, and pattern matching for DNA sequences has been a fundamental issue in biomedical engineering, biotechnology and health informatics. OBJECTIVE: To solve this problem, this study proposes an optimal multi-pattern matching method with wildcards for DNA sequences. METHODS: The proposed method packs the patterns and a sliding window of the text; the window slides along the given packed text, matching against the stored packed patterns. RESULTS: Three data sets were used to test the performance of the proposed algorithm, which was seen to be more efficient than its competitors because its operation is close to machine language. CONCLUSIONS: Theoretical analysis and experimental results both demonstrate that the proposed method outperforms the state-of-the-art methods and is especially effective for DNA sequences.
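
For readers unfamiliar with bit-level pattern matching, the following Python sketch shows wildcard matching on DNA with the classic Shift-And bit-parallel algorithm. This is a swapped-in technique, not the packed sliding-window method the abstract describes, and it handles a single pattern where the paper handles several. The function name and the choice of `N` as the wildcard symbol are assumptions.

```python
def find_with_wildcards(text, pattern, wildcard="N"):
    """Return start indices where pattern matches text; wildcard matches any base."""
    m = len(pattern)
    if m == 0:
        return []
    # masks[c] has bit i set when pattern position i accepts base c.
    masks = {c: 0 for c in "ACGT"}
    for i, p in enumerate(pattern):
        for c in masks:
            if p == wildcard or p == c:
                masks[c] |= 1 << i
    accept = 1 << (m - 1)  # bit set when the whole pattern has been matched
    state = 0              # bit i set: pattern[0..i] matches text ending here
    hits = []
    for pos, c in enumerate(text):
        # Advance every partial match by one base and try to start a new one.
        state = ((state << 1) | 1) & masks.get(c, 0)
        if state & accept:
            hits.append(pos - m + 1)
    return hits


print(find_with_wildcards("ACGTACGA", "ACGN"))
```

Each text character costs one shift, one OR, and one AND on a machine word (for patterns up to the word size), which is the sense in which such methods operate close to the machine level.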

Report on the 4th Joint Workshop on Bibliometric-Enhanced Information Retrieval and Natural Language Processing for Digital Libraries at SIGIR 2019

ACM SIGIR Forum ◽

10.1145/3458553.3458554 ◽

2019 ◽

Vol 53 (2) ◽

pp. 3-10

Author(s):

Muthu Kumar Chandrasekaran ◽

Philipp Mayr

Keyword(s):

Information Retrieval ◽

Natural Language Processing ◽

Natural Language ◽

Research And Development ◽

Language Processing ◽

Digital Libraries ◽

State Of The Art ◽

Shared Task ◽

Processing Information ◽

Joint Workshop

The 4th joint BIRNDL workshop was held at the 42nd ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2019) in Paris, France. BIRNDL 2019 intended to stimulate IR researchers and digital library professionals to elaborate on new approaches in natural language processing, information retrieval, scientometrics, and recommendation techniques that can advance the state-of-the-art in scholarly document understanding, analysis, and retrieval at scale. The workshop incorporated different paper sessions and the 5th edition of the CL-SciSumm Shared Task.
