Sketch-Driven Regular Expression Generation from Natural Language and Examples

Xi Ye, Qiaochu Chen, Xinyu Wang, Isil Dillig, Greg Durrett

doi:10.1162/tacl_a_00339

Abstract

Recent systems for converting natural language descriptions into regular expressions (regexes) have achieved some success, but typically deal with short, formulaic text and can only produce simple regexes. Real-world regexes are complex, hard to describe with brief sentences, and sometimes require examples to fully convey the user’s intent. We present a framework for regex synthesis in this setting where both natural language (NL) and examples are available. First, a semantic parser (either grammar-based or neural) maps the natural language description into an intermediate sketch, which is an incomplete regex containing holes to denote missing components. Then a program synthesizer searches over the regex space defined by the sketch and finds a regex that is consistent with the given string examples. Our semantic parser can be trained purely from weak supervision based on correctness of the synthesized regex, or it can leverage heuristically derived sketches. We evaluate on two prior datasets (Kushman and Barzilay 2013; Locascio et al. 2016) and a real-world dataset from Stack Overflow. Our system achieves state-of-the-art performance on the prior datasets and solves 57% of the real-world dataset, which existing neural systems completely fail on.
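The two-stage idea in the abstract (a sketch with holes, then a search for a completion consistent with the examples) can be illustrated with a toy sketch in Python. This is not the paper's synthesizer or its regex DSL: the `synthesize` function, the `COMPONENTS` pool, and the use of `"{}"` to mark holes are all hypothetical choices made for illustration, and the search is a plain brute-force enumeration checked with Python's `re` module.

```python
import itertools
import re

# Hypothetical pool of regex fragments that may fill a hole. A real
# synthesizer would search a far richer space than this fixed list.
COMPONENTS = ["[0-9]", "[a-z]", "[A-Z]", "[0-9]+", "[a-z]+", "[A-Z]+"]


def synthesize(sketch, positives, negatives, components=COMPONENTS):
    """Return the first completion of `sketch` consistent with the examples.

    `sketch` is a format string in which each "{}" marks a hole (an
    assumption of this toy, not the paper's notation). A completion is
    consistent if it fully matches every positive example and no
    negative example. Returns None when no completion is consistent.
    """
    n_holes = sketch.count("{}")
    for filling in itertools.product(components, repeat=n_holes):
        candidate = sketch.format(*filling)
        pattern = re.compile(candidate)
        accepts_all = all(pattern.fullmatch(s) for s in positives)
        rejects_all = not any(pattern.fullmatch(s) for s in negatives)
        if accepts_all and rejects_all:
            return candidate
    return None


# Toy sketch for a description such as "letters, then a dash, then digits".
result = synthesize(
    "{}-{}",
    positives=["ab-12", "xyz-7"],
    negatives=["12-ab", "AB-1", "ab-"],
)
print(result)
```

In this toy, the sketch fixes the overall shape (two parts separated by a dash) and the examples disambiguate what goes in each hole, which mirrors the division of labor the abstract describes between the language description and the string examples.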

Highlights

Regular expressions are widely used in various domains, but are notoriously difficult to write: regex is one of the most popular tags of posts on Stack Overflow, with over 200,000 posts
To test our model in a more realistic setting, we evaluate on a dataset of real-world regex synthesis problems from Stack Overflow
We investigate two paradigms of semantic parser: a seq-to-seq neural network parser and a grammar-based parser, as well as two ways of training the parser: maximum likelihood estimation based on a pseudo-gold sketch and maximum marginal likelihood based on whether the sketch leads to the correct synthesis result
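The difference between the two training signals named in the last highlight can be sketched numerically. The snippet below is an illustrative assumption rather than the paper's training code: it supposes the parser assigns a log-probability to each sketch in a small candidate set, and that a flag records whether each sketch led to a correct synthesis result. The function names `mle_loss` and `mml_loss` are hypothetical.

```python
import math


def mle_loss(log_probs, gold_index):
    """Negative log-likelihood of a single pseudo-gold sketch."""
    return -log_probs[gold_index]


def mml_loss(log_probs, is_correct):
    """Negative log of the total probability of sketches that succeed.

    Sums probability mass over every candidate sketch whose synthesis
    result is correct, using a log-sum-exp for numerical stability.
    Returns None if no candidate succeeds (no training signal).
    """
    correct = [lp for lp, ok in zip(log_probs, is_correct) if ok]
    if not correct:
        return None
    m = max(correct)
    return -(m + math.log(sum(math.exp(lp - m) for lp in correct)))


# Three candidate sketches with probabilities 0.5, 0.3, and 0.2.
log_probs = [math.log(0.5), math.log(0.3), math.log(0.2)]

# MLE commits to one pseudo-gold sketch (here, the second candidate).
loss_mle = mle_loss(log_probs, gold_index=1)

# MML rewards any sketch that synthesizes correctly (second and third).
loss_mml = mml_loss(log_probs, is_correct=[False, True, True])
```

Under these assumptions, the marginal objective credits all 0.5 of the probability mass on successful sketches, while the likelihood objective credits only the 0.3 on the chosen pseudo-gold sketch.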

Summary

Introduction

Regular expressions (regexes) are widely used in various domains, but are notoriously difficult to write: regex is one of the most popular tags of posts on Stack Overflow, with over 200,000 posts. Prior work has studied generating regexes from natural language descriptions using rule-based techniques (Ranta, 1998), semantic parsing (Kushman and Barzilay, 2013), or seq-to-seq neural network models (Locascio et al., 2016; Zhong et al., 2018a; Park et al., 2019). While this prior work has achieved relatively high accuracy on benchmark datasets, trained models still do not generalize to real-world applications: these benchmarks describe simple regexes with short natural language descriptions and limited vocabulary. To test our model in a more realistic setting, we evaluate on a dataset of real-world regex synthesis problems from Stack Overflow. These problems organically have English language descriptions and paired examples that the user wrote to communicate their intent. This dataset is small, only 62 examples; to more robustly handle this setting without large-scale training data, we instantiate our sketch framework with a grammar-based semantic parser. While more data is needed, this dataset can motivate further work on more challenging regex synthesis problems.

Regex Synthesis Framework

Neural Parser

Grammar-Based Parser

Training

Datasets

StackOverflow

Dataset Preprocessing

Experiments

Evaluation

Results

Detailed Analysis

Examples of Success And Failure Pairs

Related Work

Conclusion


Journal: Transactions of the Association for Computational Linguistics
Publication Date: Dec 1, 2020
Citations: 18
License type: CC BY


