Abstract

Recent systems for converting natural language descriptions into regular expressions (regexes) have achieved some success, but typically deal with short, formulaic text and can only produce simple regexes. Real-world regexes are complex, hard to describe with brief sentences, and sometimes require examples to fully convey the user’s intent. We present a framework for regex synthesis in this setting where both natural language (NL) and examples are available. First, a semantic parser (either grammar-based or neural) maps the natural language description into an intermediate sketch, which is an incomplete regex containing holes to denote missing components. Then a program synthesizer searches over the regex space defined by the sketch and finds a regex that is consistent with the given string examples. Our semantic parser can be trained purely from weak supervision based on correctness of the synthesized regex, or it can leverage heuristically derived sketches. We evaluate on two prior datasets (Kushman and Barzilay, 2013; Locascio et al., 2016) and a real-world dataset from Stack Overflow. Our system achieves state-of-the-art performance on the prior datasets and solves 57% of the real-world dataset, which existing neural systems completely fail on.
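
The sketch-then-synthesize idea described above can be illustrated with a toy example. The following is a minimal sketch under loud assumptions, not the system's actual synthesizer: the hole marker `<HOLE>`, the small `COMPONENTS` list, and the functions `fill_holes`, `consistent`, and `synthesize` are all hypothetical names introduced here, and the brute-force enumeration stands in for whatever search the real synthesizer performs.

```python
import re
from itertools import product

# Hypothetical hole marker and candidate fillers (assumptions for illustration only).
HOLE = "<HOLE>"
COMPONENTS = ["[0-9]", "[a-z]", "[A-Z]", "[0-9]+", "[a-z]+", "[A-Z]+"]


def fill_holes(sketch, components=COMPONENTS):
    """Yield every concrete regex obtained by filling each hole in the sketch."""
    n_holes = sketch.count(HOLE)
    for choice in product(components, repeat=n_holes):
        regex = sketch
        for component in choice:
            # Fill the leftmost remaining hole.
            regex = regex.replace(HOLE, component, 1)
        yield regex


def consistent(regex, positives, negatives):
    """A regex is consistent if it matches all positives and no negatives."""
    return (all(re.fullmatch(regex, s) for s in positives)
            and not any(re.fullmatch(regex, s) for s in negatives))


def synthesize(sketch, positives, negatives):
    """Return the first completion of the sketch consistent with the examples."""
    for regex in fill_holes(sketch):
        if consistent(regex, positives, negatives):
            return regex
    return None  # no completion in this toy search space fits the examples


if __name__ == "__main__":
    result = synthesize(
        "<HOLE>-<HOLE>",
        positives=["ab-12", "xyz-7"],
        negatives=["ab-cd", "12-ab"],
    )
    print(result)
```

Here the sketch fixes the overall shape (two parts separated by a hyphen), while the examples decide which concrete components fill the holes. This mirrors the division of labor between the parser and the synthesizer, although the real regex space is far larger than this six-item list.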

Highlights

  • Regular expressions are widely used in various domains, but are notoriously difficult to write: regex is one of the most popular tags of posts on Stack Overflow, with over 200,000 posts

  • To test our model in a more realistic setting, we evaluate on a dataset of real-world regex synthesis problems from Stack Overflow

  • We investigate two paradigms of semantic parser: a seq-to-seq neural network parser and a grammar-based parser, as well as two ways of training the parser: maximum likelihood estimation based on a pseudo-gold sketch and maximum marginal likelihood based on whether the sketch leads to the correct synthesis result
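
The two training objectives named in the last highlight can be written out as follows. This is one common formulation, given here as an assumption rather than as the paper's exact equations. All notation is introduced for illustration: $x$ is the natural language description, $E$ the string examples, $s$ a sketch, $s^{*}$ the pseudo-gold sketch, $p_\theta$ the parser, and $\mathrm{ok}(s, E)$ an indicator that is 1 when synthesis from $s$ yields the correct regex.

```latex
% Maximum likelihood estimation against a pseudo-gold sketch s^*
\max_{\theta} \; \sum_{(x,\, s^{*})} \log p_{\theta}(s^{*} \mid x)

% Maximum marginal likelihood: reward any sketch whose synthesis result is correct
\max_{\theta} \; \sum_{(x,\, E)} \log \sum_{s} p_{\theta}(s \mid x)\, \mathrm{ok}(s, E)
```

The first objective needs a single target sketch per description, whereas the second only needs a signal of whether synthesis succeeded, which corresponds to the weak supervision mentioned in the abstract.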

Summary

Introduction

Regular expressions (regexes) are widely used in various domains, but are notoriously difficult to write: regex is one of the most popular tags of posts on Stack Overflow, with over 200,000 posts. Prior work on generating regexes from natural language descriptions spans the techniques of Ranta (1998), semantic parsing (Kushman and Barzilay, 2013), and seq-to-seq neural network models (Locascio et al., 2016; Zhong et al., 2018a; Park et al., 2019). While this prior work has achieved relatively high accuracy on benchmark datasets, trained models still do not generalize to real-world applications: these benchmarks describe simple regexes with short natural language descriptions and limited vocabulary. To test our model in a more realistic setting, we evaluate on a dataset of real-world regex synthesis problems from Stack Overflow. These problems organically have English language descriptions and paired examples that the user wrote to communicate their intent. This dataset is small, with only 62 examples; to handle this setting more robustly without large-scale training data, we instantiate our sketch framework with a grammar-based semantic parser. While more data is needed, this dataset can motivate further work on more challenging regex synthesis problems.

Regex Synthesis Framework
Neural Parser
Grammar-Based Parser
Training
Datasets
StackOverflow
Dataset Preprocessing
Experiments
Evaluation
Results
Detailed Analysis
Examples of Success And Failure Pairs
Related Work
Conclusion