Abstract

Most available semantic parsing datasets, comprising pairs of natural utterances and logical forms, were collected solely for training and evaluating natural language understanding systems. As a result, they lack the richness and variety of naturally occurring utterances, in which humans ask about data they need or are curious about. In this work, we release SEDE, a dataset of 12,023 pairs of utterances and SQL queries collected from real usage on the Stack Exchange website. We show that these pairs exhibit a variety of real-world challenges rarely reflected in other semantic parsing datasets, propose an evaluation metric based on comparing partial query clauses that is better suited to real-world queries, and conduct experiments with strong baselines, showing a large gap between performance on SEDE and on other common datasets.

Highlights

  • The task of mapping natural language into logical forms that can be executed on a database or knowledge graph has been studied mostly on academic datasets, where both the utterances and the queries were written as part of a dataset collection process (Hemphill et al., 1990; Zelle and Mooney, 1996; Yu et al., 2018), rather than in a natural process where users ask questions about data they need or are curious about

  • Compared to other Text-to-SQL datasets, we show that SEDE contains at least 10 times more SQL query templates than other datasets, and has the most diverse set of utterances and SQL queries of all single-domain datasets (an illustrative utterance-query pair is sketched after this list)

  • Standard evaluation metrics such as denotation accuracy and exact comparison of SQL components can often be used with relative success on existing datasets, but we found applying them to be a much greater challenge in SEDE
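To make the flavor of the data concrete, the sketch below shows what a SEDE-style record might look like: a natural question title paired with the T-SQL query a Stack Exchange user actually ran. The record, its field names, and the query text are constructed for illustration and are not taken from the dataset, though the Posts table, its columns, and the TOP/ORDER BY idioms are typical of the public Stack Exchange schema.

```python
# A constructed, SEDE-style utterance/query pair (illustrative only; the
# field names and contents are assumptions, not actual dataset records).
example_record = {
    "title": "Top 20 unanswered questions with the highest score",
    # Real SEDE queries are T-SQL against the public Stack Exchange schema.
    # Note the under-specification: the utterance never mentions PostTypeId.
    "query": """
        SELECT TOP 20 p.Id, p.Title, p.Score
        FROM Posts AS p
        WHERE p.PostTypeId = 1        -- questions only
          AND p.AnswerCount = 0       -- unanswered
        ORDER BY p.Score DESC
    """,
}
```

A clause such as `PostTypeId = 1` is exactly the kind of detail that is implicit in the utterance yet changes execution results entirely, which is why denotation accuracy is problematic for such real-world queries.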


Introduction

The task of mapping natural language into logical forms that can be executed on a database or knowledge graph has been studied mostly on academic datasets, where both the utterances and the queries were written as part of a dataset collection process (Hemphill et al., 1990; Zelle and Mooney, 1996; Yu et al., 2018), rather than in a natural process where users ask questions about data they need or are curious about. Evaluating parsers on such real-world queries is also challenging: denotation accuracy is inaccurate for under-specified utterances, where any single clause not mentioned in the question could entirely change the execution results, while exact-match comparison of SQL components (e.g. comparing all SELECT, WHERE, GROUP BY and ORDER BY clauses) is often too strict when queries are highly complex. While fully solving these issues remains an open problem, to at least partially address them we propose a softer version of the exact-match metric, PCM-F1, based on partially extracted query components, and show that this metric gives a better indication of model performance than common metrics, which yield scores close to 0. We hope that the unique and challenging properties exhibited in SEDE will pave a path for future work on generalizable semantic parsing in realistic settings.
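The paper computes PCM-F1 over sub-trees of the parsed queries (see "Sub-tree elements matching" in the outline below); the regex-based clause splitting and the helper names here are our simplified assumptions, not the authors' implementation. A minimal sketch of the underlying idea, scoring per-clause token overlap and averaging across clauses:

```python
import re

# Top-level clauses considered by this simplified scorer.
CLAUSES = ["SELECT", "FROM", "WHERE", "GROUP BY", "HAVING", "ORDER BY"]

def split_clauses(sql: str) -> dict:
    """Naively split a query into its top-level clauses.

    Illustrative only: this breaks on nested subqueries; the paper's
    PCM-F1 parses queries into sub-trees instead of splitting on keywords.
    """
    pattern = r"\b(" + "|".join(CLAUSES) + r")\b"
    parts = re.split(pattern, sql, flags=re.IGNORECASE)
    clauses = {}
    i = 1
    while i < len(parts) - 1:
        clauses[parts[i].upper()] = parts[i + 1].strip()
        i += 2
    return clauses

def f1(pred: set, gold: set) -> float:
    """F1 overlap between two sets of clause tokens."""
    if not pred and not gold:
        return 1.0
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

def pcm_f1(pred_sql: str, gold_sql: str) -> float:
    """Average per-clause F1 over all clauses present in either query."""
    pred, gold = split_clauses(pred_sql), split_clauses(gold_sql)
    keys = set(pred) | set(gold)
    scores = [
        f1(set(pred.get(k, "").split()), set(gold.get(k, "").split()))
        for k in keys
    ]
    return sum(scores) / len(scores) if scores else 1.0
```

For instance, `pcm_f1("SELECT Id FROM Posts WHERE Score > 10", "SELECT Id FROM Posts")` returns about 0.67, crediting the matching SELECT and FROM clauses rather than scoring the whole prediction 0 as exact-match comparison would.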

Background
Stack Exchange Data Explorer
Data cleaning
Dataset Characteristics
Limitations
Evaluation
Sub-tree elements matching
Experimental Setup
Main Results
PCM-F1 Validation
Error Analysis