Abstract

We present Spider, a large-scale, complex and cross-domain semantic parsing and text-to-SQL dataset annotated by 11 college students. It consists of 10,181 questions and 5,693 unique complex SQL queries on 200 databases with multiple tables, covering 138 different domains. We define a new complex and cross-domain semantic parsing and text-to-SQL task in which different complex SQL queries and databases appear in the train and test sets. In this way, the task requires the model to generalize well to both new SQL queries and new database schemas. Spider is therefore distinct from most previous semantic parsing tasks, which all use a single database and have the exact same programs in the train set and the test set. We experiment with various state-of-the-art models, and the best model achieves only 9.7% exact matching accuracy on a database split setting. This shows that Spider presents a strong challenge for future research. Our dataset and task with the most recent updates are publicly available at https://yale-lily.github.io/seq2sql/spider.
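As a rough illustration of the exact matching accuracy reported above — not the paper's actual metric, which decomposes each SQL query into clause-level components — a normalized string comparison over predicted and gold queries might be sketched as follows. The function names and normalization choices here are illustrative assumptions:

```python
import re

def normalize_sql(sql):
    """Crude normalization: lowercase, collapse whitespace, drop a
    trailing semicolon. (Illustrative only; the paper's metric
    compares individual SQL components rather than raw strings.)"""
    sql = sql.strip().rstrip(";").lower()
    return re.sub(r"\s+", " ", sql)

def exact_match_accuracy(predictions, golds):
    """Fraction of predictions whose normalized form equals the gold query."""
    hits = sum(normalize_sql(p) == normalize_sql(g)
               for p, g in zip(predictions, golds))
    return hits / len(golds)
```

A string-level comparison like this would over-penalize semantically equivalent queries (e.g., reordered WHERE conditions), which is why a component-based decomposition is the more meaningful measure.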

Highlights

  • Semantic parsing (SP) is one of the most important tasks in natural language processing (NLP), yet existing datasets (e.g., Iyer et al., 2017) are too small in terms of number of programs for training modern data-intensive models and have only a single database

  • In order to test a model’s real semantic parsing performance on unseen complex programs and its ability to generalize to new domains, an SP dataset that includes a large number of complex programs and databases with multiple tables is a must

  • To address the need for a large and high-quality dataset for a new complex and cross-domain semantic parsing task, we introduce Spider, which consists of 200 databases with multiple tables, 10,181 questions, and 5,693 corresponding complex SQL queries, all written by 11 college students spending a total of 1,000 man-hours

Summary

Introduction

Existing datasets (e.g., Iyer et al., 2017) are too small in terms of number of programs for training modern data-intensive models. In order to test a model’s real semantic parsing performance on unseen complex programs and its ability to generalize to new domains, an SP dataset that includes a large number of complex programs and databases with multiple tables is a must. To address the need for a large and high-quality dataset for a new complex and cross-domain semantic parsing task, we introduce Spider, which consists of 200 databases with multiple tables, 10,181 questions, and 5,693 corresponding complex SQL queries, all written by 11 college students spending a total of 1,000 man-hours. Since Spider contains 200 databases with foreign keys, we can split the dataset with complex SQL queries in a way that no database overlaps between train and test. This overcomes the two shortcomings of prior datasets and defines a new semantic parsing task in which the model needs to generalize both to new programs and to new databases. The low performance of current state-of-the-art models suggests that there is large room for improvement.
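The database-level split described here — ensuring no database appears in both train and test — can be sketched in a few lines. The `db_id` field name and the helper below are illustrative assumptions, not part of the Spider release:

```python
import random

def split_by_database(examples, test_fraction=0.2, seed=0):
    """Split (question, SQL, db_id) examples so that no database
    appears in both train and test, mirroring the cross-database
    setup described for the Spider task."""
    db_ids = sorted({ex["db_id"] for ex in examples})
    rng = random.Random(seed)
    rng.shuffle(db_ids)
    n_test = max(1, int(len(db_ids) * test_fraction))
    test_dbs = set(db_ids[:n_test])
    train = [ex for ex in examples if ex["db_id"] not in test_dbs]
    test = [ex for ex in examples if ex["db_id"] in test_dbs]
    return train, test
```

Splitting on database identity, rather than on individual examples, is what forces a model to handle unseen schemas at test time.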

Related Work and Existing Datasets
Corpus Construction
Database Collection and Creation
Question and SQL Annotation
SQL Review
Question Review and Paraphrase
Final Review
Dataset Statistics and Comparison
Task Definition
Evaluation Metrics
Methods
Experimental Results and Discussion
Conclusion
