Abstract

We present Spider, a large-scale, complex and cross-domain semantic parsing and text-to-SQL dataset annotated by 11 college students. It consists of 10,181 questions and 5,693 unique complex SQL queries on 200 databases with multiple tables, covering 138 different domains. We define a new complex and cross-domain semantic parsing and text-to-SQL task in which different complex SQL queries and databases appear in the train and test sets. In this way, the task requires the model to generalize well to both new SQL queries and new database schemas. Spider is therefore distinct from most previous semantic parsing tasks, which all use a single database and have the exact same programs in the train set and the test set. We experiment with various state-of-the-art models, and the best model achieves only 9.7% exact matching accuracy on a database split setting. This shows that Spider presents a strong challenge for future research. Our dataset and task with the most recent updates are publicly available at https://yale-lily.github.io/seq2sql/spider.
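As a rough illustration of the exact matching accuracy reported above — not the paper's actual metric, which decomposes each SQL query into clause-level components — a normalized string comparison over predicted and gold queries might be sketched as follows. The function names and normalization choices here are illustrative assumptions:

```python
import re

def normalize_sql(sql):
    """Crude normalization: lowercase, collapse whitespace, drop a
    trailing semicolon. (Illustrative only; the paper's metric
    compares individual SQL components rather than raw strings.)"""
    sql = sql.strip().rstrip(";").lower()
    return re.sub(r"\s+", " ", sql)

def exact_match_accuracy(predictions, golds):
    """Fraction of predictions whose normalized form equals the gold query."""
    hits = sum(normalize_sql(p) == normalize_sql(g)
               for p, g in zip(predictions, golds))
    return hits / len(golds)
```

A string-level comparison like this would over-penalize semantically equivalent queries (e.g., reordered WHERE conditions), which is why a component-based decomposition is the more meaningful measure.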

Highlights

  • Semantic parsing (SP) is one of the most important tasks in natural language processing (NLP), yet existing datasets (e.g., Iyer et al., 2017) are too small in terms of number of programs for training modern data-intensive models and have only a single database

  • In order to test a model’s real semantic parsing performance on unseen complex programs and its ability to generalize to new domains, an SP dataset that includes a large number of complex programs and databases with multiple tables is a must

  • To address the need for a large and high-quality dataset for a new complex and cross-domain semantic parsing task, we introduce Spider, which consists of 200 databases with multiple tables, 10,181 questions, and 5,693 corresponding complex SQL queries, all written by 11 college students spending a total of 1,000 man-hours

Summary

Introduction

Existing datasets (e.g., Iyer et al., 2017) are too small in terms of number of programs for training modern data-intensive models. In order to test a model’s real semantic parsing performance on unseen complex programs and its ability to generalize to new domains, an SP dataset that includes a large number of complex programs and databases with multiple tables is a must. To address the need for a large and high-quality dataset for a new complex and cross-domain semantic parsing task, we introduce Spider, which consists of 200 databases with multiple tables, 10,181 questions, and 5,693 corresponding complex SQL queries, all written by 11 college students spending a total of 1,000 man-hours. Since Spider contains 200 databases with foreign keys, we can split the dataset with complex SQL queries in a way that no database overlaps between train and test. This overcomes the two shortcomings of prior datasets and defines a new semantic parsing task in which the model needs to generalize both to new programs and to new databases. The low performance of current state-of-the-art models suggests that there is large room for improvement.
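The database-level split described here — ensuring no database appears in both train and test — can be sketched in a few lines. The `db_id` field name and the helper below are illustrative assumptions, not part of the Spider release:

```python
import random

def split_by_database(examples, test_fraction=0.2, seed=0):
    """Split (question, SQL, db_id) examples so that no database
    appears in both train and test, mirroring the cross-database
    setup described for the Spider task."""
    db_ids = sorted({ex["db_id"] for ex in examples})
    rng = random.Random(seed)
    rng.shuffle(db_ids)
    n_test = max(1, int(len(db_ids) * test_fraction))
    test_dbs = set(db_ids[:n_test])
    train = [ex for ex in examples if ex["db_id"] not in test_dbs]
    test = [ex for ex in examples if ex["db_id"] in test_dbs]
    return train, test
```

Splitting on database identity, rather than on individual examples, is what forces a model to handle unseen schemas at test time.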

Related Work and Existing Datasets
Corpus Construction
Database Collection and Creation
Question and SQL Annotation
SQL Review
Question Review and Paraphrase
Final Review
Dataset Statistics and Comparison
Task Definition
Evaluation Metrics
Methods
Experimental Results and Discussion
Conclusion
