Abstract

Natural language interface to database (NLIDB) is a research area that is gradually merging natural language processing (NLP) and data science, especially since the rise of deep learning. Traditional NLIDBs were built on rules, syntactic and semantic grammars, and statistical and ontology-based techniques and methods from NLP. Most of those systems supported only controlled natural language queries (NLQs), where parsing errors and ambiguities had to be resolved manually by the users. As deep learning (DL) captures complex dependencies, the need for human intervention is decreasing. DL-based NLIDBs are again in focus, and many recent systems have achieved satisfactory performance using sequence-to-sequence DL models trained on semantic parsing text-to-SQL datasets. These end-to-end DL models require large, complex, cross-domain labeled datasets for exhaustive training. Since none of the existing text-to-SQL datasets is perfect, new datasets are produced every few years to meet the demand. Although datasets play a crucial role in natural language interfaces, very few review papers focus on them. This paper reviews the datasets used in different NLIDBs, discusses their importance, and presents a summary report.
