Bootstrapping an End-to-End Natural Language Interface for Databases

Nathaniel Weir, Prasetya Utama

doi:10.1145/3299869.3300105

Abstract

The ability to extract insights from data is critical for decision making. Intuitive natural language interfaces to databases give non-technical users an efficient and effective way to formulate complex questions and information needs. A recent trend in the area of Natural Language Interfaces for Databases (NLIDBs) has been the use of neural machine translation models to synthesize executable Structured Query Language (SQL) queries from natural language utterances. The main bottleneck in this type of approach is the acquisition of examples for training the model. Recent work has assumed access to a rich, manually curated training set for a given target database. However, this assumption ignores the large manual overhead required to curate the training set for any new database. As a result, NLIDB systems that can simply 'plug in' to any new database and perform effectively for naive users have yet to make their way into commercial products. Here we present DBPal, an end-to-end NLIDB framework in which a neural translation model is trained for any new database schema with minimal manual overhead. In addition to DBPal being the first off-the-shelf, neural machine translation-based system of its kind, the contributions of our project are 1) its use of a synthetic training set generation pipeline to bootstrap a translation model without requiring manually curated data, and 2) its use of state-of-the-art multi-task and cross-domain learning techniques that increase the robustness of the translation model towards unseen linguistic phenomena in new domains. In experiments, we show that our system achieves competitive performance on recently released benchmarks for NL-to-SQL translation. Through ablation experiments, we show the benefit of cross-domain learning techniques for the performance of the system. In a user study, we show that DBPal outperforms a well-known rule-based NLIDB and performs comparably to an approach using a similar neural model that relies on manually curated data.

Full Text