Abstract

Natural language interfaces to databases (NLIDB) have been a research topic for decades. Significant data collections are available in the form of databases, and a system that can translate a natural language query into a structured one can make a huge difference in utilizing them for research purposes. Efforts toward such systems have relied on pipelining methods for more than a decade, integrating natural language processing techniques with data science methods. With significant advancements in machine learning and natural language processing, NLIDB with deep learning has emerged as a new research trend, and deep learning has shown potential for rapid growth and improvement on text-to-SQL tasks. In deep learning NLIDB, closing the semantic gap when predicting the columns a user intends has become one of the critical and fundamental problems in this field. Contributions toward this issue have included preprocessed input features and the encoding of schema elements before they reach, and thereby more strongly influence, the targeted model. Despite this significant body of work, column prediction remains one of the critical issues in developing NLIDB. Working toward closing the semantic gap between user intention and predicted columns, we present an approach for deep learning text-to-SQL tasks that includes column occurrence scores from previous queries as an additional input feature. Because overall exact match accuracy depends significantly on column prediction, emphasizing column prediction accuracy also improves the overall result. For this purpose, we extract query fragments from previous queries and compute column occurrence and co-occurrence scores, which are then processed as input features for an encoder–decoder-based text-to-SQL model. These scores contribute, as a factor, the probability that columns and tables have already been used together in the query history. We evaluated our approach on Spider, a currently popular and complex text-to-SQL dataset spanning multiple databases, which provides query–question pairs along with schema information. We compared our exact match accuracy with that of a base model using its training and test splits. Our model outperformed the base model's accuracy, and accuracy was further boosted in experiments with the pretrained language model BERT.
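
To make the idea concrete, the sketch below shows one way occurrence and co-occurrence scores could be computed from a query history. The function names, the regular-expression column matching, and the normalization are our own illustrative assumptions, not the exact procedure used in the paper.

```python
# Illustrative sketch (assumed helpers, not the authors' released code):
# build column occurrence and co-occurrence scores from a history of SQL queries.
import re
from collections import Counter
from itertools import combinations

def extract_columns(sql, schema_columns):
    """Return the schema columns mentioned in one SQL query (case-insensitive match)."""
    tokens = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", sql.lower())
    return [tok for tok in tokens if tok in schema_columns]

def occurrence_scores(query_history, schema_columns):
    """Count how often each column (and each column pair) appears across past queries,
    then normalize to [0, 1] so the scores can be used as model input features."""
    col_counts, pair_counts = Counter(), Counter()
    for sql in query_history:
        cols = sorted(set(extract_columns(sql, schema_columns)))
        col_counts.update(cols)
        pair_counts.update(combinations(cols, 2))
    max_col = max(col_counts.values(), default=1)
    max_pair = max(pair_counts.values(), default=1)
    occurrences = {c: n / max_col for c, n in col_counts.items()}
    co_occurrences = {p: n / max_pair for p, n in pair_counts.items()}
    return occurrences, co_occurrences

# Toy usage: two previous queries over a "student" table
history = ["SELECT name FROM student WHERE age > 20",
           "SELECT name, age FROM student ORDER BY age"]
occ, co_occ = occurrence_scores(history, {"name", "age"})
print(occ)      # {'age': 1.0, 'name': 1.0}
print(co_occ)   # {('age', 'name'): 1.0}
```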

Highlights

  • We focus on a natural language interface to databases through the generation of executable structured query language (SQL) queries

  • Our decoder is similar to that of the base model SyntaxSQLNet; our column occurrences score in the encoder allows the model to capture user intentions when predicting columns from the database

  • We extended the base SyntaxSQLNet approach to emphasize column prediction accuracy with the help of a column occurrences scores graph (see the sketch following these highlights)
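
As a purely illustrative sketch (not SyntaxSQLNet's actual implementation), the snippet below shows how a precomputed, per-column occurrence score could be appended to column-name embeddings as an extra encoder feature; the module name, dimensions, and shapes are all assumptions.

```python
import torch
import torch.nn as nn

class ColumnEncoder(nn.Module):
    """Encodes column-name embeddings together with a scalar occurrence score per column."""
    def __init__(self, emb_dim=300, hidden=120):
        super().__init__()
        # +1 input feature: the normalized occurrence score appended to each column embedding
        self.rnn = nn.LSTM(emb_dim + 1, hidden, batch_first=True, bidirectional=True)

    def forward(self, col_emb, occ_score):
        # col_emb:   (batch, num_cols, emb_dim)  e.g. averaged word embeddings of column names
        # occ_score: (batch, num_cols)           normalized occurrence/co-occurrence scores
        x = torch.cat([col_emb, occ_score.unsqueeze(-1)], dim=-1)
        out, _ = self.rnn(x)
        return out  # (batch, num_cols, 2 * hidden) column representations for the decoder

# Toy usage: a batch of 2 schemas, 10 columns each, 300-d name embeddings, precomputed scores
encoder = ColumnEncoder()
reps = encoder(torch.randn(2, 10, 300), torch.rand(2, 10))
print(reps.shape)  # torch.Size([2, 10, 240])
```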


Introduction

Because of the digital storage of data, enormous databases have come to contain substantial knowledge about an organization. These vast data repositories can support research in data analysis and in finding trends and patterns, according to any particular research goal. To get precise results from relational databases, users have needed to learn structured query languages such as SQL. However, not all experts of a particular domain, for example medicine, know structured query languages, which restricts access to organizational knowledge to a limited number of users. Word embeddings give similar words similar, and therefore closer, representations in a predefined vector space; this representation is learned from a predefined, fixed-size vocabulary. Word2Vec is a statistical technique for capturing the local meaning and context of words from a stand-alone text corpus [17,18].
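
As a hedged illustration of the Word2Vec idea mentioned above (the gensim library and all parameter choices are our assumptions, not part of the cited work [17,18]), the following trains skip-gram embeddings on a toy corpus so that words appearing in similar contexts receive nearby vectors.

```python
from gensim.models import Word2Vec

# A tiny stand-alone corpus of tokenized sentences
corpus = [
    ["show", "all", "students", "older", "than", "twenty"],
    ["list", "students", "and", "their", "ages"],
    ["show", "teachers", "and", "their", "departments"],
]

# vector_size: embedding dimensionality; window: local context width; sg=1 selects skip-gram
model = Word2Vec(sentences=corpus, vector_size=50, window=3, min_count=1, sg=1, epochs=100)

# Words used in similar contexts obtain similar (closer) vectors
print(model.wv.most_similar("students", topn=3))
```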
