Abstract

Question Answering (QA) has become a foundational research area in Natural Language Understanding (NLU), with widespread applications in search, personal digital assistants, and conversational systems. Despite the success of open-domain question answering, existing extractive question answering models pre-trained on Wikipedia articles (e.g., the SQuAD dataset) perform rather poorly in closed-domain and industrial scenarios. A major limitation in adapting question answering systems to such contexts is the scarcity and expensive annotation of domain-specific data. The wide applicability of QA models in enterprise systems is therefore severely hampered. In this paper, we address these challenges by introducing Qasar, a novel QA framework that uses self-supervised learning for efficient domain adaptation. We show, for the first time, the advantage of fine-tuning pre-trained QA models for closed domains on domain-specific questions and answers synthetically generated from relevant documents by large language models such as T5. We also propose a novel context retrieval component based on question-context semantic relatedness, which further boosts the accuracy of the Qasar framework. Experimental results show significant performance improvements on both open- and closed-domain QA datasets while requiring no labelling effort, which we believe will ease the deployment of such systems in enterprise settings. The different modules of our framework (synthetic data generation, context retrieval, and question answering) can be fully reproduced by fine-tuning publicly available language models and QA models on the SQuAD dataset, as discussed in the paper.
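To make the two pipeline stages named in the abstract concrete, here is a minimal sketch of synthetic question generation with a T5 model followed by semantic-relatedness context retrieval. It assumes the Hugging Face `transformers` and `sentence-transformers` libraries; the checkpoint names (`t5-base`, `all-MiniLM-L6-v2`), the `generate question:` task prefix, and the helper functions are illustrative assumptions, not the paper's exact configuration.

```python
# Hypothetical sketch of a Qasar-style pipeline: synthetic QA generation with
# T5 plus semantic-relatedness context retrieval. Model names and the prompt
# format below are assumptions for illustration, not the paper's settings.
from transformers import T5ForConditionalGeneration, T5Tokenizer
from sentence_transformers import SentenceTransformer, util

# --- Stage 1: generate synthetic domain-specific questions from documents ---
qg_tokenizer = T5Tokenizer.from_pretrained("t5-base")          # assumed checkpoint
qg_model = T5ForConditionalGeneration.from_pretrained("t5-base")

def generate_question(passage: str) -> str:
    # "generate question:" is an assumed task prefix; the paper's fine-tuned
    # generation setup may differ.
    inputs = qg_tokenizer("generate question: " + passage,
                          return_tensors="pt", truncation=True, max_length=512)
    outputs = qg_model.generate(**inputs, max_length=64, num_beams=4)
    return qg_tokenizer.decode(outputs[0], skip_special_tokens=True)

# --- Stage 2: retrieve the context most semantically related to a question ---
retriever = SentenceTransformer("all-MiniLM-L6-v2")            # assumed encoder

def retrieve_context(question: str, contexts: list[str]) -> str:
    q_emb = retriever.encode(question, convert_to_tensor=True)
    c_emb = retriever.encode(contexts, convert_to_tensor=True)
    scores = util.cos_sim(q_emb, c_emb)[0]   # cosine relatedness per context
    return contexts[int(scores.argmax())]

# Toy usage: generate a question from one document, then retrieve its context.
docs = ["The warranty covers hardware failures for two years.",
        "Support tickets are answered within one business day."]
question = generate_question(docs[0])
print(question, "->", retrieve_context(question, docs))
```

The synthetic question-answer pairs produced by a stage like the first would then be used to fine-tune a SQuAD-pre-trained extractive QA model on the closed domain, which is the adaptation step the abstract describes.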
