A Smart System to Generate and Validate Question Answer Pairs for COVID-19 Literature

Rohan Bhambhoria, John Chen, Sedef Kocak, Dawn Sepehr, Conner Cowling, Luna Feng, Elham Dolatabadi

doi:10.18653/v1/2020.sdp-1.4

Abstract

Automatically generating question answer (QA) pairs from the rapidly growing coronavirus-related literature is of great value to the medical community. High-quality QA pairs would allow researchers to build models that answer scientific queries whose answers are not readily available, in support of the ongoing fight against the pandemic. QA pair generation is, however, a tedious and time-consuming task that requires domain expertise for annotation and evaluation. In this paper, we present our contribution to addressing some of the challenges of building a QA system without gold data. We first present a method to create QA pairs from a large semi-structured dataset using transformer and rule-based models. Next, we propose a means of engaging subject matter experts (SMEs) to annotate the QA pairs through a web application. Finally, we present experiments showing that active learning helps in designing a high-performing model with substantially lower annotation effort from the domain experts.
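To make the active learning idea concrete, the following minimal sketch shows one common query strategy, uncertainty sampling: the QA pairs whose predicted validity is closest to 0.5 are sent to the SMEs first. This is an illustration only. The abstract does not say which query strategy the authors used, and the function name `select_most_uncertain` and the scores in `pool_scores` are hypothetical.

```python
from typing import List, Sequence


def select_most_uncertain(probabilities: Sequence[float], k: int) -> List[int]:
    """Return indices of the k unlabeled items whose predicted probability
    of being a valid QA pair lies closest to 0.5 (assumed binary task)."""
    if k < 0:
        raise ValueError("k must be non-negative")
    # Distance from 0.5 measures confidence: 0.0 is maximally uncertain.
    ranked = sorted(
        range(len(probabilities)),
        key=lambda i: abs(probabilities[i] - 0.5),
    )
    return ranked[:k]


# Hypothetical classifier scores for five unlabeled QA pairs.
pool_scores = [0.97, 0.52, 0.10, 0.45, 0.80]

# Indices of the two pairs an annotator would be asked to label next.
to_annotate = select_most_uncertain(pool_scores, k=2)
print(to_annotate)
```

In a full loop, the newly labeled pairs would be added to the training set, the model retrained, and the remaining pool rescored before the next selection round.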

Highlights

Building a question answer (QA) system is a complex process requiring advanced text mining approaches (Jothi et al, 2015) and domain expertise for model evaluation
The XGBoost model using sentence embeddings produced by BioBERT improves the F1 score by 2% over the same model using BERT embeddings, since BioBERT is pre-trained on biomedical articles, which matches the domain of our experimental dataset
We propose a novel strategy consisting of transformer and rule-based methods to generate QA pairs from the scientific literature gathered in the CORD-19 dataset, while using a validation procedure to maintain quality

Summary

Introduction

Building a QA system is a complex process requiring advanced text mining approaches (Jothi et al, 2015) and domain expertise for model evaluation. Automatically generating question-answer pairs using recent advances in natural language processing (NLP) models has gained much attention from researchers and has achieved impressive results on various publicly available datasets. We explore the COVID-19 Open Research Dataset (CORD-19) (Wang et al, 2020), first introduced as part of a competition. The competition was launched as a call to action for machine learning researchers to assist the medical community in developing answers to high-priority scientific questions related to COVID-19. A major challenge in dealing with a large semi-structured dataset (i.e., scholarly articles) is the lack of gold data, which we aim to address in this work.

Objectives

Methods

Results

Conclusion