Abstract

Automatic Question Generation (AQG) systems are applied in many domains to generate questions from sources such as documents, images, and knowledge graphs. With rising interest in AQG systems, it is equally important to account for structured data such as tables when generating questions from documents. In this paper, we propose a single model architecture for question generation from tables along with text using the “Text-to-Text Transfer Transformer” (T5), a fully end-to-end model that does not rely on intermediate planning steps, delexicalization, or copy mechanisms. We also present our systematic approach to modifying the ToTTo dataset and release the augmented dataset as TabQGen, along with the scores achieved using T5 as a baseline, to aid further research.

Highlights

  • The development of end-to-end supervised Question-Answering (QA) models has been accelerated with the advent of large-scale datasets

  • The Stanford Question Answering Dataset (SQuAD) [5] is a reading comprehension dataset composed of questions from Wikipedia articles, with the answer to each question being a part of the corresponding reading passage

  • We emphasize the need for Automatic Question Generation (AQG) systems to effectively utilize all the available data in source documents and propose an Answer-Aware Question Generation system using T5 to generate questions from both tabular and textual data
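One way to picture answer-aware question generation over mixed inputs is to flatten the table, the passage, and the target answer into a single string that a text-to-text model such as T5 could consume. The sketch below is illustrative only: the function names, the `generate question:` prefix, the `[ROW]` separator, and the sample data are all assumptions made for this example, not the input format used by the authors or by TabQGen.

```python
from typing import List


def linearize_table(header: List[str], rows: List[List[str]]) -> str:
    """Flatten a table into one string, pairing each cell with its column name.

    The "col: value | col: value [ROW] ..." layout is a hypothetical format
    chosen for illustration, not the paper's serialization scheme.
    """
    row_strings = []
    for row in rows:
        cells = [f"{col}: {val}" for col, val in zip(header, row)]
        row_strings.append(" | ".join(cells))
    return " [ROW] ".join(row_strings)


def build_qg_input(answer: str, passage: str,
                   header: List[str], rows: List[List[str]]) -> str:
    """Combine answer, passage text, and linearized table into one source string.

    Placing the answer in the input is what makes the setup answer-aware:
    the model is asked for a question whose answer is the given span.
    """
    table_text = linearize_table(header, rows)
    return (f"generate question: answer: {answer} "
            f"context: {passage} table: {table_text}")


# Made-up sample data, used only to show the shape of the resulting string.
example = build_qg_input(
    answer="1998",
    passage="The club has won several national titles.",
    header=["Year", "Title"],
    rows=[["1998", "National Cup"], ["2003", "League"]],
)
print(example)
```

A string built this way could then be tokenized and passed to a sequence-to-sequence model, with the reference question as the target. Keeping the column name next to each cell value is a design choice that preserves some of the table structure after flattening.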


Summary

Introduction

The development of end-to-end supervised Question-Answering (QA) models has been accelerated by the advent of large-scale datasets. The Stanford Question Answering Dataset (SQuAD) [5] is a reading comprehension dataset composed of questions from Wikipedia articles, with the answer to each question being a part of the corresponding reading passage. Microsoft Machine Reading Comprehension (MS MARCO) [6] is a large-scale dataset focused on reading comprehension, question answering, passage ranking, keyphrase extraction, and conversational search studies. TriviaQA [7] is a realistic text-based question-answer dataset with 950K question-answer pairs extracted from Wikipedia and the internet. Because the answers to its questions may not be obtainable via span prediction, TriviaQA is more challenging than traditional QA benchmark datasets such as SQuAD. DuoRC [8] comprises 186K distinct question-answer pairs derived from 7680 pairs of movie plots, each pair representing two different versions of the same film, and highlights the challenges of combining knowledge and reasoning in neural architectures for reading comprehension.

