Abstract

A large number of reading comprehension (RC) datasets have been created recently, but little analysis has been done on whether they generalize to one another, and the extent to which existing datasets can be leveraged for improving performance on new ones. In this paper, we conduct such an investigation over ten RC datasets, training on one or more source RC datasets, and evaluating generalization, as well as transfer to a target RC dataset. We analyze the factors that contribute to generalization, and show that training on a source RC dataset and transferring to a target dataset substantially improves performance, even in the presence of powerful contextual representations from BERT (Devlin et al., 2019). We also find that training on multiple source RC datasets leads to robust generalization and transfer, and can reduce the cost of example collection for a new RC dataset. Following our analysis, we propose MultiQA, a BERT-based model, trained on multiple RC datasets, which leads to state-of-the-art performance on five RC datasets. We share our infrastructure for the benefit of the research community.

Highlights

  • Reading comprehension (RC) is concerned with reading a piece of text and answering questions about it (Richardson et al., 2013; Berant et al., 2014; Hermann et al., 2015; Rajpurkar et al., 2016)

  • We find that the answer to whether pre-training on source RC datasets still helps in the presence of BERT is a conclusive yes, as we obtain consistent improvements in our BERT-based RC model

  • We find that training on multiple source RC datasets is effective for both generalization and transfer

Summary

Introduction

Reading comprehension (RC) is concerned with reading a piece of text and answering questions about it (Richardson et al., 2013; Berant et al., 2014; Hermann et al., 2015; Rajpurkar et al., 2016). An interesting question is whether pre-training on existing RC datasets improves performance on a new target dataset, even in the presence of powerful language representations from BERT. We find that pre-training on an RC dataset and fine-tuning on a target dataset substantially improves performance even in the presence of contextualized word representations (BERT). Moreover, when using the high-capacity BERT-large, one can train a single model on multiple RC datasets and obtain performance close to or better than the state of the art on all of them, without fine-tuning to a particular dataset. We will open-source our infrastructure, which will help researchers evaluate models on a large number of datasets and gain insight into the strengths and shortcomings of their methods; we hope this will accelerate progress in language understanding. The code for the AllenNLP models is available at http://github.com/alontalmor/multiqa
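
To make the recipe above concrete, the following is a minimal sketch, not the authors' AllenNLP code, of the two-stage setup: train a BERT span-extraction model on source RC data, then fine-tune the same weights on the target dataset. It is written against the HuggingFace transformers API; the toy examples, the train_on helper, and the hyperparameters are illustrative assumptions rather than the paper's actual configuration.

```python
import torch
from transformers import BertForQuestionAnswering, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForQuestionAnswering.from_pretrained("bert-base-uncased")
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

# Toy placeholders for real dataset loaders: each example is
# (question, context, answer_text), with the answer appearing verbatim in the context.
source_examples = [
    ("Who wrote Hamlet?",
     "Hamlet is a tragedy written by William Shakespeare.",
     "William Shakespeare"),
]
target_examples = [
    ("What does RC stand for?",
     "RC stands for reading comprehension.",
     "reading comprehension"),
]

def train_on(examples, epochs=1):
    """One pass of span-extraction training over a list of QA examples."""
    model.train()
    for _ in range(epochs):
        for question, context, answer in examples:
            enc = tokenizer(question, context, return_tensors="pt",
                            truncation=True, max_length=384)
            # Convert the answer's character span in the context into token indices.
            char_start = context.index(answer)
            start = enc.char_to_token(char_start, sequence_index=1)
            end = enc.char_to_token(char_start + len(answer) - 1, sequence_index=1)
            loss = model(**enc,
                         start_positions=torch.tensor([start]),
                         end_positions=torch.tensor([end])).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

# Stage 1: train on (a mixture of) source RC datasets.
train_on(source_examples)
# Stage 2: fine-tune on the target RC dataset's training data.
train_on(target_examples)
```

Evaluating the stage-1 model directly on the target development set corresponds to the generalization setting studied in the paper, while evaluating after stage 2 corresponds to transfer.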

Datasets
Models
Do models generalize to unseen datasets?
Does pre-training improve results on small datasets?
Does context augmentation improve performance?
MULTIQA
Findings
Conclusions
