Abstract

BERT (Bidirectional Encoder Representations from Transformers) and related pre-trained Transformers have provided large gains across many language understanding tasks, achieving a new state of the art (SOTA). BERT is pre-trained on two auxiliary tasks: Masked Language Model and Next Sentence Prediction. In this paper we introduce a new pre-training task inspired by reading comprehension that shifts the emphasis of pre-training from memorization to understanding. Span Selection Pre-Training (SSPT) poses cloze-like training instances, but rather than drawing the answer from the model's parameters, the model selects it from a relevant passage. We find significant and consistent improvements over both BERT-BASE and BERT-LARGE on multiple Machine Reading Comprehension (MRC) datasets. In particular, our proposed model obtains SOTA results on Natural Questions, a new benchmark MRC dataset, outperforming BERT-LARGE by 3 F1 points on short answer prediction. We also show significant gains on HotpotQA, improving answer prediction F1 by 4 points and supporting fact prediction F1 by 1 point, outperforming the previous best system. Moreover, we show that our pre-training approach is particularly effective when training data is limited, substantially improving the learning curve.

Highlights

  • State-of-the-art approaches for NLP tasks are based on language models that are pre-trained on tasks which do not require labeled data (Peters et al., 2018; Howard and Ruder, 2018; Devlin et al., 2018; Yang et al., 2019; Liu et al., 2019; Sun et al., 2019)

  • We provide an extensive evaluation of the span selection pre-training method across four reading comprehension tasks: the Stanford Question Answering Dataset (SQuAD) in both versions 1.1 and 2.0, the Google Natural Questions dataset (Kwiatkowski et al., 2019), and a multi-hop Question Answering dataset, HotpotQA (Yang et al., 2018)

  • The input to BERT is a concatenation of two segments x1, . . . , xM and y1, . . . , yN separated by special delimiter markers like so: [CLS], x1, . . . , xM, [SEP], y1, . . . , yN, [SEP], such that M + N < S, where S is the maximum sequence length allowed during training (see the sketch after this list)
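
The following is a minimal sketch of how such a paired input could be assembled. The 512-token limit, the longest-first truncation strategy, and the use of segment ids are assumptions based on standard BERT conventions rather than details stated in this summary.

```python
# Minimal sketch of packing two token segments into a single BERT input.
# The 512-token limit and longest-first truncation are assumptions based on
# common BERT settings; they are not specified in the text above.

MAX_SEQ_LEN = 512  # S: maximum sequence length allowed during training

def pack_segments(x_tokens, y_tokens, max_seq_len=MAX_SEQ_LEN):
    """Build [CLS] x1..xM [SEP] y1..yN [SEP] with room left for the markers."""
    budget = max_seq_len - 3  # reserve space for [CLS] and two [SEP] markers
    x, y = list(x_tokens), list(y_tokens)
    # Longest-first truncation until the pair fits (an assumed strategy).
    while len(x) + len(y) > budget:
        if len(x) >= len(y):
            x.pop()
        else:
            y.pop()
    tokens = ["[CLS]"] + x + ["[SEP]"] + y + ["[SEP]"]
    segment_ids = [0] * (len(x) + 2) + [1] * (len(y) + 1)
    return tokens, segment_ids

# Example usage with toy word-piece tokens:
tokens, segment_ids = pack_segments(["what", "is", "bert", "?"],
                                    ["bert", "is", "a", "transformer", "encoder"])
print(tokens)
print(segment_ids)
```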


Summary

Introduction

State-of-the-art approaches for NLP tasks are based on language models that are pre-trained on tasks which do not require labeled data (Peters et al., 2018; Howard and Ruder, 2018; Devlin et al., 2018; Yang et al., 2019; Liu et al., 2019; Sun et al., 2019). Pre-trained transformer models do encode a substantial number of specific facts in their parameter matrices, enabling them to answer questions directly from the model itself (Radford et al., 2019). In MRC tasks, however, the model does not need to generate an answer it has encoded in its parameters; it can instead select the answer from a supporting passage. To better align the pre-training with the needs of the MRC task, we use span selection as an additional auxiliary task. This task is similar to the cloze task, but is designed to have fewer simple instances requiring only syntactic or collocation understanding. For cloze instances that require specific knowledge, rather than training the model to encode this knowledge in its parameterization, we provide a relevant, answer-bearing passage paired with the cloze instance.
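
As an illustration of the idea, the sketch below constructs a hypothetical span selection pre-training instance: the answer term in a cloze-style query is replaced by a blank marker and paired with a passage that contains the answer, so the training target is a span in the passage rather than a fact stored in the model's parameters. The [BLANK] token name, field names, and span-labeling scheme are illustrative assumptions, not the paper's exact data format.

```python
# Illustrative sketch of a span selection pre-training (SSPT) instance.
# The [BLANK] marker, field names, and span-labeling scheme are assumptions
# for illustration; the text above only states that the answer is selected
# from a relevant passage rather than recalled from model parameters.

def make_sspt_instance(cloze_sentence, answer, passage):
    """Replace the answer term in the cloze sentence with a blank marker and
    record where the answer appears in the paired passage."""
    query = cloze_sentence.replace(answer, "[BLANK]", 1)
    start = passage.find(answer)  # character offset of the answer span
    if start < 0:
        return None  # the passage must actually contain the answer
    return {
        "query": query,                          # cloze-style question
        "passage": passage,                      # relevant, answer-bearing text
        "answer_span": (start, start + len(answer)),
    }

instance = make_sspt_instance(
    cloze_sentence="The Transformer architecture was introduced in 2017.",
    answer="2017",
    passage="Vaswani et al. introduced the Transformer architecture in 2017 "
            "in the paper 'Attention Is All You Need'.",
)
print(instance["query"])        # ... was introduced in [BLANK].
print(instance["answer_span"])  # character offsets of '2017' in the passage
```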

Related Work
Background
Architecture and setup
Objective functions
Span Selection
Extended Pre-training
True Label
MRC Tasks
Natural Questions
Method
Experiments
HotpotQA
Exploration of SSPT Instance Types
Comparison to Previous Work
Findings
Conclusion and Future Work