Abstract

Recent work has shown that pre-trained language models such as BERT improve robustness to spurious correlations in the dataset. Intrigued by these results, we find that the key to their success is generalization from a small number of counterexamples where the spurious correlations do not hold. When such minority examples are scarce, pre-trained models perform as poorly as models trained from scratch. In the case of extreme minority, we propose to use multi-task learning (MTL) to improve generalization. Our experiments on natural language inference and paraphrase identification show that MTL with the right auxiliary tasks significantly improves performance on challenging examples without hurting the in-distribution performance. Further, we show that the gain from MTL mainly comes from improved generalization from the minority examples. Our results highlight the importance of data diversity for overcoming spurious correlations.

Highlights

  • A key challenge in building robust NLP models is the gap between limited linguistic variations in the training data and the diversity in real-world languages

  • In the case of extreme minority, we empirically show that multi-task learning (MTL) improves robust accuracy by improving generalization from the minority examples, even though previous work has suggested that MTL has limited advantage in i.i.d. settings (Søgaard and Goldberg, 2016; Hashimoto et al., 2017); a minimal sketch of such a multi-task setup follows this list

  • We focus on two natural language understanding tasks, natural language inference (NLI) and paraphrase identification (PI)
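
The MTL approach highlighted above can be pictured as a shared pre-trained encoder feeding separate task-specific classification heads. The sketch below is a minimal illustration of that setup; the encoder name, head sizes, label ids, and training details are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of multi-task learning with a shared pre-trained encoder.
# The specific encoder, auxiliary task, and hyperparameters are assumptions
# for illustration, not the configuration used in the paper.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class MultiTaskModel(nn.Module):
    def __init__(self, encoder_name="bert-base-uncased",
                 num_main_labels=3, num_aux_labels=2):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        self.main_head = nn.Linear(hidden, num_main_labels)  # e.g. NLI labels
        self.aux_head = nn.Linear(hidden, num_aux_labels)    # e.g. paraphrase identification

    def forward(self, input_ids, attention_mask, task="main"):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]  # [CLS] representation shared by both heads
        return self.main_head(cls) if task == "main" else self.aux_head(cls)

# One illustrative training step on the main task; in MTL, batches from the
# main and auxiliary tasks would be interleaved while updating the same encoder.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = MultiTaskModel()
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss_fn = nn.CrossEntropyLoss()

batch = tokenizer(["A man is sleeping."], ["A man is awake."],
                  return_tensors="pt", padding=True, truncation=True)
labels = torch.tensor([2])  # hypothetical label id, e.g. "contradiction"
loss = loss_fn(model(batch["input_ids"], batch["attention_mask"], task="main"), labels)
loss.backward()
optimizer.step()
```

Because both heads read the same [CLS] representation, gradients from the auxiliary task shape the shared encoder, which is the mechanism the highlights credit for better generalization from the minority examples.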

Introduction

A key challenge in building robust NLP models is the gap between the limited linguistic variation in the training data and the diversity of real-world language. In natural language inference (NLI), previous work has found that models trained on standard benchmarks achieve high accuracy by associating high word overlap between the premise and the hypothesis with entailment (Dasgupta et al., 2018; McCoy et al., 2019). These models perform poorly on so-called challenging or adversarial datasets, where such correlations no longer hold (Glockner et al., 2018; McCoy et al., 2019; Nie et al., 2019; Zhang et al., 2019). Recent empirical results suggest that self-supervised pre-training improves robust accuracy without using any task-specific knowledge or sacrificing in-distribution accuracy (Hendrycks et al., 2019, 2020).
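
To make the lexical-overlap failure mode concrete, the toy snippet below implements the heuristic that high premise-hypothesis word overlap implies entailment and shows it misfiring on a HANS-style counterexample; the sentences and threshold are illustrative, not drawn from the paper's data.

```python
# Toy illustration of the word-overlap heuristic: predict "entailment"
# whenever most hypothesis tokens also appear in the premise.
def word_overlap(premise: str, hypothesis: str) -> float:
    p, h = set(premise.lower().split()), set(hypothesis.lower().split())
    return len(p & h) / len(h)

def heuristic_predict(premise: str, hypothesis: str, threshold: float = 0.9) -> str:
    return "entailment" if word_overlap(premise, hypothesis) >= threshold else "non-entailment"

# Full word overlap, yet the hypothesis is not entailed (the roles are swapped):
premise = "the doctor visited the lawyer"
hypothesis = "the lawyer visited the doctor"
print(heuristic_predict(premise, hypothesis))  # -> "entailment", which is wrong
```

A model that latches onto this correlation does well on benchmarks where it usually holds, but fails on challenge sets such as HANS (McCoy et al., 2019), which are constructed from exactly these counterexamples.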
