Abstract

Standard test sets for supervised learning evaluate in-distribution generalization. Unfortunately, when a dataset has systematic gaps (e.g., annotation artifacts), these evaluations are misleading: a model can learn simple decision rules that perform well on the test set but do not capture a dataset's intended capabilities. We propose a new annotation paradigm for NLP that helps to close systematic gaps in the test data. In particular, after a dataset is constructed, we recommend that the dataset authors manually perturb the test instances in small but meaningful ways that (typically) change the gold label, creating contrast sets. Contrast sets provide a local view of a model's decision boundary, which can be used to more accurately evaluate a model's true linguistic capabilities. We demonstrate the efficacy of contrast sets by creating them for 10 diverse NLP datasets (e.g., DROP reading comprehension, UD parsing, IMDb sentiment analysis). Although our contrast sets are not explicitly adversarial, model performance is significantly lower on them than on the original test sets (up to 25% in some cases). We release our contrast sets as new evaluation benchmarks and encourage future dataset construction efforts to follow similar annotation processes.

Highlights

  • Progress in natural language processing (NLP) has long been measured with standard benchmark datasets (e.g., Marcus et al., 1993)

  • Contrast sets fill in systematic gaps in the test set

  • We propose that dataset authors manually perturb instances from their test set, creating contrast sets, which characterize the correct decision boundary near the test instances (Section 2)


Summary

Introduction

Progress in natural language processing (NLP) has long been measured with standard benchmark datasets (e.g., Marcus et al., 1993). These benchmarks help to provide a uniform evaluation of new modeling developments.

[Figure: example perturbations for an NLVR2 instance. Original sentence: "Two similarly-colored and similarly-posed chow dogs are face to face in one image." Textual perturbations change one detail at a time: "Two similarly-colored and similarly-posed cats are face to face in one image," "Three similarly-colored and similarly-posed chow dogs are face to face in one image," and "Two differently-colored but similarly-posed chow dogs are face to face in one image." An image perturbation pairs the original sentence with a different image.]
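The paper's evaluation on these sets is stricter than standard per-instance accuracy: it reports contrast consistency, the fraction of contrast sets on which a model answers every instance (the original plus all of its perturbations) correctly. Below is a minimal Python sketch of that computation; model_predict, keyword_model, and the list-of-lists data layout are illustrative assumptions, not the paper's released code.

def contrast_consistency(contrast_sets, model_predict):
    # contrast_sets: list of lists of (text, gold_label) pairs, where each
    # inner list holds an original test instance plus its perturbations.
    consistent = 0
    for contrast_set in contrast_sets:
        # The model must be correct on every instance in the set to count.
        if all(model_predict(text) == gold for text, gold in contrast_set):
            consistent += 1
    return consistent / len(contrast_sets)

# A toy sentiment "model" that keys on a single word: exactly the kind of
# shortcut decision rule that contrast sets are designed to expose.
def keyword_model(text):
    return "positive" if "great" in text else "negative"

sets = [[
    ("A great movie with a great cast.", "positive"),
    ("A great cast cannot save this movie.", "negative"),  # perturbed: label flips
]]
print(contrast_consistency(sets, keyword_model))  # 0.0, fails on the perturbation

A model that answers the original instance correctly for the wrong reason scores zero on the whole set, which is what makes the metric a probe of the local decision boundary rather than of in-distribution accuracy.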


