Abstract

Recent work on transformer-based neural networks has led to impressive advances on multiple-choice natural language processing (NLP) problems, such as Question Answering (QA) and abductive reasoning. Despite these advances, there is still limited work on systematically evaluating such models in ambiguous situations where (for example) no correct answer exists for a given prompt among the provided set of choices. Such ambiguous situations are not uncommon in real-world applications. We design and conduct an experimental study of this phenomenon using three probes that aim to 'confuse' the model by perturbing QA instances in a consistent and well-defined manner. Using a detailed set of results based on a well-known transformer-based multiple-choice QA system evaluated on two established benchmark datasets, we show that the model's confidence in its results differs markedly from what would be expected of a model that is 'agnostic' to all incorrect choices. Our results suggest that high performance on idealized QA instances should not be used to infer or extrapolate similarly high performance on more ambiguous instances. Auxiliary results suggest that the model may not be able to distinguish between these two situations with sufficient certainty. Stronger testing protocols and benchmarking may hence be necessary before such models are deployed in user-facing systems or in ambiguous decision-making settings with significant human impact.
