Abstract

Recent works have shown that language models (LMs) capture different types of knowledge regarding facts or common sense. However, because no model is perfect, they still fail to provide appropriate answers in many cases. In this paper, we ask the question, “How can we know when language models know, with confidence, the answer to a particular query?” We examine this question from the point of view of calibration, the property that a probabilistic model’s predicted probabilities are actually well correlated with the probabilities of correctness. We examine three strong generative models—T5, BART, and GPT-2—and study whether their probabilities on QA tasks are well calibrated, finding the answer is a relatively emphatic no. We then examine methods to calibrate such models so that their confidence scores correlate better with the likelihood of correctness, through fine-tuning, post-hoc probability modification, or adjustment of the predicted outputs or inputs. Experiments on a diverse range of datasets demonstrate the effectiveness of our methods. We also perform analysis to study the strengths and limitations of these methods, shedding light on further improvements that may be made in methods for calibrating LMs. We have released the code at https://github.com/jzbjyb/lm-calibration.
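The post-hoc probability modification mentioned in the abstract can take several forms. Purely as an illustration (not necessarily the exact method used in the paper), a widely used post-hoc technique is temperature scaling, which divides a model's logits by a single temperature T fitted on held-out data to minimize negative log-likelihood. A minimal Python/PyTorch sketch, where the validation logits, labels, and optimizer settings are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def fit_temperature(val_logits: torch.Tensor, val_labels: torch.Tensor) -> float:
    """Fit a single temperature T > 0 on held-out data by minimizing NLL.

    val_logits: [N, num_candidates] raw (pre-softmax) scores, detached from the model.
    val_labels: [N] indices of the correct candidate answers.
    """
    log_t = torch.zeros(1, requires_grad=True)  # optimize log T so that T stays positive
    optimizer = torch.optim.LBFGS([log_t], lr=0.1, max_iter=50)

    def closure():
        optimizer.zero_grad()
        loss = F.cross_entropy(val_logits / log_t.exp(), val_labels)
        loss.backward()
        return loss

    optimizer.step(closure)
    return log_t.exp().item()

# At test time, divide logits by the fitted temperature before the softmax:
# calibrated_probs = F.softmax(test_logits / T, dim=-1)
```

Because dividing by T does not change the argmax, temperature scaling leaves accuracy untouched and only reshapes the confidence distribution, which is why it is a common post-hoc baseline for calibration.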

Highlights

  • Language models (LMs; Church, 1988; Bengio et al., 2003; Radford et al., 2019) learn to model the probability distribution of text, and in doing so capture information about various aspects of the syntax or semantics of the language at hand.

  • Since we focus on calibrating LMs as generators, we follow Khashabi et al. (2020) in converting question answering (QA) datasets of different formats into a unified sequence-to-sequence format that takes a question $X$ as input and calculates the probability of a continuation $Y$ that corresponds to the answer, $P_{\mathrm{LM}}(Y \mid X) = \prod_{j=1}^{|Y|} P_{\mathrm{LM}}(y_j \mid X, y_{<j})$ (a scoring sketch follows this list).

  • In addition to standard calibration methods that are applicable to most prediction models, we examine several methods that are specific to the use of LMs for the task of QA.
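To make the sequence-to-sequence formulation above concrete, the sketch below scores a candidate answer Y given a question X with a generative LM by summing the token-level log-probabilities of Y conditioned on X. The Hugging Face checkpoint, the helper name, and the use of an unnormalized sum of log-probabilities are illustrative assumptions rather than details taken from the paper:

```python
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("t5-small")          # example checkpoint (assumption)
model = T5ForConditionalGeneration.from_pretrained("t5-small")
model.eval()

def answer_log_prob(question: str, answer: str) -> float:
    """Return log P(Y | X) = sum_j log P(y_j | X, y_<j) under the seq2seq LM."""
    inputs = tokenizer(question, return_tensors="pt")
    labels = tokenizer(answer, return_tensors="pt").input_ids   # gold answer tokens (with EOS)
    with torch.no_grad():
        logits = model(input_ids=inputs.input_ids,
                       attention_mask=inputs.attention_mask,
                       labels=labels).logits                    # [1, |Y|, vocab]
    log_probs = torch.log_softmax(logits, dim=-1)
    # Gather the log-probability assigned to each gold answer token.
    token_lp = log_probs.gather(-1, labels.unsqueeze(-1)).squeeze(-1)
    return token_lp.sum().item()

candidates = ["Paris", "Lyon", "Marseille"]
scores = {c: answer_log_prob("What is the capital of France?", c) for c in candidates}
print(scores)
```

Normalizing such scores over a candidate set (for example with a softmax) yields the confidence estimates whose calibration is the subject of the paper.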

Summary

Introduction

Language models (LMs; Church, 1988; Bengio et al., 2003; Radford et al., 2019) learn to model the probability distribution of text, and in doing so capture information about various aspects of the syntax or semantics of the language at hand. However, high performance on datasets probing factual or numerical knowledge might be achieved by modeling superficial signals in the training data that do not generalize to unseen test cases (Poerner et al., 2019; Zhou et al., 2020; Wallace et al., 2019; Talmor et al., 2019a). If such models are to be deployed in real applications, it is of crucial importance to determine the confidence with which they can provide an answer. We study this through the lens of calibration: a model is well calibrated if the confidence it assigns to a prediction matches the true probability that the prediction is correct. We approximate this probability by bucketing predictions into M disjoint, equally sized interval bins based on confidence. Guo et al. (2017) examined the calibration properties of neural network classifiers using a widely adopted measure of calibration called expected calibration error (ECE), a weighted average of the discrepancy between each bucket’s accuracy and confidence: $\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n} \left| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \right|$, where $B_m$ is the set of predictions whose confidence falls in the m-th bin and $n$ is the total number of predictions.
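As a concrete reference for the ECE definition above, the following sketch computes expected calibration error from per-example confidences and 0/1 correctness indicators using M equally sized bins. The function name, default bin count, and binning convention (right-inclusive edges) are illustrative assumptions:

```python
import numpy as np

def expected_calibration_error(confidences, correct, num_bins: int = 10) -> float:
    """ECE = sum_m (|B_m| / n) * |acc(B_m) - conf(B_m)|.

    confidences: predicted probabilities in [0, 1], one per prediction.
    correct:     1 if the corresponding prediction was right, else 0.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    n = len(confidences)
    edges = np.linspace(0.0, 1.0, num_bins + 1)

    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)  # right-inclusive bins
        if in_bin.any():
            acc = correct[in_bin].mean()         # acc(B_m): accuracy inside the bin
            conf = confidences[in_bin].mean()    # conf(B_m): mean confidence inside the bin
            ece += (in_bin.sum() / n) * abs(acc - conf)
    return float(ece)

# Fully confident (p = 1.0) but only 50% accurate -> ECE = 0.5 (badly calibrated)
print(expected_calibration_error([1.0, 1.0], [1, 0]))
```

A perfectly calibrated model would have every bin's average confidence equal to its accuracy, giving an ECE of zero.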

