Abstract

As the complexity of question answering (QA) datasets evolves, moving away from restricted formats like span extraction and multiple-choice (MC) towards free-form answer generation, it is imperative to understand how well current metrics perform in evaluating QA. This is especially important as existing metrics (BLEU, ROUGE, METEOR, and F1) are computed using n-gram similarity and have a number of well-known drawbacks. In this work, we study the suitability of existing metrics in QA. For generative QA, we show that while current metrics do well on existing datasets, converting multiple-choice datasets into free-response datasets is challenging for current metrics. We also look at span-based QA, where F1 is a reasonable metric, and show that F1 may not be suitable for all extractive QA tasks depending on the answer types. Our study suggests that while current metrics may be suitable for existing QA datasets, they limit the complexity of QA datasets that can be created. This is especially true for free-form QA, where we would like our models to generate more complex and abstractive answers, necessitating new metrics that go beyond n-gram based matching. As a step towards a better QA metric, we explore using BERTScore, a recently proposed metric for evaluating translation, for QA. We find that although it fails to provide a stronger correlation with human judgements, future work focused on tailoring a BERT-based metric to QA evaluation may prove fruitful.
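As a concrete reference point for the n-gram style metrics discussed above, the snippet below is a minimal sketch of the token-overlap F1 commonly used for extractive (SQuAD-style) QA. It is illustrative only: official evaluation scripts additionally normalize answers (stripping punctuation and articles), which is omitted here, and the example answers are invented.

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Bag-of-tokens F1 between a predicted and a gold answer (simplified)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# A paraphrased answer with little lexical overlap scores poorly,
# even if a human would judge it correct.
print(token_f1("the fourth of July", "Independence Day"))  # 0.0
print(token_f1("July 4th", "the fourth of July"))          # ~0.33 despite being right
```

This lexical-overlap behaviour is exactly the kind of drawback that becomes more problematic as answers grow more free-form and abstractive.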

Highlights

  • Question answering (QA) has emerged as a burgeoning research field driven by the availability of large datasets

  • For the SemEval dataset, which we converted from a multiple-choice dataset into a generative question answering (QA) dataset, we find that existing metrics perform considerably worse than they do on NarrativeQA

  • We present a systematic study of existing n-gram based metrics by comparing their correlation to human accuracy judgements on three QA datasets
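The comparison described in the last highlight reduces to measuring how well automatic metric scores track human accuracy judgements across answers. Below is a minimal sketch of that computation using SciPy's `pearsonr` and `spearmanr`; the per-answer score lists are made up purely for illustration and are not the paper's data.

```python
from scipy.stats import pearsonr, spearmanr

# Hypothetical per-answer scores: an automatic-metric score and a
# human accuracy judgement for each of five system answers.
metric_scores = [0.82, 0.15, 0.67, 0.40, 0.95]
human_judgements = [1.0, 0.0, 1.0, 0.0, 1.0]

r, _ = pearsonr(metric_scores, human_judgements)
rho, _ = spearmanr(metric_scores, human_judgements)
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```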


Summary

Introduction

Question answering (QA) has emerged as a burgeoning research field driven by the availability of large datasets. Despite the value of metrics as drivers of research, a comprehensive study of QA metrics across a number of datasets has yet to be completed. This is important because present metrics are based on n-gram matching, which has a number of shortcomings (Figure 1). For the generative NarrativeQA dataset, we find that existing metrics provide reasonable correlation with human accuracy judgements while still leaving considerable room for improvement. On the SemEval dataset, which we convert from multiple choice to free-form answer generation, existing n-gram based metrics perform considerably worse than on NarrativeQA. These results signify that as QA systems are expected to perform more free-form answer generation, new metrics will be required. BERTScore computes a score by leveraging contextualized word representations, allowing it to go beyond exact match and better capture paraphrases. However, we find that it falls behind existing metrics on all three datasets. Our results indicate that the evaluation of QA is an under-researched area with substantial room for further experimentation.
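To make the BERTScore point concrete, the snippet below is a minimal sketch of scoring a candidate answer against a reference with the open-source `bert-score` package. The package and its `score` function are the public library, not the paper's own code, and the example strings are invented.

```python
# pip install bert-score
from bert_score import score

candidates = ["He moved to Paris to study painting."]
references = ["He relocated to Paris so that he could study art."]

# Returns precision, recall, and F1 tensors, one entry per candidate.
# Token similarities come from contextual embeddings rather than exact
# n-gram matches, so paraphrases can still receive credit.
P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"BERTScore F1: {F1.mean().item():.3f}")
```

Whether such embedding-based similarity actually tracks human accuracy judgements on QA answers is precisely what the paper's correlation study examines.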

Metrics
Datasets
Models
Collecting Human Judgements
Correlation with Human Judgements
Discussion
Related Work
Conclusion