Abstract

Because open-domain dialogues allow diverse responses, basic reference-based metrics such as BLEU do not work well unless we prepare a massive reference set of high-quality responses for input utterances. To reduce this burden, a human-aided, uncertainty-aware metric, ΔBLEU, has been proposed; it embeds human judgment on the quality of reference outputs into the computation of multiple-reference BLEU. In this study, we instead propose a fully automatic, uncertainty-aware evaluation method for open-domain dialogue systems, υBLEU. This method first collects diverse reference responses from massive dialogue data and then annotates them with quality judgments using a neural network trained on automatically collected training data. Experimental results on massive Twitter data confirmed that υBLEU is comparable to ΔBLEU in terms of its correlation with human judgment and that the state-of-the-art automatic evaluation method, RUBER, is improved by integrating υBLEU.
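To illustrate the idea shared by ΔBLEU and υBLEU, the sketch below computes a multiple-reference sentence BLEU in which each n-gram match is credited with the rating of the best-rated reference that contains it. This is a simplified, assumed formulation in Python, not the authors' implementation; the published metrics differ in details such as smoothing and the exact normalization, and the ratings are assumed to lie in [0, 1].

```python
# Quality-weighted multi-reference sentence BLEU: a minimal sketch in the
# spirit of ΔBLEU / υBLEU, not the paper's actual code.
from collections import Counter
from math import exp, log
from typing import List, Tuple

def ngrams(tokens: List[str], n: int) -> Counter:
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def weighted_bleu(hypothesis: List[str],
                  rated_refs: List[Tuple[List[str], float]],
                  max_n: int = 4) -> float:
    """Sentence-level BLEU where each matched n-gram is weighted by the
    rating of the best-rated reference containing it."""
    precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hypothesis, n)
        if not hyp_counts:
            precisions.append(1e-9)
            continue
        matched, total = 0.0, 0.0
        for gram, count in hyp_counts.items():
            total += count
            best = 0.0
            for ref, rating in rated_refs:
                ref_count = ngrams(ref, n)[gram]
                if ref_count:
                    # Clip by the reference count, then weight by its rating.
                    best = max(best, rating * min(count, ref_count))
            matched += best
        precisions.append(max(matched / total, 1e-9))
    # Brevity penalty against the closest reference length.
    closest = min((abs(len(r) - len(hypothesis)), len(r)) for r, _ in rated_refs)[1]
    bp = 1.0 if len(hypothesis) > closest else exp(1 - closest / max(len(hypothesis), 1))
    return bp * exp(sum(log(p) for p in precisions) / max_n)

# Example: two automatically retrieved references with NN-assigned ratings.
refs = [("that sounds like fun !".split(), 0.9),
        ("i have no idea".split(), 0.3)]
print(weighted_bleu("sounds like fun !".split(), refs))
```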

Highlights

  • There has been increasing interest in intelligent dialogue agents such as Apple Siri, Amazon Alexa, and Google Assistant

  • The major challenge in developing open-domain dialogue systems is that existing evaluation metrics for text generation tasks, such as BLEU (Papineni et al., 2002), correlate poorly with human judgment when evaluating responses generated by dialogue systems (Liu et al., 2016)

  • We used reference responses extended by the proposed method for υBLEU in the following evaluation


Summary

Introduction

There has been increasing interest in intelligent dialogue agents such as Apple Siri, Amazon Alexa, and Google Assistant. In open-domain dialogues, even though responses with various contents and styles are acceptable (Sato et al., 2017), only a few responses, or often only one, are available as reference responses in evaluation datasets made from actual conversations. It is therefore hard for reference-based metrics such as BLEU to account for this uncertainty without writing additional reference responses by hand (§ 2). The key idea behind ΔBLEU is to incorporate human judgments on reference responses of diverse quality into the BLEU computation. To remove the human intervention in ΔBLEU, we propose an automatic, uncertainty-aware evaluation metric, υBLEU. This metric exploits reference responses that are retrieved from massive dialogue logs and rated by a neural network trained on automatically collected training data (§ 4). We showed that integrating υBLEU into RUBER greatly improves RUBER's performance by making it robust to the uncertainty of acceptable responses.
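The reference-expansion step described above can be sketched as follows: collect candidate references from dialogue logs whose queries resemble the input utterance, then let a rater assign each candidate a quality weight usable by the weighted BLEU sketch above. Token-overlap similarity and the `rate` callable below are placeholders for the paper's retrieval method and trained NN-rater, not its actual components.

```python
# Collect and rate reference responses: a minimal sketch with placeholder
# retrieval (token overlap) and a placeholder rater interface.
from typing import Callable, List, Tuple

def collect_rated_references(
    utterance: List[str],
    dialogue_log: List[Tuple[List[str], List[str]]],  # (query, response) pairs
    rate: Callable[[List[str], List[str]], float],    # (utterance, response) -> [0, 1]
    top_k: int = 5,
) -> List[Tuple[List[str], float]]:
    def overlap(a: List[str], b: List[str]) -> float:
        sa, sb = set(a), set(b)
        return len(sa & sb) / max(len(sa | sb), 1)
    # Rank logged queries by similarity to the input utterance and keep the
    # responses attached to the closest ones as candidate references.
    ranked = sorted(dialogue_log, key=lambda qr: overlap(utterance, qr[0]), reverse=True)
    candidates = [response for _, response in ranked[:top_k]]
    # Attach a rater-assigned quality weight to every candidate reference.
    return [(ref, rate(utterance, ref)) for ref in candidates]

# Usage with a dummy rater standing in for the trained NN-rater:
log = [("do you want to play tennis ?".split(), "that sounds like fun !".split()),
       ("what time is it ?".split(), "i have no idea".split())]
dummy_rate = lambda utt, resp: round(len(set(utt) & set(resp)) / 10 + 0.3, 2)
print(collect_rated_references("shall we play tennis ?".split(), log, dummy_rate, top_k=2))
```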

Related work
Preliminaries
Experimental Settings
Twitter dialogue datasets
Target responses for evaluation
NN-rater to evaluate reference responses
Response retrieval and scoring
Compared response evaluation methods
Results
Conclusions
