Abstract

Because open-domain dialogues allow diverse responses, basic reference-based metrics such as BLEU do not work well unless we prepare a massive reference set of high-quality responses for input utterances. To reduce this burden, a human-aided, uncertainty-aware metric, ΔBLEU, has been proposed; it embeds human judgment on the quality of reference outputs into the computation of multiple-reference BLEU. In this study, we instead propose a fully automatic, uncertainty-aware evaluation method for open-domain dialogue systems, υBLEU. This method first collects diverse reference responses from massive dialogue data and then annotates them with quality judgments using a neural network trained on automatically collected training data. Experimental results on massive Twitter data confirmed that υBLEU is comparable to ΔBLEU in terms of its correlation with human judgment and that the state-of-the-art automatic evaluation method, RUBER, is improved by integrating υBLEU.
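To illustrate the idea shared by ΔBLEU and υBLEU, the sketch below computes a multiple-reference sentence BLEU in which each n-gram match is credited with the rating of the best-rated reference that contains it. This is a simplified, assumed formulation in Python, not the authors' implementation; the published metrics differ in details such as smoothing and the exact normalization, and the ratings are assumed to lie in [0, 1].

```python
# Quality-weighted multi-reference sentence BLEU: a minimal sketch in the
# spirit of ΔBLEU / υBLEU, not the paper's actual code.
from collections import Counter
from math import exp, log
from typing import List, Tuple

def ngrams(tokens: List[str], n: int) -> Counter:
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def weighted_bleu(hypothesis: List[str],
                  rated_refs: List[Tuple[List[str], float]],
                  max_n: int = 4) -> float:
    """Sentence-level BLEU where each matched n-gram is weighted by the
    rating of the best-rated reference containing it."""
    precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hypothesis, n)
        if not hyp_counts:
            precisions.append(1e-9)
            continue
        matched, total = 0.0, 0.0
        for gram, count in hyp_counts.items():
            total += count
            best = 0.0
            for ref, rating in rated_refs:
                ref_count = ngrams(ref, n)[gram]
                if ref_count:
                    # Clip by the reference count, then weight by its rating.
                    best = max(best, rating * min(count, ref_count))
            matched += best
        precisions.append(max(matched / total, 1e-9))
    # Brevity penalty against the closest reference length.
    closest = min((abs(len(r) - len(hypothesis)), len(r)) for r, _ in rated_refs)[1]
    bp = 1.0 if len(hypothesis) > closest else exp(1 - closest / max(len(hypothesis), 1))
    return bp * exp(sum(log(p) for p in precisions) / max_n)

# Example: two automatically retrieved references with NN-assigned ratings.
refs = [("that sounds like fun !".split(), 0.9),
        ("i have no idea".split(), 0.3)]
print(weighted_bleu("sounds like fun !".split(), refs))
```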

Highlights

  • There has been increasing interest in intelligent dialogue agents such as Apple Siri, Amazon Alexa, and Google Assistant

  • The major challenge in developing open-domain dialogue systems is that existing evaluation metrics for text generation tasks, such as BLEU (Papineni et al., 2002), correlate poorly with human judgment when evaluating responses generated by dialogue systems (Liu et al., 2016)

  • We used reference responses extended by the proposed method for υBLEU in the following evaluation


Summary

Introduction

There has been increasing interest in intelligent dialogue agents such as Apple Siri, Amazon Alexa, and Google Assistant. In open-domain dialogues, even though responses with various contents and styles are acceptable (Sato et al., 2017), only a few responses, or often only one, are available as reference responses in evaluation datasets made from actual conversations. It is therefore hard for reference-based metrics such as BLEU to account for this uncertainty without writing additional reference responses by hand (§ 2). The key idea behind ΔBLEU is to incorporate human judgments on reference responses of diverse quality into the BLEU computation. To remove the human intervention in ΔBLEU, we propose an automatic, uncertainty-aware evaluation metric, υBLEU. This metric exploits reference responses that are retrieved from massive dialogue logs and rated by a neural network trained on automatically collected training data (§ 4). We showed that integrating υBLEU into RUBER greatly improves RUBER's performance by making it robust to the uncertainty of acceptable responses.
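The reference-expansion step described above can be sketched as follows: collect candidate references from dialogue logs whose queries resemble the input utterance, then let a rater assign each candidate a quality weight usable by the weighted BLEU sketch above. Token-overlap similarity and the `rate` callable below are placeholders for the paper's retrieval method and trained NN-rater, not its actual components.

```python
# Collect and rate reference responses: a minimal sketch with placeholder
# retrieval (token overlap) and a placeholder rater interface.
from typing import Callable, List, Tuple

def collect_rated_references(
    utterance: List[str],
    dialogue_log: List[Tuple[List[str], List[str]]],  # (query, response) pairs
    rate: Callable[[List[str], List[str]], float],    # (utterance, response) -> [0, 1]
    top_k: int = 5,
) -> List[Tuple[List[str], float]]:
    def overlap(a: List[str], b: List[str]) -> float:
        sa, sb = set(a), set(b)
        return len(sa & sb) / max(len(sa | sb), 1)
    # Rank logged queries by similarity to the input utterance and keep the
    # responses attached to the closest ones as candidate references.
    ranked = sorted(dialogue_log, key=lambda qr: overlap(utterance, qr[0]), reverse=True)
    candidates = [response for _, response in ranked[:top_k]]
    # Attach a rater-assigned quality weight to every candidate reference.
    return [(ref, rate(utterance, ref)) for ref in candidates]

# Usage with a dummy rater standing in for the trained NN-rater:
log = [("do you want to play tennis ?".split(), "that sounds like fun !".split()),
       ("what time is it ?".split(), "i have no idea".split())]
dummy_rate = lambda utt, resp: round(len(set(utt) & set(resp)) / 10 + 0.3, 2)
print(collect_rated_references("shall we play tennis ?".split(), log, dummy_rate, top_k=2))
```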

Related work
Preliminaries
Experimental Settings
Twitter dialogue datasets
Target responses for evaluation
NN-rater to evaluate reference responses
Response retrieval and scoring
Compared response evaluation methods
Results
Conclusions
