Abstract

Because open-domain dialogues allow diverse responses, common reference-based metrics for text generation, such as BLEU, do not correlate well with human judgments unless an extensive set of high-quality reference responses is prepared for each input utterance. In this study, we propose υBLEU, a fully automatic, uncertainty-aware evaluation method for open-domain dialogue systems. Our method first collects diverse reference responses from massive dialogue data, annotates them with quality judgments using a neural network trained on automatically collected training data, and then computes a weighted BLEU score over the automatically retrieved and rated reference responses. We also apply the method with an embedding-based metric, BERTScore, in place of the word-overlap-based BLEU, to absorb surface variations among the reference responses. Experimental results on the meta-evaluation of dialogue systems built on massive Twitter data confirm that our method substantially improves the correlation between BLEU (or BERTScore) and human judgments. We also confirm that our method is effective when combined with a reference-free metric.
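To make the weighted-BLEU step concrete, below is a minimal sketch of how automatically rated references might be combined into a single score. It assumes a simple rating-weighted average of per-reference sentence-level BLEU scores; the function name `weighted_bleu`, the rating scale in [0, 1], and this particular aggregation are illustrative assumptions, not the paper's exact υBLEU formulation.

```python
# Illustrative sketch only: combines per-reference sentence BLEU scores
# using quality ratings as weights. Not the paper's exact formula.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction


def weighted_bleu(hypothesis_tokens, rated_references):
    """hypothesis_tokens: list of tokens.
    rated_references: list of (reference_tokens, rating) pairs, rating in [0, 1]."""
    smooth = SmoothingFunction().method1  # avoid zero scores on short responses
    total_weight = sum(rating for _, rating in rated_references)
    if total_weight == 0.0:
        return 0.0
    score = 0.0
    for ref_tokens, rating in rated_references:
        score += rating * sentence_bleu(
            [ref_tokens], hypothesis_tokens, smoothing_function=smooth
        )
    return score / total_weight


# Toy usage: retrieved references with (hypothetical) automatic quality ratings.
refs = [
    ("that sounds great , see you then !".split(), 0.9),
    ("sure , i would love to".split(), 0.7),
    ("no idea what you mean".split(), 0.1),
]
print(weighted_bleu("sounds great , see you there !".split(), refs))
```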

