Abstract

Ryan Lowe, Michael Noseworthy, Iulian Vlad Serban, Nicolas Angelard-Gontier, Yoshua Bengio, Joelle Pineau. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2017.

Highlights

  • Building systems that can naturally and meaningfully converse with humans has been a central goal of artificial intelligence since the formulation of the Turing test (Turing, 1950)

  • We show that an automatic dialogue evaluation model (ADEM) can often generalize to evaluating new models, whose responses were unseen during training, making ADEM a strong first step towards effective automatic dialogue response evaluation

  • Such models are necessary even for creating a test set in a new domain, which will help us determine if ADEM generalizes to related dialogue domains
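As a rough illustration of the idea behind ADEM (a hedged sketch, not the paper's exact architecture), an evaluator of this kind encodes the dialogue context, the reference response, and the model response into vectors and scores the model response by learned similarity with both. The `encode` function, the matrices `M` and `N`, and the shift/scale constants below are all stand-ins; in ADEM the encodings come from an RNN and the parameters are trained to match human quality scores:

```python
import hashlib
import numpy as np

DIM = 8  # toy embedding dimension (hypothetical)

def encode(text, dim=DIM):
    """Stand-in for a learned RNN encoder: a deterministic pseudo-embedding."""
    seed = int.from_bytes(hashlib.md5(text.encode()).digest()[:4], "little")
    return np.random.default_rng(seed).standard_normal(dim)

# Parameters that ADEM would learn from human-annotated scores; random here.
rng = np.random.default_rng(0)
M = rng.standard_normal((DIM, DIM))  # context-response interaction
N = rng.standard_normal((DIM, DIM))  # reference-response interaction
alpha, beta = 0.0, 1.0               # shift/scale into the human score range

def adem_score(context, reference, response):
    """Bilinear similarity of the model response with context and reference."""
    c, r, r_hat = encode(context), encode(reference), encode(response)
    return float(c @ M @ r_hat + r @ N @ r_hat - alpha) / beta

score = adem_score("How are you?", "I'm fine, thanks.", "Doing well, you?")
print(score)
```

Because the score is a learned function of all three inputs, it can reward responses that are appropriate to the context even when they share few words with the reference, which is where word-overlap metrics like BLEU fail.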


Summary

Introduction

Building systems that can naturally and meaningfully converse with humans has been a central goal of artificial intelligence since the formulation of the Turing test (Turing, 1950). There has been a surge of interest in building large-scale non-task-oriented dialogue systems using neural networks (Sordoni et al., 2015b; Shang et al., 2015; Vinyals and Le, 2015; Serban et al., 2016a; Li et al., 2015). These models are trained in an end-to-end manner to optimize a single objective, usually the likelihood of generating the responses from a fixed corpus. Such models have already had a substantial impact in industry, including Google's Smart Reply system (Kannan et al., 2016) and Microsoft's Xiaoice chatbot (Markoff and Mozur, 2015), which has over 20 million users.
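Concretely, the single objective such end-to-end models optimize is typically the negative log-likelihood of the corpus responses given their contexts. A minimal sketch of that objective, using a toy unigram stand-in for the response model (real systems use RNN encoder-decoders, and the corpus here is invented for illustration):

```python
import math
from collections import Counter

# Toy corpus of (context, response) pairs (hypothetical data).
corpus = [("hi", "hello there"), ("how are you", "i am fine")]

# Unigram stand-in for p(token | context): plain corpus token frequencies.
counts = Counter(tok for _, resp in corpus for tok in resp.split())
total = sum(counts.values())

def neg_log_likelihood(response):
    """Sum of -log p(token) over the response tokens; lower is better."""
    return sum(-math.log(counts[tok] / total) for tok in response.split())

# Training adjusts model parameters to minimize this loss over the corpus.
loss = sum(neg_log_likelihood(resp) for _, resp in corpus)
print(round(loss, 3))
```

Minimizing this likelihood says nothing about whether a generated response is appropriate or interesting, which is why evaluating such models remains an open problem and motivates learned evaluators like ADEM.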

