The Dialog State Tracking Challenge Series: A Review

Jason D Williams,Matthew Henderson,Antoine Raux

doi:10.5087/dad.2016.301

Open Access

Abstract

In a spoken dialog system, dialog state tracking refers to the task of correctly inferring the state of the conversation -- such as the user's goal -- given all of the dialog history up to that turn. Dialog state tracking is crucial to the success of a dialog system, yet until recently there were no common resources, hampering progress. The Dialog State Tracking Challenge series of 3 tasks introduced the first shared testbed and evaluation metrics for dialog state tracking, and has underpinned three key advances in dialog state tracking: the move from generative to discriminative models; the adoption of discriminative sequential techniques; and the incorporation of the speech recognition results directly into the dialog state tracker. This paper reviews this research area, covering both the challenge tasks themselves and summarizing the work they have enabled.

Highlights

Conversational systems are increasingly becoming a part of daily life, with examples including Apple’s Siri, Google Now, Nuance Dragon Go, Xbox and Cortana from Microsoft, and numerous start-ups
This spoken language understanding (SLU) result is passed to the dialog state tracker (DST) which updates its estimate of the dialog state
This tracker considers the items on the SLU N-best list; it is an “oracle” in the sense that, if a slot/value pair appears that corresponds to the user’s goal, it is added to the state with confidence 1.0
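
The oracle tracker described in the last highlight can be sketched as follows. This is a minimal illustration, not the challenge's reference implementation: the function name `oracle_track`, the N-best representation (a list of `(hypothesis, score)` pairs, each hypothesis a list of slot/value pairs), and the `true_goal` dictionary are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

SlotValue = Tuple[str, str]                      # e.g. ("food", "thai")
NBest = List[Tuple[List[SlotValue], float]]      # [(hypothesis, SLU score), ...]
State = Dict[str, Tuple[str, float]]             # slot -> (value, confidence)


def oracle_track(nbest_per_turn: List[NBest], true_goal: Dict[str, str]) -> List[State]:
    """Return the tracked state after each turn (hypothetical oracle sketch).

    A slot/value pair enters the state with confidence 1.0 as soon as it
    appears anywhere on a turn's N-best list AND matches the user's true goal.
    SLU scores are deliberately ignored: the oracle only needs the pair to appear.
    """
    state: State = {}
    history: List[State] = []
    for nbest in nbest_per_turn:
        for hypothesis, _score in nbest:
            for slot, value in hypothesis:
                if true_goal.get(slot) == value:
                    state[slot] = (value, 1.0)
        history.append(dict(state))  # snapshot so later turns do not mutate it
    return history


# Illustrative data only: the correct food value is ranked second in turn 1,
# and the correct area never appears on any N-best list.
turns = [
    [([("food", "thai")], 0.6), ([("food", "chinese")], 0.3)],
    [([("area", "north")], 0.2), ([("area", "south")], 0.1)],
]
goal = {"food": "chinese", "area": "east"}
history = oracle_track(turns, goal)
```

In this toy run the oracle recovers `food` even though it was not the top SLU hypothesis, but it can never recover `area`, because the correct value was never recognized. That is why such a tracker is useful as an upper bound for trackers restricted to values on the N-best list.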

Summary

Introduction

Conversational systems are increasingly becoming a part of daily life, with examples including Apple’s Siri, Google Now, Nuance Dragon Go, Xbox and Cortana from Microsoft, and numerous start-ups. In each turn of such a system, the user produces an utterance as audio. Automatic speech recognition (ASR) converts this audio into words in text form. The words in an utterance are converted to a meaning representation using spoken language understanding (SLU). This SLU result is passed to the dialog state tracker (DST), which updates its estimate of the dialog state. This new dialog state is passed to the dialog policy, which decides which action to take. Natural language generation (NLG) and text-to-speech (TTS) convert this action into words and then into audio.
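
The pipeline described above can be sketched as a chain of function calls. Every component here is a hypothetical toy stub (the "ASR" simply decodes bytes as text, the "SLU" is keyword spotting over an invented lexicon, and the tracker overwrites slots without any probabilistic reasoning); only the order of the stages comes from the text.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Invented toy lexicon: token -> (slot, value).
LEXICON = {"thai": ("food", "thai"), "north": ("area", "north")}


@dataclass
class DialogState:
    # slot -> (value, confidence)
    goals: Dict[str, Tuple[str, float]] = field(default_factory=dict)


def asr(audio: bytes) -> str:
    """Stand-in for speech recognition: treats the 'audio' as UTF-8 text."""
    return audio.decode("utf-8")


def slu(words: str) -> List[Tuple[str, str, float]]:
    """Stand-in for language understanding: keyword spotting with a fixed score."""
    results = []
    for token in words.lower().split():
        if token in LEXICON:
            slot, value = LEXICON[token]
            results.append((slot, value, 0.9))
    return results


def track(state: DialogState, slu_result: List[Tuple[str, str, float]]) -> DialogState:
    """Stand-in for the dialog state tracker: latest observation wins."""
    goals = dict(state.goals)
    for slot, value, confidence in slu_result:
        goals[slot] = (value, confidence)
    return DialogState(goals)


def policy(state: DialogState) -> Tuple[str, str]:
    """Stand-in for the dialog policy: request the first missing slot."""
    for slot in ("food", "area"):
        if slot not in state.goals:
            return ("request", slot)
    return ("offer", "")


def nlg(action: Tuple[str, str]) -> str:
    """Stand-in for language generation: one template per action type."""
    act, slot = action
    if act == "request":
        return f"What {slot} would you like?"
    return "Let me find a matching restaurant."


def tts(text: str) -> bytes:
    """Stand-in for speech synthesis: encodes the text as bytes."""
    return text.encode("utf-8")


def run_turn(state: DialogState, audio: bytes) -> Tuple[DialogState, bytes]:
    """One turn: ASR -> SLU -> DST -> policy -> NLG -> TTS."""
    new_state = track(state, slu(asr(audio)))
    return new_state, tts(nlg(policy(new_state)))


state, reply1 = run_turn(DialogState(), b"I want thai food")
state, reply2 = run_turn(state, b"somewhere in the north")
```

The point of the sketch is the data flow: the tracker is the only stage that carries information across turns, which is why its estimate of the state matters so much to the policy downstream.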
