Abstract

Dialog policy decides what and how a task-oriented dialog system will respond, and plays a vital role in delivering effective conversations. Many studies apply Reinforcement Learning to learn a dialog policy with a reward function that requires elaborate design and pre-specified user goals. With the growing need to handle complex goals across multiple domains, such manually designed reward functions cannot cope with the complexity of real-world tasks. To this end, we propose Guided Dialog Policy Learning, a novel algorithm based on Adversarial Inverse Reinforcement Learning for joint reward estimation and policy optimization in multi-domain task-oriented dialog. The proposed approach estimates the reward signal and infers the user goal from dialog sessions. The reward estimator evaluates state-action pairs so that it can guide the dialog policy at each dialog turn. Extensive experiments on a multi-domain dialog dataset show that the dialog policy guided by the learned reward function achieves remarkably higher task success than state-of-the-art baselines.
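For intuition, the sketch below illustrates the adversarial-IRL-style turn-level reward the abstract describes: a discriminator of the form D(s,a) = exp(f(s,a)) / (exp(f(s,a)) + pi(a|s)) induces a per-turn reward r(s,a) = f(s,a) - log pi(a|s). This is a minimal sketch under standard AIRL assumptions; the function and variable names are illustrative, not the paper's implementation.

    # Minimal sketch of an AIRL-style turn-level reward (illustrative only).
    # f_logit is the output of a learned network f(s, a); log_pi_a is the
    # log-probability the current dialog policy assigns to action a in state s.
    import numpy as np

    def discriminator_prob(f_logit, log_pi_a):
        # D(s, a) = exp(f) / (exp(f) + pi(a|s)), computed stably as a sigmoid.
        return 1.0 / (1.0 + np.exp(-(f_logit - log_pi_a)))

    def airl_reward(f_logit, log_pi_a):
        # r(s, a) = log D - log(1 - D) = f(s, a) - log pi(a|s),
        # a dense reward the policy can receive at every dialog turn.
        return f_logit - log_pi_a

The dialog policy can then be optimized with any standard RL algorithm using these per-turn rewards in place of a handcrafted end-of-session reward.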

Highlights

  • Dialog policy, deciding the action that the dialog agent should take at each turn, is a crucial component of a task-oriented dialog system

  • Reinforcement Learning (RL) is commonly used to learn a dialog policy (Fatemi et al., 2016; Peng et al., 2017; Chen et al., 2017; Yarats and Lewis, 2018; Lei et al., 2018; He et al., 2018; Su et al., 2018), where users are modeled as part of the environment and the policy is learned through interactions with users

  • We propose a guided policy learning method for joint reward estimation and policy optimization in multi-domain task-oriented dialog

Summary

Introduction

Dialog policy, which decides the action the dialog agent should take at each turn, is a crucial component of a task-oriented dialog system. A handcrafted reward function typically evaluates task success only at the end of a session, whereas a good reward function should guide the policy dynamically toward completing the task throughout the conversation; we refer to this as the reward sparsity issue. To deal with reward sparsity, our reward estimator evaluates state-action pairs rather than the entire generated dialog session, which provides a reward signal at each dialog turn and thus guides dialog policy learning better. The corpus involves large state and action spaces and multiple decisions within a single turn, which makes it more challenging for the reward estimator to infer the user goal. We therefore build the reward estimator via Inverse Reinforcement Learning (IRL) to infer an appropriate reward from multi-domain dialog sessions, avoiding manual design of the reward function. Results show the superiority of our model over state-of-the-art baselines.
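To make the contrast concrete, here is a hedged sketch of session-level versus state-action-level reward signals; the names and constants (reward_estimator, success_reward, step_penalty) are hypothetical and only illustrate the difference in reward density, not the paper's actual settings.

    # Illustrative contrast between a sparse, handcrafted session-level reward
    # and a dense, learned state-action-level reward.
    def session_level_rewards(turns, success, success_reward=40, step_penalty=-1):
        # Sparse: a small per-turn penalty, with the main signal only at the end.
        rewards = [step_penalty] * len(turns)
        rewards[-1] += success_reward if success else 0
        return rewards

    def state_action_level_rewards(turns, reward_estimator):
        # Dense: the learned estimator scores every (state, action) pair,
        # guiding the policy throughout the session rather than only at its end.
        return [reward_estimator(state, action) for state, action in turns]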

Multi-Domain Dialog Policy Learning
Reward Learning in Dialog Systems
Adversarial Inverse Reinforcement Learning
Overview
Multi-Domain Dialog State Tracker
Session Level Reward Estimation
State-Action Level Reward Estimation
Data and Simulators
Evaluation Metrics
Implementation Details
Baselines
Main Results
Ablation Study
Interaction with Neural Simulator
Goal across Multiple Domains
Human Evaluation
Reward Evaluation
Discussion