Abstract

Dialog policy decides what and how a task-oriented dialog system will respond, and plays a vital role in delivering effective conversations. Many studies apply Reinforcement Learning to learn a dialog policy with a reward function that requires elaborate design and pre-specified user goals. With the growing need to handle complex goals across multiple domains, such manually designed reward functions cannot cope with the complexity of real-world tasks. To this end, we propose Guided Dialog Policy Learning, a novel algorithm based on Adversarial Inverse Reinforcement Learning for joint reward estimation and policy optimization in multi-domain task-oriented dialog. The proposed approach estimates the reward signal and infers the user goal from dialog sessions. The reward estimator evaluates state-action pairs so that it can guide the dialog policy at each dialog turn. Extensive experiments on a multi-domain dialog dataset show that the dialog policy guided by the learned reward function achieves remarkably higher task success than state-of-the-art baselines.
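For intuition, the sketch below illustrates the adversarial-IRL-style turn-level reward the abstract describes: a discriminator of the form D(s,a) = exp(f(s,a)) / (exp(f(s,a)) + pi(a|s)) induces a per-turn reward r(s,a) = f(s,a) - log pi(a|s). This is a minimal sketch under standard AIRL assumptions; the function and variable names are illustrative, not the paper's implementation.

    # Minimal sketch of an AIRL-style turn-level reward (illustrative only).
    # f_logit is the output of a learned network f(s, a); log_pi_a is the
    # log-probability the current dialog policy assigns to action a in state s.
    import numpy as np

    def discriminator_prob(f_logit, log_pi_a):
        # D(s, a) = exp(f) / (exp(f) + pi(a|s)), computed stably as a sigmoid.
        return 1.0 / (1.0 + np.exp(-(f_logit - log_pi_a)))

    def airl_reward(f_logit, log_pi_a):
        # r(s, a) = log D - log(1 - D) = f(s, a) - log pi(a|s),
        # a dense reward the policy can receive at every dialog turn.
        return f_logit - log_pi_a

The dialog policy can then be optimized with any standard RL algorithm using these per-turn rewards in place of a handcrafted end-of-session reward.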

Highlights

  • Dialog policy, deciding the action that the dialog agent should take at each turn, is a crucial component of a task-oriented dialog system

  • Reinforcement Learning (RL) is commonly used to learn a dialog policy (Fatemi et al., 2016; Peng et al., 2017; Chen et al., 2017; Yarats and Lewis, 2018; Lei et al., 2018; He et al., 2018; Su et al., 2018), where users are modeled as part of the environment and the policy is learned through interactions with users

  • We propose a guided policy learning method for joint reward estimation and policy optimization in multi-domain task-oriented dialog

Summary

Introduction

Dialog policy, which decides the action the dialog agent should take at each turn, is a crucial component of a task-oriented dialog system. A handcrafted reward function typically evaluates task success only at the end of a session, whereas a good reward function should guide the policy dynamically toward completing the task throughout the conversation; we refer to this as the reward sparsity issue. To deal with reward sparsity, our reward estimator evaluates state-action pairs rather than the entire generated dialog session, which provides a reward signal at each dialog turn and thus guides dialog policy learning better. The corpus involves large state and action spaces and multiple decisions within a single turn, which makes it more challenging for the reward estimator to infer the user goal. We therefore build the reward estimator via Inverse Reinforcement Learning (IRL) to infer an appropriate reward from multi-domain dialog sessions, avoiding manual design of the reward function. Results show the superiority of our model over state-of-the-art baselines.
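To make the contrast concrete, here is a hedged sketch of session-level versus state-action-level reward signals; the names and constants (reward_estimator, success_reward, step_penalty) are hypothetical and only illustrate the difference in reward density, not the paper's actual settings.

    # Illustrative contrast between a sparse, handcrafted session-level reward
    # and a dense, learned state-action-level reward.
    def session_level_rewards(turns, success, success_reward=40, step_penalty=-1):
        # Sparse: a small per-turn penalty, with the main signal only at the end.
        rewards = [step_penalty] * len(turns)
        rewards[-1] += success_reward if success else 0
        return rewards

    def state_action_level_rewards(turns, reward_estimator):
        # Dense: the learned estimator scores every (state, action) pair,
        # guiding the policy throughout the session rather than only at its end.
        return [reward_estimator(state, action) for state, action in turns]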

Multi-Domain Dialog Policy Learning
Reward Learning in Dialog Systems
Adversarial Inverse Reinforcement Learning
Overview
Multi-Domain Dialog State Tracker
Session Level Reward Estimation
State-Action Level Reward Estimation
Data and Simulators
Evaluation Metrics
Implementation Details
Baselines
Main Results
Ablation Study
Interaction with Neural Simulator
Goal across Multiple Domains
Human Evaluation
Reward Evaluation
Discussion