Abstract

Learning suitable, well-performing dialogue behaviour in statistical spoken dialogue systems has been a focus of research for many years. While most reinforcement-learning-based work models the reward signal with an objective measure such as task success, we propose a reward signal based on user satisfaction. We introduce a novel estimator and show that it outperforms all previous estimators while implicitly learning temporal dependencies. In simulated experiments, we show that a live user satisfaction estimation model can be applied to achieve higher estimated satisfaction while maintaining similar success rates. Moreover, we show that a satisfaction estimation model trained on one domain can be applied in many other domains that cover a similar task. We verify these findings by employing the model in one of the domains to learn a policy from real users, and compare its performance to policies whose reward uses user satisfaction and task success acquired directly from the users.

Highlights

  • Spoken dialogue systems (SDSs) enable voice interaction between technical systems and humans

  • We argue that training a system to maximise user satisfaction (US) is a good alternative to task success (TS): user satisfaction more accurately reflects the user’s view and whether the user is likely to use the system again in the future

  • This article has demonstrated that employing a user satisfaction reward estimator for learning dialogue policies without any knowledge about the domain can yield good performance in terms of both task success rate and user satisfaction

Summary

Introduction

Spoken dialogue systems (SDSs) enable voice interaction between technical systems and humans, and have increasingly found their way into our everyday lives. One prominent way of modelling the decision-making component of a spoken dialogue system is to use (partially observable) Markov decision processes ((PO)MDPs) (Lemon and Pietquin, 2012; Young et al., 2013). Task-oriented dialogue systems traditionally model the reward r, which guides the learning process, with task success as the principal reward component (Gašić and Young, 2014; Lemon and Pietquin, 2007; Daubigney et al., 2012; Levin and Pieraccini, 1997; Singh et al., 2002; Young et al., 2013; Su et al., 2015, 2016).
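To make the contrast concrete, the following is a minimal sketch of the two reward-shaping schemes discussed above: a classic objective reward built from task success and dialogue length, and a subjective variant where a user-satisfaction rating replaces the binary success bonus. The function names and constants are illustrative assumptions, not taken from the paper.

```python
# Hedged sketch of per-dialogue reward shaping in a (PO)MDP dialogue policy.
# All names and constants are illustrative, not from the paper.

def task_success_reward(num_turns: int, success: bool,
                        success_bonus: int = 20, turn_penalty: int = 1) -> int:
    """Objective reward: a bonus for task success minus a per-turn cost."""
    return (success_bonus if success else 0) - turn_penalty * num_turns

def user_satisfaction_reward(num_turns: int, satisfaction: int,
                             scale: int = 5, turn_penalty: int = 1) -> int:
    """Subjective reward: a scaled user-satisfaction rating (e.g. 1-5)
    replaces the binary success bonus."""
    return scale * satisfaction - turn_penalty * num_turns

# A successful 8-turn dialogue under the objective scheme:
print(task_success_reward(8, True))        # 12
# The same dialogue rated 4/5 by the user under the subjective scheme:
print(user_satisfaction_reward(8, 4))      # 12
```

Note that under the satisfaction-based scheme a dialogue that technically succeeds but frustrates the user (e.g. rated 1/5) receives a much lower reward, which is precisely the distinction the paper's user-satisfaction reward estimator is meant to capture.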

