Abstract

Conversational spoken dialogue systems that interact with the user rather than merely reading out text can be equipped with hesitations to manage dialogue flow and user attention. Based on a series of empirical studies, we elaborated a hesitation synthesis strategy for dialogue systems, which inserts hesitations of a scalable extent wherever needed in the ongoing utterance. Previous evaluations of hesitation systems have shown that hesitations negatively affect perceived synthesis quality but improve interaction quality. We argue that, due to its conversational nature, hesitation synthesis needs interactive evaluation rather than traditional mean opinion score (MOS)-based questionnaires. To validate this claim, we evaluate our system’s speech synthesis component in two ways: on the one hand, linked to the dialogue system evaluation, and on the other hand, in the traditional MOS-based way. We are thus able to analyze and discuss differences that arise due to the evaluation methodology. Our results suggest that MOS scales are not sufficient to assess speech synthesis quality, leading to implications for future research that are discussed in this paper. Furthermore, our results indicate that synthetic hesitations are able to increase task performance and that an elaborated hesitation strategy is necessary to avoid likability issues.

Highlights

  • The study we are presenting in this paper rests on the assumption that uninterrupted, fully fluent speech output is suboptimal for many human–machine interactions where listeners need to process information that is synthetically generated, and where a human speaker would try to deliver the information in a way which is suited to the [...] (Multimodal Technologies and Interact. 2018, 2, 9; doi:10.3390/mti2010009)

  • The results gathered in this preliminary testing of the hesitation model followed the expected directions

  • Speech synthesis quality suffers from the presence of hesitation, but task performance appears to benefit from it

Summary

Introduction

Despite the interactive nature of many of these applications, speech output remains rather static, typically reading out pre-defined texts or often responding with an awkward delay. A special feature of synthetic speech is its “fluency”, i.e., it does not contain the hesitations, reformulations, or filled pauses typical of spontaneous human speech production. Speech output, once generated, is produced in a single, uninterrupted fashion. The study we are presenting in this paper rests on the assumption that this is suboptimal for many human–machine interactions where listeners need to process information that is synthetically generated, and where a human speaker would try to deliver the information in a way which is suited to the [...] For the most influential descriptive work on disfluencies in general, see [11].
