Abstract

This paper describes a method for estimating the internal state of a spoken dialog system user before the user's first input utterance. In practice, users of dialog-based systems are often perplexed by the system prompt. A typical system provides more detailed information to a user who is taking a long time to respond, but such assistance is a nuisance if the user is merely considering how to answer the prompt. To respond appropriately, a spoken dialog system should be able to consider the user's internal state before the user's input. Conventional studies on user modeling have relied on the linguistic information of the utterance to estimate the user's internal state, but this approach cannot produce an estimate until the end of the user's first utterance. We therefore focus on the user's nonverbal behavior up to the beginning of the input utterance, such as fillers, silence, and head movement. The experimental data were collected in a Wizard of Oz setting, and the labels were assigned by five evaluators. Finally, we conducted a discrimination experiment with a user model trained on the combined features. For the three-class discrimination task, we obtained about 85% accuracy in an open test.

Highlights

  • Speech is the most basic medium of human-human communication and is expected to be one of the main modalities of more flexible man-machine interaction, alongside various intuitive interfaces, rather than traditional text-based interfaces

  • Many conventional studies on user modeling have focused on the linguistic information of the user’s utterance and estimated the user’s internal states based on the dialog history [8,9,10]

  • We carried out an experiment to discriminate among the three classes of the user’s internal state (States A, B, and C) using a Support Vector Machine (SVM) [36]
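The three-class SVM discrimination named above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the feature values are synthetic placeholders standing in for the combined speech- and vision-based features (e.g., filler and silence durations, head-motion statistics), and the evaluation uses cross-validation as a stand-in for the paper's open test.

```python
# Hedged sketch of three-class internal-state discrimination with an SVM.
# The features and labels below are synthetic; in the paper they would be
# nonverbal features extracted before the user's first utterance.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(90, 4))   # 90 samples x 4 placeholder nonverbal features
y = np.repeat([0, 1, 2], 30)   # three internal-state classes (A, B, C)

# RBF-kernel SVM; scikit-learn handles the multiclass case via one-vs-one.
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
scores = cross_val_score(clf, X, y, cv=5)  # held-out accuracy per fold
print(scores.mean())
```

With real discriminative features, the mean held-out accuracy would correspond to the open-test accuracy reported in the paper; with the random placeholders here it stays near chance (about 1/3).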


Introduction

Speech is the most basic medium of human-human communication and is expected to be one of the main modalities of more flexible man-machine interaction, alongside various intuitive interfaces, rather than traditional text-based interfaces. Many conventional studies on user modeling have focused on the linguistic information of the user’s utterance and estimated the user’s internal state from the dialog history (i.e., previously observed utterances made by the user and the system) [8,9,10]. Buß and Schlangen [18] defined the short utterance segment in continuous speech as a subutterance phenomenon and analyzed its role in turn-taking and back-channels; these works focused on Phase 3 of a session. Edlund and Nordstrand [29] studied a multimodal dialog system and examined turn-taking between the user and an agent (an animated talking head) that produced gestures such as head motion and gaze control. Most of these works focused on turn-taking behavior after the dialog has been established (i.e., Phase 3).

Collection and Analysis of Dialog Data
Speech-Based Features
Period f_p
Vision-Based Features
Findings
Discrimination Experiment
