Abstract

Spoken interactions are known for accurate timing and alignment between interlocutors: turn-taking and topic flow are managed in a manner that provides conversational fluency and smooth progress of the task. This paper studies the relation between the interlocutors' eye-gaze and spoken utterances, and describes our experiments on turn alignment. We conducted classification experiments with a Support Vector Machine on turn-taking, using dialogue act, eye-gaze, and speech prosody features from conversation data. The results demonstrate that eye-gaze features are important signals in turn management, and appear even more important than speech features when the intention of the utterance is clear.

Index Terms: eye-gaze, dialogue, interaction, speech analysis, turn-taking

1. Introduction

The role of eye-gaze in fluent communication has long been acknowledged ([2]; [7]). Previous research has established close relations between eye-gaze and conversational feedback ([3]), building trust and rapport, as well as the focus of shared attention ([15]). Eye-gaze is also important in turn-taking signalling: interlocutors usually signal their wish to yield the turn by gazing up at the partner, leaning back, and dropping in pitch and loudness, and the partner can accordingly start preparing to take the turn. There is evidence that lack of eye contact decreases turn-taking efficiency in video-conferencing ([16]), and that coupling the speech and gaze streams in a word acquisition task can improve performance significantly ([11]).

Several computational models of eye-gaze behaviour for artificial agents have also been designed. For instance, [9] describe an eye-gaze model for believable virtual humans, [13] demonstrate gaze modelling for conversational engagement, and [10] built an eye-gaze model to ground information in interactions with embodied conversational agents.

Our research focuses on turn-taking and eye-gaze alignment in natural dialogues, and especially on the role of eye-gaze as a means to coordinate and control turn-taking. In our previous work [5, 6] we noticed that in multi-party dialogues the participants' head movement was important in signalling turn-taking, perhaps because of its greater visibility compared with eye-gaze. (This is in agreement with [12], who noticed that in virtual environments head tracking seems sufficient when people turn their heads to look at an object, but when they do not, eye-tracking is needed to discern where they are gazing.) The main objective of the current research is to explore the relation between eye-gaze and speech, in particular how the annotated turn and dialogue features and automatically recognized speech properties affect turn management. Methodologically, our research relies on experimentation and observation: signal-level measurements and analysis of gaze and speech are combined with human-level observation of dialogue events (dialogue acts and turn-taking).

We use our three-party dialogue data, which is analysed with respect to the interlocutors' speech and annotated with dialogue acts, eye-gaze, and turn-taking features [6]. The experiments deal with the classification of turn-taking events using the analysed features, and the results show that eye-gaze and speech information significantly improves accuracy compared with classification based on dialogue act information only. Interestingly, however, the difference between gaze and speech features is not significant: eye-gaze and speech are both important signals in turn management, but their effect is parallel rather than complementary. Moreover, eye-gaze seems to be more important than speech when the intention of the utterance is clear.
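To make the classification setup concrete, the following is a minimal sketch of how such an experiment could be set up; it is not our actual pipeline. It assumes scikit-learn, and the feature groups and synthetic data are hypothetical placeholders for the annotated dialogue act, eye-gaze, and prosodic features.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 200  # synthetic utterance-final segments (placeholder data)

# Hypothetical feature groups:
#   dialogue act: binary indicators (e.g. statement, question, backchannel)
#   eye-gaze:     gaze-at-partner ratio, gaze shift near utterance end
#   prosody:      final pitch slope, mean energy, speech rate
X = np.hstack([
    rng.integers(0, 2, size=(n, 3)).astype(float),  # dialogue-act indicators
    rng.random(size=(n, 2)),                         # gaze features
    rng.normal(size=(n, 3)),                         # prosodic features
])
y = rng.integers(0, 2, size=n)  # 1 = turn change, 0 = turn hold

# SVM on standardised features, evaluated with 5-fold cross-validation
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
scores = cross_val_score(clf, X, y, cv=5)
print(f"mean accuracy: {scores.mean():.2f}")

Running the same evaluation with different feature subsets (dialogue acts only, with gaze added, with prosody added) gives the kind of comparison discussed in Section 4.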
The paper is structured as follows. We first describe research on turn-taking and the alignment of speech and gaze in Section 2. We then present our data and speech analysis in Section 3, and the experimental results, together with a discussion of their significance, in Section 4. Section 5 draws conclusions and points to future research.
