Abstract

In real-world dialogue, first-person visual information about where the other speakers are and what they are paying attention to is crucial for understanding their intentions. Non-verbal responses also play an important role in social interactions. In this paper, we propose a visually-grounded first-person dialogue (VFD) dataset with verbal and non-verbal responses. The VFD dataset provides manually annotated (1) first-person images of agents, (2) utterances of human speakers, (3) eye-gaze locations of the speakers, and (4) the agents' verbal and non-verbal responses. We present experimental results obtained using the proposed VFD dataset and recent neural network models (e.g., BERT, ResNet). The results demonstrate that first-person vision helps neural network models correctly understand human intentions, and that producing non-verbal responses is as challenging a task as producing verbal responses. Our dataset is publicly available.
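As a rough illustration of the four annotation types listed above, a single entry could be represented by a record like the one below. This is a minimal sketch; the field names and values are our assumptions for illustration, not the dataset's released schema.

    from dataclasses import dataclass
    from typing import Tuple

    @dataclass
    class VFDExample:
        """One hypothetical VFD entry; field names are illustrative, not the official schema."""
        image_path: str               # first-person image taken from the agent's viewpoint
        utterance: str                # the human speaker's utterance
        gaze_xy: Tuple[float, float]  # speaker's eye-gaze location, e.g., normalized image coordinates
        verbal_response: str          # annotated verbal response of the agent
        nonverbal_response: str       # annotated non-verbal response (an action description)

    example = VFDExample(
        image_path="scene_0001.jpg",
        utterance="I wonder if there is an L for this.",
        gaze_xy=(0.42, 0.57),
        verbal_response="There are a lot of the same clothes, so one of them might be an L.",
        nonverbal_response="Check the sizes of the same clothes.",
    )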

Highlights

  • U (utterance): これのLはないのかしら (I wonder if there is an L for this.) V (verbal response): 同じ服がたくさんあるからどれかはLじゃないかな (There are a lot of the same clothes, so one of them might be an L.) N (non-verbal response): 同じ服のサイズをチェックする (Check the sizes of the same clothes.)

  • In real-world dialogue scenarios, first-person visual information about where the human speaker is and what they are paying attention to is crucial for agents to understand human intentions

  • Without the first-person image, it is difficult for the agent to recognize that the pronoun “this” in the human utterance (U) refers to the article of yellow clothing rather than any other products


Summary

Introduction

U (utterance): これのLはないのかしら (I wonder if there is an L for this.) V (verbal response): 同じ服がたくさんあるからどれかはLじゃないかな (There are a lot of the same clothes, so one of them might be an L.) N (non-verbal response): 同じ服のサイズをチェックする (Check the sizes of the same clothes.)

Huber et al. (2018) developed an image-grounded conversational agent that uses visual sentiment, facial expression, and scene features, and Mostafazadeh et al. (2017) constructed the publicly available IGC dataset, which comprises image-grounded conversations. Although these studies and resources have been shown to be useful, they currently have two limitations. In real-world dialogue scenarios, first-person visual information about where the human speaker is and what they are paying attention to is crucial for agents to understand human intentions. Without the first-person image, it is difficult for the agent to recognize that the pronoun “this” in the human utterance (U) refers to the article of yellow clothing rather than any other products (e.g., brown clothes).
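The abstract mentions experiments with recent text and vision encoders (BERT, ResNet). The following is a minimal sketch, assuming a simple late-fusion baseline, of how the first-person image and the utterance could be combined to score candidate responses; the specific architecture, layer sizes, fusion scheme, and the English BERT checkpoint are illustrative assumptions, not the authors' reported model (the dataset itself is in Japanese, so a Japanese BERT checkpoint would be substituted in practice).

    # A minimal bimodal baseline sketch (assumed, not the paper's exact model):
    # ResNet encodes the first-person image, BERT encodes the utterance, and the
    # concatenated features score candidate (verbal or non-verbal) responses.
    import torch
    import torch.nn as nn
    from torchvision.models import resnet50
    from transformers import BertModel, BertTokenizer

    class BimodalResponseScorer(nn.Module):
        def __init__(self, num_candidates: int = 2):
            super().__init__()
            self.bert = BertModel.from_pretrained("bert-base-uncased")
            cnn = resnet50(weights=None)
            self.cnn = nn.Sequential(*list(cnn.children())[:-1])  # keep features, drop the classifier head
            self.fusion = nn.Linear(768 + 2048, 512)
            self.scorer = nn.Linear(512, num_candidates)

        def forward(self, input_ids, attention_mask, image):
            text_feat = self.bert(input_ids=input_ids,
                                  attention_mask=attention_mask).pooler_output  # (B, 768)
            img_feat = self.cnn(image).flatten(1)                               # (B, 2048)
            fused = torch.relu(self.fusion(torch.cat([text_feat, img_feat], dim=-1)))
            return self.scorer(fused)                                           # scores per candidate response

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    batch = tokenizer(["I wonder if there is an L for this."], return_tensors="pt", padding=True)
    image = torch.randn(1, 3, 224, 224)  # stand-in for a preprocessed first-person image
    model = BimodalResponseScorer()
    scores = model(batch["input_ids"], batch["attention_mask"], image)

A gaze-aware variant could additionally crop or weight image regions around the annotated eye-gaze location before encoding, which is one way the dataset's gaze annotations might be exploited.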

