Abstract

In real-world dialogue, first-person visual information about where the other speakers are and what they are paying attention to is crucial for understanding their intentions. Non-verbal responses also play an important role in social interactions. In this paper, we propose a visually-grounded first-person dialogue (VFD) dataset with verbal and non-verbal responses. The VFD dataset provides manually annotated (1) first-person images of agents, (2) utterances of human speakers, (3) eye-gaze locations of the speakers, and (4) the agents' verbal and non-verbal responses. We present experimental results obtained using the proposed VFD dataset and recent neural network models (e.g., BERT, ResNet). The results demonstrate that first-person vision helps neural network models correctly understand human intentions, and that producing non-verbal responses is as challenging a task as producing verbal responses. Our dataset is publicly available.
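As a rough illustration of the four annotation types listed above, a single entry could be represented by a record like the one below. This is a minimal sketch; the field names and values are our assumptions for illustration, not the dataset's released schema.

    from dataclasses import dataclass
    from typing import Tuple

    @dataclass
    class VFDExample:
        """One hypothetical VFD entry; field names are illustrative, not the official schema."""
        image_path: str               # first-person image taken from the agent's viewpoint
        utterance: str                # the human speaker's utterance
        gaze_xy: Tuple[float, float]  # speaker's eye-gaze location, e.g., normalized image coordinates
        verbal_response: str          # annotated verbal response of the agent
        nonverbal_response: str       # annotated non-verbal response (an action description)

    example = VFDExample(
        image_path="scene_0001.jpg",
        utterance="I wonder if there is an L for this.",
        gaze_xy=(0.42, 0.57),
        verbal_response="There are a lot of the same clothes, so one of them might be an L.",
        nonverbal_response="Check the sizes of the same clothes.",
    )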

Highlights

  • U (utterance): これのLはないのかしら (I wonder if there is an L for this.) V (verbal response): 同じ服がたくさんあるからどれかはLじゃないかな (There are a lot of the same clothes, so one of them might be an L.) N (non-verbal response): 同じ服のサイズをチェックする (Check the sizes of the same clothes.)

  • In real-world dialogue scenarios, first-person visual information about where the human speaker is and what they are paying attention to is crucial for agents to understand human intentions

  • Without the first-person image, it is difficult for the agent to recognize that the pronoun “this” in the human utterance (U) refers to the article of yellow clothing rather than any other products


Summary

Introduction

U (utterance): これのLはないのかしら (I wonder if there is an L for this.) V (verbal response): 同じ服がたくさんあるからどれかはLじゃないかな (There are a lot of the same clothes, so one of them might be an L.) N (non-verbal response): 同じ服のサイズをチェックする (Check the sizes of the same clothes.)

Huber et al. (2018) developed an image-grounded conversational agent that uses visual sentiment, facial expression, and scene features, and Mostafazadeh et al. (2017) constructed the publicly available IGC dataset, which comprises image-grounded conversations. Although these studies and resources have been shown to be useful, they currently have two limitations. In real-world dialogue scenarios, first-person visual information about where the human speaker is and what they are paying attention to is crucial for agents to understand human intentions. Without the first-person image, it is difficult for the agent to recognize that the pronoun “this” in the human utterance (U) refers to the article of yellow clothing rather than any other products (e.g., brown clothes).
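The abstract mentions experiments with recent text and vision encoders (BERT, ResNet). The following is a minimal sketch, assuming a simple late-fusion baseline, of how the first-person image and the utterance could be combined to score candidate responses; the specific architecture, layer sizes, fusion scheme, and the English BERT checkpoint are illustrative assumptions, not the authors' reported model (the dataset itself is in Japanese, so a Japanese BERT checkpoint would be substituted in practice).

    # A minimal bimodal baseline sketch (assumed, not the paper's exact model):
    # ResNet encodes the first-person image, BERT encodes the utterance, and the
    # concatenated features score candidate (verbal or non-verbal) responses.
    import torch
    import torch.nn as nn
    from torchvision.models import resnet50
    from transformers import BertModel, BertTokenizer

    class BimodalResponseScorer(nn.Module):
        def __init__(self, num_candidates: int = 2):
            super().__init__()
            self.bert = BertModel.from_pretrained("bert-base-uncased")
            cnn = resnet50(weights=None)
            self.cnn = nn.Sequential(*list(cnn.children())[:-1])  # keep features, drop the classifier head
            self.fusion = nn.Linear(768 + 2048, 512)
            self.scorer = nn.Linear(512, num_candidates)

        def forward(self, input_ids, attention_mask, image):
            text_feat = self.bert(input_ids=input_ids,
                                  attention_mask=attention_mask).pooler_output  # (B, 768)
            img_feat = self.cnn(image).flatten(1)                               # (B, 2048)
            fused = torch.relu(self.fusion(torch.cat([text_feat, img_feat], dim=-1)))
            return self.scorer(fused)                                           # scores per candidate response

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    batch = tokenizer(["I wonder if there is an L for this."], return_tensors="pt", padding=True)
    image = torch.randn(1, 3, 224, 224)  # stand-in for a preprocessed first-person image
    model = BimodalResponseScorer()
    scores = model(batch["input_ids"], batch["attention_mask"], image)

A gaze-aware variant could additionally crop or weight image regions around the annotated eye-gaze location before encoding, which is one way the dataset's gaze annotations might be exploited.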

