Abstract

Current vision-and-language understanding tasks can generally be divided into two categories: image (or video) description and visual question answering. Both aim to train a model that either directly generates a description or answers predefined questions based on an image or a sequence of images, from a spectator's perspective. However, a large proportion of real-world human interactions involve spontaneous dialogue exchanges among multiple speakers as well as dynamic visual-textual context, which requires an AI agent to hold a natural, open-ended dialogue with humans in a first-person manner grounded in both visual and textual context. To move closer to such spontaneous multimodal conversation, we introduce a new visual dialogue generation dataset (VisDG) built from keyframes and corresponding subtitles extracted from Friends, an American sitcom television series. Specifically, given a start frame and its corresponding dialogue text, the agent must both generate a meaningful textual response and select the correct image candidate for the latter part of the dialogue turn. Furthermore, we propose an end-to-end image-text synergistic network (ITSN) for this task, which outperforms several sophisticated baselines on the proposed VisDG.
