Abstract

Current vision-and-language understanding tasks can generally be divided into two categories: image (or video) description and visual question answering. Both aim to train a model that either directly generates a description or answers predefined questions based on an image or a sequence of images, from a spectator's perspective. However, a large proportion of real-world human interactions involve spontaneous dialogue exchanges among multiple speakers as well as dynamic visual-textual context, which requires an AI agent to hold a natural, open-ended dialogue with humans in a first-person manner grounded in both visual and textual context. To move closer to such spontaneous multimodal conversation, we introduce a new visual dialogue generation dataset (VisDG) built from keyframes and corresponding subtitles extracted from Friends, an American sitcom television series. Specifically, given a start frame and its corresponding dialogue text, the agent must both generate a meaningful textual response and select the correct image candidate for the latter part of the dialogue turn. Furthermore, we propose an end-to-end image-text synergistic network (ITSN) for this task, which outperforms several sophisticated baselines on the proposed VisDG.
