Abstract

Visual dialogue is a challenging task because it requires answering a series of coherent questions based on an understanding of the visual environment. Previous studies explore multimodal co-reference only implicitly, by attending to spatial or object-level image features, and neglect the importance of explicitly locating the objects in the visual content that are associated with entities in the textual content. Therefore, in this paper we propose a Multimodal Incremental Transformer with Visual Grounding, named MITVG, which consists of two key parts: visual grounding and a multimodal incremental transformer. Visual grounding explicitly locates the objects in the image that are referred to by textual entities, which helps the model exclude visual content that does not require attention. Building on visual grounding, the multimodal incremental transformer encodes the multi-turn dialogue history together with the visual scene step by step, following the order of the dialogue, and then generates a contextually and visually coherent response. Experimental results on the VisDial v0.9 and v1.0 datasets demonstrate the effectiveness of the proposed model, which achieves comparable performance.
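As a rough illustration of this two-stage design, the following PyTorch sketch wires a visual-grounding cross-attention step to an incremental, turn-by-turn history encoder. The class name MITVGSketch, its submodules, and all tensor shapes are assumptions made for this sketch and do not reproduce the authors' implementation.

    # Illustrative sketch only: module names and shapes are assumptions,
    # not the authors' released MITVG code.
    import torch
    import torch.nn as nn

    class MITVGSketch(nn.Module):
        def __init__(self, d_model=512, n_heads=8):
            super().__init__()
            # Cross-attention for visual grounding: caption/entity tokens
            # query the object-level image features.
            self.ground_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            # Cross-attention plus a transformer layer for incremental encoding of turns.
            self.fuse_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.turn_encoder = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)

        def forward(self, object_feats, caption_emb, turn_embs):
            # object_feats: (B, num_objects, d) object-level image features
            # caption_emb:  (B, cap_len, d)     embedded image caption
            # turn_embs:    list of (B, len_t, d) embedded Q/A turns in dialogue order
            # 1) Visual grounding: keep only the visual content referenced by the text.
            grounded, _ = self.ground_attn(caption_emb, object_feats, object_feats)

            # 2) Incremental encoding: fold each turn into the running history,
            #    conditioned on the grounded visual context.
            history = grounded
            for turn in turn_embs:
                fused, _ = self.fuse_attn(turn, history, history)
                history = self.turn_encoder(torch.cat([history, fused], dim=1))
            return history  # fed to a decoder that generates the response

The point carried over from the abstract is that each turn is folded into the running history in dialogue order, so later turns are encoded against both the grounded visual context and all earlier turns.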

Highlights

  • There is increasing interest in vision-language tasks, such as image captioning (Xu et al., 2015; Anderson et al., 2016, 2018; Cornia et al., 2020) and visual question answering (Ren et al., 2015a; Gao et al., 2015; Lu et al., 2016; Anderson et al., 2018)

  • We propose a novel multimodal incremental transformer that encodes the multi-turn dialogue history step by step, combined with the visual content, and generates a contextually and visually coherent response

  • The improvement in R@10 is the largest, and our method also gains a large increase in mean reciprocal rank (MRR) and R@1 due to the explicit modeling of multiple modalities (see Sec. 3.5 for further quantitative analysis)


Summary

Introduction

There is increasing interest in vision-language tasks, such as image captioning (Xu et al., 2015; Anderson et al., 2016, 2018; Cornia et al., 2020) and visual question answering (Ren et al., 2015a; Gao et al., 2015; Lu et al., 2016; Anderson et al., 2018). As an extension of conventional single-turn visual question answering, Das et al. (2017) introduce a multi-turn visual question answering task named visual dialogue, which aims to explore the ability of an AI agent to hold a meaningful multi-turn dialogue with humans in natural language about visual content. Fusion-based models (Das et al., 2017) fuse spatial image features and textual features to obtain a joint representation, while attention-based models (Lu et al., 2017; Wu et al., 2018; Kottur et al., 2018) dynamically attend to spatial image features to find the related visual content.

[Figure caption: there is a frisbee team with their coach taking a team photo.]
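To make the mechanism that these attention-based baselines rely on concrete, the following is a minimal PyTorch sketch of question-guided attention over spatial image features; the function name, argument names, and tensor shapes are assumptions for illustration, not any particular model's released code.

    # Minimal sketch of question-guided attention over spatial image features.
    # Names and shapes are illustrative assumptions.
    import torch
    import torch.nn.functional as F

    def attend_to_spatial_features(question_vec, spatial_feats):
        # question_vec:  (B, d)      encoded question
        # spatial_feats: (B, H*W, d) grid of spatial image features (e.g. a CNN feature map)
        # Score every spatial location against the question, normalize, and pool.
        scores = torch.bmm(spatial_feats, question_vec.unsqueeze(2)).squeeze(2)  # (B, H*W)
        weights = F.softmax(scores, dim=1)                                       # attention over locations
        attended = torch.bmm(weights.unsqueeze(1), spatial_feats).squeeze(1)     # (B, d) attended visual context
        return attended, weights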

