Abstract

Visual dialogue is a challenging task because it requires answering a series of coherent questions based on an understanding of the visual environment. Previous studies explore multimodal co-reference only implicitly, by attending to spatial or object-level image features, and neglect the importance of explicitly locating the objects in the visual content that are associated with entities in the textual content. Therefore, in this paper we propose a Multimodal Incremental Transformer with Visual Grounding, named MITVG, which consists of two key parts: visual grounding and a multimodal incremental transformer. Visual grounding explicitly locates the objects in the image that are referred to by textual entities, which helps the model exclude visual content that does not require attention. Building on visual grounding, the multimodal incremental transformer encodes the multi-turn dialogue history together with the visual scene step by step, following the order of the dialogue, and then generates a contextually and visually coherent response. Experimental results on the VisDial v0.9 and v1.0 datasets demonstrate the effectiveness of the proposed model, which achieves comparable performance.
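As a rough illustration of this two-stage design, the following PyTorch sketch wires a visual-grounding cross-attention step to an incremental, turn-by-turn history encoder. The class name MITVGSketch, its submodules, and all tensor shapes are assumptions made for this sketch and do not reproduce the authors' implementation.

    # Illustrative sketch only: module names and shapes are assumptions,
    # not the authors' released MITVG code.
    import torch
    import torch.nn as nn

    class MITVGSketch(nn.Module):
        def __init__(self, d_model=512, n_heads=8):
            super().__init__()
            # Cross-attention for visual grounding: caption/entity tokens
            # query the object-level image features.
            self.ground_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            # Cross-attention plus a transformer layer for incremental encoding of turns.
            self.fuse_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.turn_encoder = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)

        def forward(self, object_feats, caption_emb, turn_embs):
            # object_feats: (B, num_objects, d) object-level image features
            # caption_emb:  (B, cap_len, d)     embedded image caption
            # turn_embs:    list of (B, len_t, d) embedded Q/A turns in dialogue order
            # 1) Visual grounding: keep only the visual content referenced by the text.
            grounded, _ = self.ground_attn(caption_emb, object_feats, object_feats)

            # 2) Incremental encoding: fold each turn into the running history,
            #    conditioned on the grounded visual context.
            history = grounded
            for turn in turn_embs:
                fused, _ = self.fuse_attn(turn, history, history)
                history = self.turn_encoder(torch.cat([history, fused], dim=1))
            return history  # fed to a decoder that generates the response

The point carried over from the abstract is that each turn is folded into the running history in dialogue order, so later turns are encoded against both the grounded visual context and all earlier turns.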

Highlights

  • There is increasing interest in vision-language tasks, such as image captioning (Xu et al., 2015; Anderson et al., 2016, 2018; Cornia et al., 2020) and visual question answering (Ren et al., 2015a; Gao et al., 2015; Lu et al., 2016; Anderson et al., 2018)

  • We propose a novel multimodal incremental transformer that encodes the multi-turn dialogue history step by step, combined with the visual content, and generates a contextually and visually coherent response

  • The improvement in R@10 is the largest, and our method also gains a large increase in mean reciprocal rank (MRR) and R@1 due to the explicit modeling of multiple modalities (see Sec. 3.5 for further quantitative analysis)


Summary

Introduction

There is increasing interest in vision-language tasks, such as image captioning (Xu et al., 2015; Anderson et al., 2016, 2018; Cornia et al., 2020) and visual question answering (Ren et al., 2015a; Gao et al., 2015; Lu et al., 2016; Anderson et al., 2018). As an extension of conventional single-turn visual question answering, Das et al. (2017) introduce a multi-turn visual question answering task named visual dialogue, which aims to explore the ability of an AI agent to hold a meaningful multi-turn dialogue with humans in natural language about visual content. Fusion-based models (Das et al., 2017) fuse spatial image features and textual features to obtain a joint representation, while attention-based models (Lu et al., 2017; Wu et al., 2018; Kottur et al., 2018) dynamically attend to spatial image features to find the related visual content.

[Figure caption: there is a frisbee team with their coach taking a team photo.]
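To make the mechanism that these attention-based baselines rely on concrete, the following is a minimal PyTorch sketch of question-guided attention over spatial image features; the function name, argument names, and tensor shapes are assumptions for illustration, not any particular model's released code.

    # Minimal sketch of question-guided attention over spatial image features.
    # Names and shapes are illustrative assumptions.
    import torch
    import torch.nn.functional as F

    def attend_to_spatial_features(question_vec, spatial_feats):
        # question_vec:  (B, d)      encoded question
        # spatial_feats: (B, H*W, d) grid of spatial image features (e.g. a CNN feature map)
        # Score every spatial location against the question, normalize, and pool.
        scores = torch.bmm(spatial_feats, question_vec.unsqueeze(2)).squeeze(2)  # (B, H*W)
        weights = F.softmax(scores, dim=1)                                       # attention over locations
        attended = torch.bmm(weights.unsqueeze(1), spatial_feats).squeeze(1)     # (B, d) attended visual context
        return attended, weights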

