Textual-Visual Reference-Aware Attention Network for Visual Dialog

Dan Guo, Shuhui Wang, Meng Wang, Hui Wang

doi:10.1109/tip.2020.2992888

Abstract

Visual dialog is a challenging task in multimedia understanding that requires a dialog agent to answer a series of questions about an input image. The critical issue in producing an exact answer is how to model the mutual semantic interaction among the feature representations of the image, the question-answer history, and the current question. In this study, we propose a textual-visual Reference-Aware Attention Network (RAA-Net), which aims to effectively fuse Q (question), H (history), Vl (local vision), and Vg (global vision) to infer the exact answer. In the multimodal feature flow, RAA-Net first learns the textual context through multi-head attention between Q and H, and then guides the textual reference semantics to the image to capture visual reference semantics through self- and cross-reference-aware attention within and between Vl and Vg. RAA-Net thus exploits a two-stage (intra- and inter-) visual reasoning mechanism on Vl and Vg. Extensive experiments on the VisDial v0.9 and v1.0 datasets show that RAA-Net achieves state-of-the-art performance. Visualizations of both the visual and textual attention maps further validate the remarkable interpretability achieved by our solution.
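
The first stage described in the abstract, multi-head attention between Q and H, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `multi_head_attention`, the feature size of 64, the four heads, the ten history rounds, and the random projection matrices (standing in for learned weights) are all assumptions made for illustration only.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_attention(query, context, num_heads, rng):
    """Generic scaled dot-product multi-head attention (hypothetical helper).

    query:   (n_q, d) features that attend, e.g. the current question Q
    context: (n_c, d) features attended to, e.g. the history rounds H
    Returns the attended output (n_q, d) and the attention weights
    (num_heads, n_q, n_c).
    """
    n_q, d = query.shape
    assert d % num_heads == 0, "feature size must be divisible by num_heads"
    d_head = d // num_heads

    # Assumption: random projections stand in for learned parameters.
    w_q, w_k, w_v, w_o = (
        rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(4)
    )

    def split_heads(x):
        # (n, d) -> (num_heads, n, d_head)
        return x.reshape(x.shape[0], num_heads, d_head).transpose(1, 0, 2)

    q = split_heads(query @ w_q)
    k = split_heads(context @ w_k)
    v = split_heads(context @ w_v)

    # Each head scores every query position against every context position.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)

    # Weighted sum of values per head, then merge the heads back to size d.
    heads = weights @ v
    merged = heads.transpose(1, 0, 2).reshape(n_q, d)
    return merged @ w_o, weights


rng = np.random.default_rng(0)
Q = rng.standard_normal((1, 64))    # one current-question feature (assumed size)
H = rng.standard_normal((10, 64))   # ten history-round features (assumed size)
text_ctx, attn = multi_head_attention(Q, H, num_heads=4, rng=rng)
```

Here `text_ctx` plays the role of the textual context that would then guide attention over the visual features Vl and Vg; the self- and cross-reference-aware visual attention itself is not sketched, since the abstract does not specify its form.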

Journal: IEEE Transactions on Image Processing
Publication Date: Jan 1, 2020
Citations: 53

Similar Papers

Dual Visual Attention Network for Visual Dialog
Dan Guo ... Hui Wang
01 Aug 2019

GoG: Relation-aware Graph-over-Graph Network for Visual Dialog
Feilong Chen ... Xiuyi Chen
01 Jan 2020

A fine-grained deconfounding study for knowledge-based visual dialog
An-An Liu ... Ning Xu
Visual Informatics
01 Oct 2024

Visual Dialog with Multi-turn Attentional Memory Network
Dejiang Kong ... Fei Wu
01 Jan 2018
