Abstract

Given an input image, the Visual Dialog task requires answering a sequence of questions in the form of a dialog. To generate accurate answers, all information from the dialog history, the question, and the image must be considered. However, existing methods usually rely only on the high-level semantic representation of each whole sentence in the dialog history and the question, ignoring the low-level detailed information carried by individual words. Similarly, low-level region details of the image also need to be considered for question answering. We therefore propose a novel visual dialog method that exploits both high-level and low-level information from the dialog history, the question, and the image. Our approach introduces three low-level attention modules that enhance the word representations of the dialog history and the question through word-to-word connections, and enrich the region information of the image through region-to-region relations. In addition, we design three high-level attention modules that select important words in the dialog history and the question to supplement the detailed information for semantic understanding, and that select relevant regions in the image to provide targeted visual information for question answering. We evaluate the proposed approach on two datasets, VisDial v0.9 and VisDial v1.0. The experimental results demonstrate that exploiting both low-level and high-level information substantially enhances the input representations.
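To make the word-to-word (and, analogously, region-to-region) idea concrete, a common way to realize such a low-level attention module is scaled dot-product self-attention over the element embeddings. The sketch below is an illustration of that generic mechanism, not the paper's exact module; the function name and dimensions are assumptions for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def word_to_word_attention(X):
    """Enhance each word vector with a weighted sum of all word vectors
    (scaled dot-product self-attention). X: (n_words, d) embeddings.
    This is a generic illustration, not the authors' exact module."""
    d = X.shape[1]
    scores = X @ X.T / np.sqrt(d)       # pairwise word-to-word affinities
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ X                  # attended word representations

# toy example: 4 words with 8-dimensional embeddings
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))
enhanced = word_to_word_attention(X)
print(enhanced.shape)  # (4, 8)
```

The same pattern applies to image regions: replacing word embeddings with region features yields a region-to-region attention that enriches each region with information from related regions.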
