Multimodal deep fusion for image question answering

Weifeng Zhang, Jing Yu, Yuxia Wang, Wei Wang

doi:10.1016/j.knosys.2020.106639

Abstract

Multimodal fusion plays a key role in Image Question Answering (IQA). However, most current algorithms do not sufficiently fuse the multiple relations implied across modalities, which are vital for predicting correct answers. In this paper, we design an effective Multimodal Deep Fusion Network (MDFNet) to achieve fine-grained multimodal fusion. Specifically, we propose a Graph Reasoning and Fusion Layer (GRFL) that reasons about complex spatial and semantic relations between visual objects and fuses these two kinds of relations adaptively. This fusion strategy allows different relations to make different contributions, guided by the reasoning step. A Multimodal Deep Fusion Network is then built by stacking several GRFLs to achieve sufficient multimodal fusion. Quantitative and qualitative experiments conducted on popular benchmarks, including VQA v2 and GQA, demonstrate the effectiveness of MDFNet. Our best single model achieves 71.19% overall accuracy on the VQA v2 dataset and 57.05% accuracy on the GQA dataset.
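
The abstract does not give the GRFL equations, so the following is only an illustrative sketch of the general idea it describes: reason over a spatial and a semantic relation graph separately, fuse the two results with an adaptive gate, and stack several such layers. Everything here is an assumption made for illustration, not the authors' method: the mean-aggregation message passing, the question-conditioned sigmoid gate, and the names `propagate`, `fuse_layer`, and `deep_fusion` are all hypothetical.

```python
import math


def propagate(adj, feats):
    """One round of mean-aggregation message passing over a relation graph.

    adj[i][j] is the (non-negative) weight of the edge from node j to node i.
    """
    out = []
    for i, row in enumerate(adj):
        total = sum(row)
        if total == 0:
            # An isolated node simply keeps its own features.
            out.append(list(feats[i]))
            continue
        dim = len(feats[i])
        out.append(
            [sum(w * feats[j][d] for j, w in enumerate(row)) / total
             for d in range(dim)]
        )
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sigmoid(x):
    # Numerically stable form that avoids overflow for large |x|.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def fuse_layer(feats, spatial_adj, semantic_adj, question):
    """Hypothetical fusion layer: reason over each graph, then gate per node."""
    spatial = propagate(spatial_adj, feats)
    semantic = propagate(semantic_adj, feats)
    fused = []
    for s, m in zip(spatial, semantic):
        # Assumed gate: favor whichever relation view aligns better
        # with the question vector.
        gate = sigmoid(dot(s, question) - dot(m, question))
        fused.append([gate * a + (1.0 - gate) * b for a, b in zip(s, m)])
    return fused


def deep_fusion(feats, spatial_adj, semantic_adj, question, num_layers=3):
    """Stack several fusion layers, feeding each layer's output to the next."""
    for _ in range(num_layers):
        feats = fuse_layer(feats, spatial_adj, semantic_adj, question)
    return feats
```

In this sketch the gate is a single scalar per object, so each object can weight its spatial and semantic views differently. A trained model would learn the gate parameters rather than compute them from a raw dot product.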

Journal: Knowledge-Based Systems
Publication Date: Nov 28, 2020
Citations: 21

Similar Papers

DMRFNet: Deep Multimodal Reasoning and Fusion for Visual Question Answering and explanation generation
Weifeng Zhang ... Chuan Ran
Information Fusion | VOL. 72
12 Feb 2021

Towards Forecasting the Onset of Cybersickness by Fusing Physiological, Head-tracking and Eye-tracking with Multimodal Deep Fusion Network
Rifatul Islam ... Kevin Desai
01 Oct 2022

Advancing classroom fatigue recognition: A multimodal fusion approach using self-attention mechanism
Lei Cao ... Chunjiang Fan
Biomedical Signal Processing and Control | VOL. 89
17 Nov 2023

Multi-modal bioelectrical signal fusion analysis based on different acquisition devices and scene settings: Overview, challenges, and novel orientation
Jingjing Li ... Qiang Wang
Information Fusion | VOL. 79
06 Nov 2021