Abstract

Visual Question Answering (VQA) is a multimodal task involving Computer Vision (CV) and Natural Language Processing (NLP); the goal is to build a high-efficiency VQA model. Learning a fine-grained and simultaneous understanding of both the visual content of images and the textual content of questions is at the heart of VQA. In this paper, novel Multimodal Encoder-Decoder Attention Networks (MEDAN) are proposed. MEDAN consists of Multimodal Encoder-Decoder Attention (MEDA) layers cascaded in depth, and can capture rich and reasonable question and image features by associating keywords in the question with important object regions in the image. Each MEDA layer contains an Encoder module modeling the self-attention of questions, as well as a Decoder module modeling the question-guided-attention and self-attention of images. Experimental evaluation results on the benchmark VQA-v2 dataset demonstrate that MEDAN achieves state-of-the-art VQA performance. With the Adam solver, our best single model delivers 71.01% overall accuracy on the test-std set, and with the AdamW solver, we achieve an overall accuracy of 70.76% on the test-dev set. Additionally, extensive ablation studies are conducted to explore the reasons for MEDAN's effectiveness.
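
The sketch below illustrates, in PyTorch, how such a cascaded Encoder-Decoder attention stack could be organized: an Encoder sublayer applying self-attention to question features and a Decoder sublayer applying question-guided attention followed by self-attention to image-region features, stacked for N layers. The class names, dimensions, sublayer details, and residual/normalization scheme are illustrative assumptions, not the authors' implementation.

```python
import torch
from torch import nn

class MEDALayer(nn.Module):
    """Minimal sketch of one MEDA layer (assumed structure, not the paper's exact code)."""

    def __init__(self, dim=512, heads=8, ff_dim=2048, dropout=0.1):
        super().__init__()
        # Encoder: self-attention over question word features.
        self.q_self = nn.MultiheadAttention(dim, heads, dropout=dropout, batch_first=True)
        # Decoder: question-guided attention, then self-attention over image regions.
        self.v_guided = nn.MultiheadAttention(dim, heads, dropout=dropout, batch_first=True)
        self.v_self = nn.MultiheadAttention(dim, heads, dropout=dropout, batch_first=True)
        self.q_ff = nn.Sequential(nn.Linear(dim, ff_dim), nn.ReLU(), nn.Linear(ff_dim, dim))
        self.v_ff = nn.Sequential(nn.Linear(dim, ff_dim), nn.ReLU(), nn.Linear(ff_dim, dim))
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(5)])

    def forward(self, q, v):
        # q: [batch, words, dim] question features; v: [batch, regions, dim] image features.
        q = self.norms[0](q + self.q_self(q, q, q, need_weights=False)[0])
        q = self.norms[1](q + self.q_ff(q))
        # Question-guided attention: image regions attend to the encoded question.
        v = self.norms[2](v + self.v_guided(v, q, q, need_weights=False)[0])
        # Image self-attention refines relations among the attended regions.
        v = self.norms[3](v + self.v_self(v, v, v, need_weights=False)[0])
        v = self.norms[4](v + self.v_ff(v))
        return q, v

class MEDAN(nn.Module):
    """N MEDA layers cascaded in depth."""

    def __init__(self, num_layers=6, dim=512):
        super().__init__()
        self.layers = nn.ModuleList([MEDALayer(dim) for _ in range(num_layers)])

    def forward(self, q, v):
        for layer in self.layers:
            q, v = layer(q, v)
        return q, v

# Example usage with toy tensors:
# q = torch.randn(2, 14, 512)   # question word features
# v = torch.randn(2, 100, 512)  # image region features
# q_out, v_out = MEDAN(num_layers=6)(q, v)
```

In the full pipeline, the resulting question and image features would then be fused and passed to an answer predictor (cf. the image-question feature fusion section listed below).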

Highlights

  • Visual Question Answering (VQA) is a multimodal learning task that aims to automatically answer a natural language question about the content of a given picture

  • Multimodal Encoder-Decoder Attention Networks (MEDAN) are proposed, which consist of Multimodal Encoder-Decoder Attention (MEDA) layers cascaded in depth and capture rich and reasonable question and image features by associating keywords in the question with important object regions in the image

  • We evaluate the performance for different numbers of layers N ∈ {1, 2, 4, 6, 8}, and the results demonstrate that increasing the number of layers up to N = 6 steadily improves the performance of the proposed Multimodal Encoder-Decoder Attention Networks (MEDAN)


Summary

INTRODUCTION

Visual Question Answering (VQA) is a multimodal learning task that aims to automatically answer a natural language question about the content of a given picture. Most existing VQA methods focus on learning fine-grained question features and image features to obtain richer multimodal feature representations. Co-attention models that relate each keyword in the question to each object region in the image have greatly improved VQA performance by achieving complete interaction between the two modalities, and such models can be stacked in depth. Although current multimodal co-attention models achieve good performance, we found that, when learning fine-grained features of image regions, applying question-guided attention before self-attention differs from the reverse order and obtains better image region representations. To this end, Multimodal Encoder-Decoder Attention Networks (MEDAN) are proposed, which consist of Multimodal Encoder-Decoder Attention (MEDA) layers cascaded in depth and capture rich and reasonable question and image features by associating keywords in the question with important object regions in the image. Evaluation results on the benchmark VQA-v2 [40] dataset show that the proposed MEDAN model achieves state-of-the-art VQA performance. The remainder of this paper is organized as follows: the second part reviews research related to VQA; the third part presents the overall framework and design of MEDAN; the fourth part verifies the effectiveness of MEDAN through experiments; the final part gives a summary.
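
To make the ordering observation concrete, the short sketch below contrasts the two possible orderings for refining image-region features with question features. The helper names, residual form, and toy shapes are illustrative assumptions, not the authors' code; the sketch only shows what "question-guided attention first" versus "self-attention first" means.

```python
import torch
from torch import nn

def guided_then_self(y, x, guided_attn, self_attn):
    """Question-guided attention first, then image self-attention."""
    y = y + guided_attn(y, x, x, need_weights=False)[0]   # regions attend to question words
    return y + self_attn(y, y, y, need_weights=False)[0]  # then refine relations among regions

def self_then_guided(y, x, guided_attn, self_attn):
    """Reverse order: image self-attention first, then question-guided attention."""
    y = y + self_attn(y, y, y, need_weights=False)[0]
    return y + guided_attn(y, x, x, need_weights=False)[0]

if __name__ == "__main__":
    d = 512
    guided = nn.MultiheadAttention(d, num_heads=8, batch_first=True)
    self_att = nn.MultiheadAttention(d, num_heads=8, batch_first=True)
    x = torch.randn(2, 14, d)   # question word features
    y = torch.randn(2, 100, d)  # image region features
    print(guided_then_self(y, x, guided, self_att).shape)  # torch.Size([2, 100, 512])
    print(self_then_guided(y, x, guided, self_att).shape)  # same shape, different features
```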

RELATED WORK
MULTIMODAL ENCODER-DECODER ATTENTION
IMAGE-QUESTION FEATURE FUSION AND
EXPERIMENTS
DATASETS
EXPERIMENTAL DETAILS
Findings
CONCLUSION