Abstract

Visual question answering (VQA) requires a high-level understanding of both questions and images, along with visual reasoning, to predict the correct answer. It is therefore important to design an effective attention model that associates key regions in an image with key words in a question. To date, most attention-based approaches model only the relationships between individual regions in an image and individual words in a question. This is not sufficient for predicting the correct answer, because humans reason over global information, not only local information. In this paper, we propose a novel multi-modality global fusion attention network (MGFAN), consisting of stacked global fusion attention (GFA) blocks, which captures information from a global perspective. Our method computes co-attention and self-attention jointly, rather than computing them separately. We validate the proposed method on the widely used VQA-v2 benchmark. Experimental results show that it outperforms the previous state of the art. Our best single model achieves 70.67% accuracy on the VQA-v2 test-dev set.
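
To make the idea of computing co-attention and self-attention jointly concrete, the following is a minimal PyTorch-style sketch of one way such a block could be realized: region and word features attend over the concatenation of both modalities, so a single attention pass covers relations within each modality and across modalities. The class name, single-head form, and dimensions are illustrative assumptions, not the authors' exact GFA formulation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class JointAttention(nn.Module):
        """One attention pass over the concatenation of image and question
        features, so intra-modality (self) and inter-modality (co) relations
        are computed together rather than separately (illustrative sketch)."""

        def __init__(self, dim):
            super().__init__()
            self.q_proj = nn.Linear(dim, dim)
            self.k_proj = nn.Linear(dim, dim)
            self.v_proj = nn.Linear(dim, dim)
            self.scale = dim ** -0.5

        def forward(self, img_feats, ques_feats):
            # img_feats: (B, m, dim) region features; ques_feats: (B, n, dim) word features
            joint = torch.cat([img_feats, ques_feats], dim=1)                 # (B, m+n, dim)
            q, k, v = self.q_proj(joint), self.k_proj(joint), self.v_proj(joint)
            attn = F.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)      # (B, m+n, m+n)
            out = attn @ v                                                    # attended joint features
            m = img_feats.size(1)
            return out[:, :m], out[:, m:]                                     # split back per modality

A stacked network in the spirit of MGFAN would apply several such blocks in sequence; residual connections and feed-forward sublayers, which attention blocks of this kind typically include, are omitted here for brevity.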

Highlights

  • We propose a novel multi-modality global fusion attention network (MGFAN) for visual question answering (VQA) that computes attention while taking global information into account

  • In global fusion attention (GFA), we summarize all features into k vectors and compute attention over these summaries, so that the attended features incorporate global information (see the sketch after this list)
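
The following sketch illustrates the k-vector summarization idea from the highlight above, under the assumption of a simple learned attention-pooling scheme: all features are first compressed into k global vectors, and each feature then attends to those summaries so that its attended representation carries global context. The value k = 6, the pooling layer, and the class name are assumptions for illustration and may differ from the paper's actual GFA design.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class GlobalSummaryAttention(nn.Module):
        """Compress n input features into k global vectors with learned
        attention pooling, then let every input feature attend to those
        k summaries so its output carries global context (illustrative sketch)."""

        def __init__(self, dim, k=6):
            super().__init__()
            self.pool = nn.Linear(dim, k)          # one pooling distribution per summary vector
            self.q_proj = nn.Linear(dim, dim)
            self.k_proj = nn.Linear(dim, dim)
            self.v_proj = nn.Linear(dim, dim)
            self.scale = dim ** -0.5

        def forward(self, feats):
            # feats: (B, n, dim), e.g. image regions and question words concatenated
            weights = F.softmax(self.pool(feats), dim=1)                      # (B, n, k), softmax over n
            summaries = weights.transpose(1, 2) @ feats                       # (B, k, dim) global vectors
            q = self.q_proj(feats)                                            # each feature queries ...
            k_ = self.k_proj(summaries)                                       # ... the k global summaries
            v = self.v_proj(summaries)
            attn = F.softmax(q @ k_.transpose(1, 2) * self.scale, dim=-1)     # (B, n, k)
            return attn @ v                                                   # (B, n, dim) globally attended features

Applied to the concatenation of image-region and question-word features, the k summaries mix both modalities, which is one way self-attention and co-attention over global information can be obtained in a single step.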

Summary

Introduction

With the development of deep learning, researchers have made great progress on many computer vision tasks in the last several years, e.g., classification [1,2] and detection [3,4]. VQA is the task of answering free-form questions by reasoning about a presented image [14]. It has many practical applications, such as assisting visually impaired people in accessing image information and improving human-machine interaction. VQA is challenging because it requires a high-level understanding of both questions and images, along with visual reasoning, to predict the correct answer. Many methods have been proposed to improve the performance of VQA models. Almost all of them are based on an attention mechanism and focus on adaptively selecting the important features that help predict the correct answer.

