Abstract

Existing visual question answering (VQA) systems commonly use graph neural networks (GNNs) to extract visual relationships such as semantic or spatial relations. However, studies that use GNNs typically ignore the importance of each relation and simply concatenate the outputs from multiple relation encoders. In this paper, we propose a novel layer architecture that fuses multiple visual relations through an attention mechanism to address this issue. Specifically, we develop a model that uses the question embedding and the joint embeddings from the encoders to obtain attention weights that vary dynamically with the question type. Using these learnable attention weights, the proposed model can efficiently exploit the visual relation features needed for a given question. Experimental results on the VQA 2.0 dataset demonstrate that the proposed model outperforms existing graph attention network-based architectures. Additionally, we visualize the attention weights and show that the proposed model assigns higher weights to relations that are more relevant to the question.
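As a rough illustration of the fusion step described above, the sketch below shows one way question-conditioned attention weights over relation-encoder outputs could be computed and applied. The class name, layer shapes, and the concatenation-plus-linear scoring function are our own assumptions for the sketch, not the authors' published implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationAttentionFusion(nn.Module):
    """Hypothetical sketch: fuse relation-encoder outputs with
    question-conditioned attention instead of plain concatenation."""

    def __init__(self, q_dim: int, v_dim: int):
        super().__init__()
        # One shared scorer rates each relation feature against the question.
        self.score = nn.Linear(q_dim + v_dim, 1)

    def forward(self, q, relation_feats):
        # q:              (batch, q_dim) question embedding
        # relation_feats: list of (batch, v_dim) joint embeddings, one per
        #                 relation type (e.g., semantic, spatial, implicit)
        scores = [self.score(torch.cat([q, v], dim=-1)) for v in relation_feats]
        alpha = F.softmax(torch.cat(scores, dim=-1), dim=-1)  # (batch, R)
        stacked = torch.stack(relation_feats, dim=1)          # (batch, R, v_dim)
        fused = (alpha.unsqueeze(-1) * stacked).sum(dim=1)    # (batch, v_dim)
        return fused, alpha  # alpha can be visualized per question
```

Sharing one scoring layer across relation types keeps the weights comparable under the softmax; returning alpha also makes the per-question weights easy to inspect, matching the visualization described in the abstract.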

Highlights

  • VQA is a task that aims to output an answer for a given question related to a given image

  • We propose a novel attention-based model that fuses multiple visual relation features to solve VQA tasks

  • The proposed model outperforms graph attention network-based baselines such as ReGAT on the VQA 2.0 dataset


Summary

Introduction

Visual question answering (VQA) is a task that aims to output an answer for a given question related to a given image. ReGAT constructs GNN-based relation encoders for each relation and combines the output probability distributions from the encoders using fixed weights to make the final prediction. This process can be problematic because the importance of each relationship for the given question cannot be considered. In contrast, we train all relation encoders concurrently and learn adaptive weights to form a combined joint representation. Using these attention weights, the proposed model assigns higher weights to the relations that are meaningful for a given question.
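To make the contrast concrete, here is a hedged usage sketch comparing ReGAT-style late fusion with fixed weights against the question-conditioned fusion; the dummy tensors, the fixed weights, and the shared classifier head are illustrative placeholders, and RelationAttentionFusion refers to the sketch given under the abstract above.

```python
import torch
import torch.nn as nn

batch, num_answers, q_dim, v_dim = 2, 10, 16, 16
torch.manual_seed(0)

# Dummy outputs standing in for the semantic, spatial, and implicit
# relation encoders, plus a dummy question embedding.
v_sem, v_spa, v_imp = (torch.randn(batch, v_dim) for _ in range(3))
q_embed = torch.randn(batch, q_dim)
head = nn.Linear(v_dim, num_answers)  # placeholder answer classifier

# ReGAT-style combination: fixed mixing weights over per-encoder answer
# distributions, identical for every question.
fixed_w = (0.4, 0.3, 0.3)  # illustrative constants, not the paper's values
p_fixed = sum(w * head(v).softmax(-1)
              for w, v in zip(fixed_w, (v_sem, v_spa, v_imp)))

# Proposed combination: question-conditioned weights fuse the joint
# embeddings first; a single classifier then predicts from the fusion.
fusion = RelationAttentionFusion(q_dim, v_dim)  # sketch defined earlier
fused, alpha = fusion(q_embed, [v_sem, v_spa, v_imp])
p_adaptive = head(fused).softmax(-1)
print(alpha)  # per-question weights over the three relation types
```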

