Learning to Select Question-Relevant Relations for Visual Question Answering

Jaewoong Lee, Hwanhee Lee, Kyomin Jung, Heejoon Lee

doi:10.18653/v1/2021.maiworkshop-1.13

Abstract

Existing visual question answering (VQA) systems commonly use graph neural networks (GNNs) to extract visual relationships such as semantic relations or spatial relations. However, studies that use GNNs typically ignore the importance of each relation and simply concatenate the outputs from multiple relation encoders. To address this issue, we propose a novel layer architecture that fuses multiple visual relations through an attention mechanism. Specifically, we develop a model that uses the question embedding and the joint embedding of the encoders to obtain dynamic attention weights that depend on the type of question. Using these learnable attention weights, the proposed model can efficiently use the visual relation features needed for a given question. Experimental results on the VQA 2.0 dataset demonstrate that the proposed model outperforms existing graph attention network-based architectures. Additionally, we visualize the attention weights and show that the proposed model assigns higher weights to relations that are more relevant to the question.

Highlights

VQA is a task that aims to output an answer for a given question related to a given image
We propose a novel attention-based VQA model to solve visual question answering tasks
We organize them into three columns where each column ...
... to achieve the accuracy produced by ReGAT by ...

Summary

Introduction

Visual question answering (VQA) is a task that aims to output an answer for a given question related to a given image. ReGAT constructs GNN-based relation encoders for each relation and combines the output probability distributions from the encoders using fixed weights to make the final prediction. This process can be problematic because it cannot account for the importance of each relation for the given question. We instead train all relation encoders concurrently and learn adaptive weights to form a combined joint representation. Using these attention weights, the proposed model assigns higher weights to the relations that are meaningful for a given question.
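The contrast between fixed-weight combination and question-dependent weighting can be illustrated with a minimal sketch. This is not the authors' implementation: the scoring function (a weighted sum of `tanh` over element-wise products), the names `fixed_fusion`, `attention_fusion`, and `score_vec`, and the toy vectors are all assumptions made for illustration, since this summary does not specify how the attention scores are computed.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fixed_fusion(relation_feats, fixed_weights):
    """Baseline: combine encoder outputs with weights that ignore the question."""
    names = list(relation_feats)
    dim = len(relation_feats[names[0]])
    return [
        sum(fixed_weights[r] * relation_feats[r][i] for r in names)
        for i in range(dim)
    ]


def attention_fusion(question_emb, relation_feats, score_vec):
    """Question-dependent fusion (hypothetical scoring function).

    Each relation gets a scalar score from the question embedding and that
    relation's joint embedding; a softmax turns the scores into weights.
    """
    names = list(relation_feats)
    scores = [
        sum(w * math.tanh(q * v)
            for w, q, v in zip(score_vec, question_emb, relation_feats[r]))
        for r in names
    ]
    weights = softmax(scores)
    dim = len(question_emb)
    fused = [
        sum(a * relation_feats[r][i] for a, r in zip(weights, names))
        for i in range(dim)
    ]
    return dict(zip(names, weights)), fused


# Toy inputs (made up): one question embedding, two relation encoders.
question_emb = [1.0, 0.0, -1.0]
relation_feats = {
    "semantic": [2.0, 0.5, -2.0],   # aligned with the question
    "spatial": [-1.0, 0.5, 1.0],    # opposed to the question
}
score_vec = [1.0, 1.0, 1.0]         # learnable in a real model; fixed here

weights, fused = attention_fusion(question_emb, relation_feats, score_vec)
print(weights)
print(fused)
```

In this toy setup the relation whose embedding aligns with the question receives the larger weight, whereas `fixed_fusion` would return the same mixture regardless of the question.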

Visual Question Answering

Graph Attention Layer

Encoder

Findings

Datasets

Publication Date: Jan 1, 2021
Citations: 1
License type: cc-by
