Abstract

VQA (Visual Question Answering) is a multimodal task: given an image and a natural-language question about it, the model must determine the correct answer. The attention mechanism has become a de facto component of almost all VQA models. Most recent VQA approaches use dot-product attention to calculate the intra-modality and inter-modality attention between visual and language features. In this paper, we instead use the BAN (Bilinear Attention Network) method to calculate attention. We propose a deep multimodality bilinear attention network (DMBA-NET) framework with two basic attention units (BAN-GA and BAN-SA) to construct inter-modality and intra-modality relations. The two basic attention units are the core of the whole framework and can be cascaded in depth. In addition, we encode the question with the dynamic word vectors of BERT (Bidirectional Encoder Representations from Transformers) and process the question features further with self-attention. We then sum these features with those obtained by BAN-GA and BAN-SA before the final classification. Without using the Visual Genome dataset for augmentation, our model reaches an accuracy of 70.85% on the test-std split of VQA 2.0.
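To make the bilinear attention idea concrete, the following is a minimal sketch of a single low-rank bilinear attention glimpse between image regions and question tokens, written in PyTorch. The hidden size, ReLU activation, single-glimpse setup, and all tensor shapes are illustrative assumptions, not the exact DMBA-NET implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BilinearAttentionGlimpse(nn.Module):
    """One low-rank bilinear attention glimpse between two feature sets
    (image regions X and question tokens Y), in the spirit of BAN.
    Sizes and the single-glimpse setup are illustrative assumptions."""

    def __init__(self, x_dim, y_dim, hidden_dim):
        super().__init__()
        self.U = nn.Linear(x_dim, hidden_dim)   # project image features
        self.V = nn.Linear(y_dim, hidden_dim)   # project question features
        # Rank-1 pooling vector p: logits[i, j] = p^T (U x_i * V y_j)
        self.p = nn.Parameter(torch.randn(hidden_dim) / hidden_dim ** 0.5)

    def forward(self, X, Y):
        # X: (batch, n_regions, x_dim), Y: (batch, n_tokens, y_dim)
        Ux = torch.relu(self.U(X))              # (batch, n_regions, hidden)
        Vy = torch.relu(self.V(Y))              # (batch, n_tokens, hidden)
        logits = torch.einsum('bih,h,bjh->bij', Ux, self.p, Vy)
        # Normalize over all (region, token) pairs to get the attention map.
        attn = F.softmax(logits.flatten(1), dim=1).view_as(logits)
        # Joint feature: attention-weighted sum of elementwise products.
        fused = torch.einsum('bij,bih,bjh->bh', attn, Ux, Vy)
        return fused, attn


# Example with detector-style region features and BERT-sized token features.
regions = torch.randn(2, 36, 2048)
tokens = torch.randn(2, 14, 768)
fused, attn_map = BilinearAttentionGlimpse(2048, 768, 512)(regions, tokens)
print(fused.shape, attn_map.shape)  # torch.Size([2, 512]) torch.Size([2, 36, 14])
```

Such a glimpse can serve as the guided-attention unit (between image and question) or, with X = Y, as a self-attention unit within one modality; cascading several of these units in depth is the basic pattern the abstract describes.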

Highlights

  • The goal of VQA (Visual Question Answering) [1] is to build a question answering system with human-like intelligence that can recognize object categories, spatial relationships, and other information in a given image.

  • We added the BERT model on top of the baseline in the first row of the ablation experiments and obtained a 1.6% improvement, which shows that dynamic word vectors improve the model's text representation ability (a sketch of this encoding follows this list).

  • Through the visualization of further amount-type questions, we find that the BAN-GA features in the last layer place a large weight in one column.
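As a companion to the BERT highlight above, here is a minimal sketch of extracting dynamic (contextual) word vectors for a question. It assumes the Hugging Face transformers library and the bert-base-uncased checkpoint, and a fixed question length of 14 tokens; none of these choices are confirmed by the paper.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

question = "What color is the umbrella?"
# Pad/truncate to a fixed question length (14 is assumed here for illustration).
inputs = tokenizer(question, return_tensors="pt", padding="max_length",
                   max_length=14, truncation=True)

with torch.no_grad():
    outputs = bert(**inputs)

# One contextual ("dynamic") vector per token, unlike static word embeddings.
token_features = outputs.last_hidden_state    # shape: (1, 14, 768)
```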

Summary

Introduction

The goal of VQA (Visual Question Answering) [1] is to build a question answering system with human-like intelligence that can recognize object categories, spatial relationships, and other information in a given image. Anderson et al. [9] proposed a bottom-up and top-down attention mechanism and won the VQA Challenge 2017; they use a concatenated attention mechanism to obtain image attention guided by the question. Our work uses bilinear attention to construct inter-modality and intra-modality relations between visual and language features, and a self-attention unit to process the question features further. On this foundation, our model achieves better performance. Compared with MCAN, our model uses a bilinear attention network instead of the more customary dot-product attention. We propose a framework with two basic attention units (BAN-GA and BAN-SA) to construct inter-modality and intra-modality relations between visual and language features. Extensive ablation experiments show that each module in the model contributes to its effectiveness.
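For contrast with the bilinear glimpse sketched after the abstract, here is a minimal version of the scaled dot-product guided attention used by MCAN-style models. The query/key/value projections and multi-head split are omitted, so this is an illustrative simplification rather than MCAN's exact implementation.

```python
import torch
import torch.nn.functional as F

def dot_product_guided_attention(X, Y):
    """Scaled dot-product attention where X (e.g. image regions) is guided
    by Y (e.g. question tokens). Linear projections and multiple heads are
    omitted for brevity; both inputs share the feature dimension d."""
    d = X.size(-1)
    scores = torch.matmul(X, Y.transpose(-2, -1)) / d ** 0.5   # (batch, n_x, n_y)
    weights = F.softmax(scores, dim=-1)                        # normalize over Y
    return torch.matmul(weights, Y)                            # (batch, n_x, d)
```

The difference from the bilinear unit is that here the attention map only re-weights one modality, whereas the bilinear form pools a joint feature from both modalities at once.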

Attention
High-Level Attributes and Knowledge
VQA Pre-Training
Feature Fusion
Deep Modular Bilinear Attention Network
Question and Image Encoding
Multi-Glimpse Bilinear Guided-Attention Network
Multi-Glimpse Bilinear Self-Attention Network
Multi-Head Self-Attention
Feature Fusion and Answer Prediction
Datasets
Experimental Setup
Ablation Analysis
Qualitative Analysis
Method
Findings
Conclusions