Abstract

VQA (Visual Question Answering) is a multimodal task: given an image and a natural-language question about it, the model must determine the correct answer. The attention mechanism has become a de facto component of almost all VQA models. Most recent VQA approaches use dot-product attention to calculate the intra-modality and inter-modality attention between visual and language features. In this paper, we instead use the BAN (Bilinear Attention Network) method to calculate attention. We propose a deep multimodality bilinear attention network (DMBA-NET) framework with two basic attention units (BAN-GA and BAN-SA) to construct inter-modality and intra-modality relations. The two basic attention units are the core of the whole framework and can be cascaded in depth. In addition, we encode the question with the dynamic word vectors of BERT (Bidirectional Encoder Representations from Transformers) and process the question features further with self-attention. We then sum these features with those obtained by BAN-GA and BAN-SA before the final classification. Without using the Visual Genome dataset for augmentation, our model reaches an accuracy of 70.85% on the test-std split of VQA 2.0.
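To make the bilinear attention idea concrete, the following is a minimal sketch of a single low-rank bilinear attention glimpse between image regions and question tokens, written in PyTorch. The hidden size, ReLU activation, single-glimpse setup, and all tensor shapes are illustrative assumptions, not the exact DMBA-NET implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BilinearAttentionGlimpse(nn.Module):
    """One low-rank bilinear attention glimpse between two feature sets
    (image regions X and question tokens Y), in the spirit of BAN.
    Sizes and the single-glimpse setup are illustrative assumptions."""

    def __init__(self, x_dim, y_dim, hidden_dim):
        super().__init__()
        self.U = nn.Linear(x_dim, hidden_dim)   # project image features
        self.V = nn.Linear(y_dim, hidden_dim)   # project question features
        # Rank-1 pooling vector p: logits[i, j] = p^T (U x_i * V y_j)
        self.p = nn.Parameter(torch.randn(hidden_dim) / hidden_dim ** 0.5)

    def forward(self, X, Y):
        # X: (batch, n_regions, x_dim), Y: (batch, n_tokens, y_dim)
        Ux = torch.relu(self.U(X))              # (batch, n_regions, hidden)
        Vy = torch.relu(self.V(Y))              # (batch, n_tokens, hidden)
        logits = torch.einsum('bih,h,bjh->bij', Ux, self.p, Vy)
        # Normalize over all (region, token) pairs to get the attention map.
        attn = F.softmax(logits.flatten(1), dim=1).view_as(logits)
        # Joint feature: attention-weighted sum of elementwise products.
        fused = torch.einsum('bij,bih,bjh->bh', attn, Ux, Vy)
        return fused, attn


# Example with detector-style region features and BERT-sized token features.
regions = torch.randn(2, 36, 2048)
tokens = torch.randn(2, 14, 768)
fused, attn_map = BilinearAttentionGlimpse(2048, 768, 512)(regions, tokens)
print(fused.shape, attn_map.shape)  # torch.Size([2, 512]) torch.Size([2, 36, 14])
```

Such a glimpse can serve as the guided-attention unit (between image and question) or, with X = Y, as a self-attention unit within one modality; cascading several of these units in depth is the basic pattern the abstract describes.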

Highlights

  • The goal of VQA (Visual Question Answering) [1] is to build a question answering system with human-like intelligence that can recognize object categories, spatial relationships, and other information in a given image.

  • We added the BERT model on top of the baseline in the first row of the ablation experiments and obtained a 1.6% improvement, which shows that dynamic word vectors improve the model's text representation ability (a sketch of this encoding follows this list).

  • Through the visualization of further amount-type questions, we find that the BAN-GA features in the last layer place a large weight in one column.
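As a companion to the BERT highlight above, here is a minimal sketch of extracting dynamic (contextual) word vectors for a question. It assumes the Hugging Face transformers library and the bert-base-uncased checkpoint, and a fixed question length of 14 tokens; none of these choices are confirmed by the paper.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

question = "What color is the umbrella?"
# Pad/truncate to a fixed question length (14 is assumed here for illustration).
inputs = tokenizer(question, return_tensors="pt", padding="max_length",
                   max_length=14, truncation=True)

with torch.no_grad():
    outputs = bert(**inputs)

# One contextual ("dynamic") vector per token, unlike static word embeddings.
token_features = outputs.last_hidden_state    # shape: (1, 14, 768)
```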

Summary

Introduction

The goal of VQA (Visual Question Answering) [1] is to build a question answering system with human-like intelligence that can recognize object categories, spatial relationships, and other information in a given image. Anderson et al. [9] proposed a bottom-up and top-down attention mechanism and won the VQA Challenge 2017; they use a concatenated attention mechanism to obtain image attention guided by the question. Our work uses bilinear attention to construct inter-modality and intra-modality relations between visual and language features, and a self-attention unit to process the question features further. On this foundation, our model achieves better performance. Compared with MCAN, our model uses a bilinear attention network instead of the more customary dot-product attention. We propose a framework with two basic attention units (BAN-GA and BAN-SA) to construct inter-modality and intra-modality relations between visual and language features. Extensive ablation experiments show that each module in the model contributes to its effectiveness.
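For contrast with the bilinear glimpse sketched after the abstract, here is a minimal version of the scaled dot-product guided attention used by MCAN-style models. The query/key/value projections and multi-head split are omitted, so this is an illustrative simplification rather than MCAN's exact implementation.

```python
import torch
import torch.nn.functional as F

def dot_product_guided_attention(X, Y):
    """Scaled dot-product attention where X (e.g. image regions) is guided
    by Y (e.g. question tokens). Linear projections and multiple heads are
    omitted for brevity; both inputs share the feature dimension d."""
    d = X.size(-1)
    scores = torch.matmul(X, Y.transpose(-2, -1)) / d ** 0.5   # (batch, n_x, n_y)
    weights = F.softmax(scores, dim=-1)                        # normalize over Y
    return torch.matmul(weights, Y)                            # (batch, n_x, d)
```

The difference from the bilinear unit is that here the attention map only re-weights one modality, whereas the bilinear form pools a joint feature from both modalities at once.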

Attention
High-Level Attributes and Knowledge
VQA Pre-Training
Feature Fusion
Deep Modular Bilinear Attention Network
Question and Image Encoding
Multi-Glimpse Bilinear Guided-Attention Network
Multi-Glimpse Bilinear Self-Attention Network
Multi-Head Self-Attention
Feature Fusion and Answer Prediction
Datasets
Experimental Setup
Ablation Analysis
Qualitative Analysis
Method
Findings
Conclusions