Abstract
VQA (Visual Question Answering) is a multimodal task: given an image and a natural-language question about it, the system must produce the correct answer. The attention mechanism has become a de facto component of almost all VQA models. Most recent VQA approaches use dot-product attention to compute the intra-modality and inter-modality attention between visual and language features. In this paper, we instead use the BAN (Bilinear Attention Network) method to compute attention. We propose a deep multimodal bilinear attention network (DMBA-NET) framework with two basic attention units, BAN-GA and BAN-SA, to construct inter-modality and intra-modality relations. These two units form the core of the framework and can be cascaded in depth. In addition, we encode the question with the dynamic word vectors of BERT (Bidirectional Encoder Representations from Transformers) and process the question features further with self-attention. We then sum these features with those produced by BAN-GA and BAN-SA before the final classification. Without using the Visual Genome dataset for augmentation, our model reaches an accuracy of 70.85% on the test-std set of VQA 2.0.
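To make the bilinear-attention idea concrete, the following is a minimal NumPy sketch of a single bilinear attention unit in the style of BAN. All shapes, weight names (`U`, `V`, `p`), and the single-glimpse simplification are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a flat array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def bilinear_attention(v, q, U, V, p):
    """One bilinear attention glimpse between two modalities (sketch).

    v: (n_regions, v_dim) visual features, q: (n_words, q_dim) question
    features. U, V project each modality into a shared space; p pools the
    joint space to one attention logit per (region, word) pair.
    """
    hv = v @ U                                       # (n_regions, h)
    hq = q @ V                                       # (n_words, h)
    logits = (hv[:, None, :] * hq[None, :, :]) @ p   # (n_regions, n_words)
    att = softmax(logits.ravel()).reshape(logits.shape)
    # Bilinear pooling: fused_d = sum_ij att[i, j] * hv[i, d] * hq[j, d]
    fused = np.einsum('ij,id,jd->d', att, hv, hq)
    return fused, att
```

The attention map here is over region-word pairs rather than over a single modality, which is the key difference from the dot-product attention mentioned above; a BAN-SA-style intra-modality unit would pass the same modality as both inputs.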
Highlights
The task goal of VQA (Visual Question Answering) [1] is to build a question answering system with human-like intelligence that can recognize object categories, spatial relationships, and other information in a given image.
Adding the BERT model to the baseline in the first row of the experiment yields a 1.6% improvement, which shows that dynamic word vectors strengthen the model's text representation.
Visualizing questions about amounts, we find that the BAN-GA features in the last layer concentrate a large weight in a single column.
Summary
The task goal of VQA (Visual Question Answering) [1] is to build a question answering system with human-like intelligence that can recognize object categories, spatial relationships, and other information in a given image. Anderson et al. [9] proposed a bottom-up and top-down attention mechanism and won the VQA Challenge 2017; they use a concatenated attention mechanism to obtain image attention guided by the question. Our work uses bilinear attention to construct inter-modality and intra-modality relations between visual and language features, and a self-attention unit to process the question features further. On this foundation, our model achieves better performance. Compared with MCAN, our model uses a bilinear attention network instead of the more customary dot-product attention. We propose a framework with two basic attention units (BAN-GA and BAN-SA) to construct inter-modality and intra-modality relations between visual and language features. Extensive ablation experiments show that each module in the model contributes to its effectiveness.
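The self-attention step over the question features can be sketched as below. This is a minimal single-head scaled dot-product self-attention in NumPy; the weight names (`Wq`, `Wk`, `Wv`) and dimensions are assumptions for illustration, not the paper's exact configuration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention (sketch).

    x: (n_words, d) question features, e.g. BERT outputs.
    Each word attends over all words in the same question.
    """
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)  # (n_words, n_words)
    return A @ V, A
```

In the framework described above, the refined question features from a unit like this would be summed with the BAN-GA and BAN-SA outputs before the answer classifier.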