Abstract

Visual Question Answering (VQA) aims to provide answers to questions about a given image. It requires a fine-grained understanding of both the image content and the question content. Recent VQA models use attention mechanisms to find the visual features relevant to the given question, which requires multi-step inference. However, most existing work focuses on visual attention only and ignores the role of textual attention in VQA. In this paper, we propose a dynamic capsule co-attention (CapsCoAtt) in which the visual and textual features are treated as capsules and the attention weights are obtained through an iterative process inspired by the capsule network (CapsNet). In addition, to achieve a deep understanding of the questions, we propose a hierarchical question representation with three levels. We evaluate the proposed model on the benchmark VQA 2.0 dataset. The results show a significant improvement in VQA performance with the lowest number of parameters compared with the baseline models.
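The abstract does not give implementation details, but the following is a minimal sketch of how a dynamic-routing-style co-attention between visual and textual "capsules" might look, assuming PyTorch; the class name CapsCoAttention, the num_iterations parameter, the bilinear agreement score, and the routing update rule are all illustrative assumptions, not the authors' released code.

```python
# Illustrative sketch only: capsule-style co-attention in which routing logits
# are refined over a few iterations, in the spirit of CapsNet dynamic routing.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CapsCoAttention(nn.Module):
    def __init__(self, dim: int, num_iterations: int = 3):
        super().__init__()
        self.num_iterations = num_iterations
        # Bilinear map used to score agreement between visual and textual capsules.
        self.W = nn.Parameter(torch.randn(dim, dim) * 0.01)

    def forward(self, visual: torch.Tensor, textual: torch.Tensor):
        # visual:  (batch, num_regions, dim) -- image-region capsules
        # textual: (batch, num_words,   dim) -- question-word capsules
        # Pairwise agreement logits between every visual/textual capsule pair.
        logits = torch.einsum('bvd,de,bte->bvt', visual, self.W, textual)
        # Routing logits start at zero and accumulate agreement iteratively.
        b = torch.zeros_like(logits)
        for _ in range(self.num_iterations):
            v_att = F.softmax(b.mean(dim=2), dim=1)             # attention over regions
            t_att = F.softmax(b.mean(dim=1), dim=1)             # attention over words
            v_ctx = (v_att.unsqueeze(-1) * visual).sum(dim=1)   # attended visual vector
            t_ctx = (t_att.unsqueeze(-1) * textual).sum(dim=1)  # attended textual vector
            # Increase routing logits for pairs that agree with the current
            # attended summaries of the other modality.
            b = b + torch.einsum('bvd,bd->bv', visual, t_ctx).unsqueeze(2) \
                  + torch.einsum('btd,bd->bt', textual, v_ctx).unsqueeze(1)
        return v_ctx, t_ctx, v_att, t_att


if __name__ == "__main__":
    regions = torch.randn(2, 36, 512)   # e.g. 36 region features per image
    words = torch.randn(2, 14, 512)     # e.g. 14 question-word features
    coatt = CapsCoAttention(dim=512)
    v_ctx, t_ctx, v_att, t_att = coatt(regions, words)
    print(v_ctx.shape, t_ctx.shape)     # torch.Size([2, 512]) torch.Size([2, 512])
```

In this sketch the two modalities attend to each other jointly: the same routing logits drive both the visual and the textual attention weights, which is one plausible reading of "co-attention" driven by an iterative, CapsNet-inspired update.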
