Abstract

Visual Question Answering (VQA) aims to provide answers to questions about a given image. It requires a fine-grained understanding of both the image content and the question content. Recent VQA models use attention mechanisms to find the visual features relevant to the given question, which requires multi-step inference. However, most existing work focuses on visual attention only and ignores the role of textual attention in VQA. In this paper, we propose a dynamic capsule co-attention (CapsCoAtt) in which the visual and textual features are treated as capsules and the attention weights are obtained through an iterative process inspired by the capsule network (CapsNet). In addition, to achieve a deep understanding of the questions, we propose a hierarchical question representation with three levels. We evaluate the proposed model on the benchmark VQA 2.0 dataset. The results show a significant improvement in VQA performance with the lowest number of parameters compared with the baseline models.
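The abstract does not give implementation details, but the following is a minimal sketch of how a dynamic-routing-style co-attention between visual and textual "capsules" might look, assuming PyTorch; the class name CapsCoAttention, the num_iterations parameter, the bilinear agreement score, and the routing update rule are all illustrative assumptions, not the authors' released code.

```python
# Illustrative sketch only: capsule-style co-attention in which routing logits
# are refined over a few iterations, in the spirit of CapsNet dynamic routing.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CapsCoAttention(nn.Module):
    def __init__(self, dim: int, num_iterations: int = 3):
        super().__init__()
        self.num_iterations = num_iterations
        # Bilinear map used to score agreement between visual and textual capsules.
        self.W = nn.Parameter(torch.randn(dim, dim) * 0.01)

    def forward(self, visual: torch.Tensor, textual: torch.Tensor):
        # visual:  (batch, num_regions, dim) -- image-region capsules
        # textual: (batch, num_words,   dim) -- question-word capsules
        # Pairwise agreement logits between every visual/textual capsule pair.
        logits = torch.einsum('bvd,de,bte->bvt', visual, self.W, textual)
        # Routing logits start at zero and accumulate agreement iteratively.
        b = torch.zeros_like(logits)
        for _ in range(self.num_iterations):
            v_att = F.softmax(b.mean(dim=2), dim=1)             # attention over regions
            t_att = F.softmax(b.mean(dim=1), dim=1)             # attention over words
            v_ctx = (v_att.unsqueeze(-1) * visual).sum(dim=1)   # attended visual vector
            t_ctx = (t_att.unsqueeze(-1) * textual).sum(dim=1)  # attended textual vector
            # Increase routing logits for pairs that agree with the current
            # attended summaries of the other modality.
            b = b + torch.einsum('bvd,bd->bv', visual, t_ctx).unsqueeze(2) \
                  + torch.einsum('btd,bd->bt', textual, v_ctx).unsqueeze(1)
        return v_ctx, t_ctx, v_att, t_att


if __name__ == "__main__":
    regions = torch.randn(2, 36, 512)   # e.g. 36 region features per image
    words = torch.randn(2, 14, 512)     # e.g. 14 question-word features
    coatt = CapsCoAttention(dim=512)
    v_ctx, t_ctx, v_att, t_att = coatt(regions, words)
    print(v_ctx.shape, t_ctx.shape)     # torch.Size([2, 512]) torch.Size([2, 512])
```

In this sketch the two modalities attend to each other jointly: the same routing logits drive both the visual and the textual attention weights, which is one plausible reading of "co-attention" driven by an iterative, CapsNet-inspired update.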
