Abstract

Transformer architectures have recently been introduced into the field of visual question answering (VQA) due to their powerful capabilities for information extraction and fusion. However, existing Transformer-like models, including models built on a single Transformer structure and large-scale pre-trained generic visual-linguistic models, do not fully utilize the positional information of words in questions and of objects in images, which we show in this paper to be crucial for VQA tasks. To address this challenge, we propose a novel positional attention guided Transformer-like architecture, which adaptively extracts positional information within and across the visual and language modalities and uses this information to guide high-level interactions in inter- and intra-modality information flows. In particular, we design three positional attention modules and assemble them into a single Transformer-like model, MCAN. We show that the positional information introduced in intra-modality interaction can adaptively modulate inter-modality interaction according to different inputs, which plays an important role in visual reasoning. Experimental results demonstrate that our model outperforms state-of-the-art models and is particularly good at handling object counting questions. Overall, our model achieves accuracies of 70.10%, 71.27%, and 71.52% on COCO-QA, VQA v1.0 test-std, and VQA v2.0 test-std, respectively. The source code will be publicly available at https://github.com/waizei/PositionalMCAN.
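To make the idea of positional attention guiding modality interactions concrete, the sketch below shows one common way such guidance can be realized: a learned bias, computed from the relative geometry of object bounding boxes, is added to the attention logits so that positional information modulates how strongly objects attend to each other. This is a minimal, illustrative PyTorch sketch under our own assumptions; the class name, the `pos_bias` network, and the box encoding are hypothetical and do not reproduce the exact module design of the paper.

```python
import torch
import torch.nn as nn


class PositionalSelfAttention(nn.Module):
    """Illustrative self-attention over object features whose attention
    scores are biased by a learned function of relative box positions."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        # Maps a 4-d relative-position feature to one bias per attention head.
        self.pos_bias = nn.Sequential(
            nn.Linear(4, dim), nn.ReLU(), nn.Linear(dim, num_heads)
        )

    def forward(self, x, boxes):
        # x: (B, N, dim) object features; boxes: (B, N, 4) normalized (cx, cy, w, h).
        B, N, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)

        # Pairwise relative geometry between boxes -> per-head attention bias.
        rel = boxes.unsqueeze(2) - boxes.unsqueeze(1)      # (B, N, N, 4)
        bias = self.pos_bias(rel).permute(0, 3, 1, 2)      # (B, heads, N, N)

        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5 + bias
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(B, N, -1)
        return self.proj(out)
```

In this sketch the positional bias is added before the softmax, so it can sharpen or suppress particular object-object interactions depending on their spatial arrangement, which is one plausible mechanism behind the improved counting behavior described above.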
