Abstract

Visual question answering (VQA) is a crucial yet challenging task in multimodal understanding. To answer questions about an image correctly, VQA models must comprehend the fine-grained semantics of both the image and the question. Recent advances have shown that both grid and region features contribute to VQA performance, with grid features surprisingly outperforming region features. However, grid features inevitably introduce visual semantic noise because of their fine granularity. Moreover, ignoring geometric relationships makes it difficult for VQA models to understand the relative positions of objects in an image and to answer questions accurately. In this paper, we propose a visual enhancement network for VQA that leverages region features and position information to enhance grid features, thus generating richer visual grid semantics. First, a grid enhancement multi-head guided-attention module uses the regions surrounding each grid cell to provide visual context, forming rich grid semantics and effectively compensating for the fine granularity of the grid. Second, a novel geometric perception multi-head self-attention processes both types of features, incorporating geometric relations such as the relative direction between objects while exploring internal semantic interactions. Extensive experiments demonstrate that the proposed method achieves competitive results against strong baselines.
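The abstract does not give the exact formulation of either module, but the geometric perception multi-head self-attention plausibly follows the common relation-aware attention pattern, in which a bias derived from the relative geometry of object boxes is added to the attention logits before the softmax. The following is a minimal PyTorch sketch under that assumption; `GeometricSelfAttention`, `relative_geometry`, and `geo_proj` are hypothetical names for illustration, not the paper's actual API.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def relative_geometry(boxes):
    # boxes: (N, 4) as (cx, cy, w, h).
    # Returns (N, N, 4) log-scale relative offsets and size ratios,
    # a typical encoding of pairwise geometric relations (assumption).
    cx, cy, w, h = boxes.unbind(-1)
    dx = torch.log(torch.abs(cx[:, None] - cx[None, :]) / w[:, None] + 1e-3)
    dy = torch.log(torch.abs(cy[:, None] - cy[None, :]) / h[:, None] + 1e-3)
    dw = torch.log(w[:, None] / w[None, :])
    dh = torch.log(h[:, None] / h[None, :])
    return torch.stack([dx, dy, dw, dh], dim=-1)

class GeometricSelfAttention(nn.Module):
    """Multi-head self-attention with a per-head geometric bias (sketch)."""

    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        # Maps the 4-d relative geometry of each pair to one bias per head.
        self.geo_proj = nn.Sequential(
            nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, n_heads)
        )

    def forward(self, x, boxes):
        # x: (N, d_model) visual features; boxes: (N, 4) their positions.
        N = x.size(0)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(N, self.n_heads, self.d_head).transpose(0, 1)  # (H, N, d)
        k = k.view(N, self.n_heads, self.d_head).transpose(0, 1)
        v = v.view(N, self.n_heads, self.d_head).transpose(0, 1)
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5      # (H, N, N)
        geo_bias = self.geo_proj(relative_geometry(boxes))          # (N, N, H)
        scores = scores + geo_bias.permute(2, 0, 1)                 # inject geometry
        attn = F.softmax(scores, dim=-1)
        out = (attn @ v).transpose(0, 1).reshape(N, -1)             # (N, d_model)
        return self.out(out)
```

Because grid cells also have well-defined boxes (the spatial extent of each cell), the same module could in principle attend over grid features, region features, or both, which is consistent with the abstract's statement that the geometric perception attention processes the two feature types.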
