Understanding multimodal information is key to visual question answering (VQA). Most existing approaches use attention mechanisms to acquire a fine-grained understanding of the inputs; however, attention alone does not resolve the potential understanding-bias problem. This paper therefore introduces contextual information into VQA for the first time and presents a context-aware attention network (CAAN) to address it. Building on the modular co-attention network (MCAN) framework, CAAN makes two main contributions. First, it designs a novel absolute-position calculation based on the coordinates of each image region and the actual size of the image; the position information of all regions is then integrated as contextual information to enhance the visual representation. Second, it derives several internal contextual representations from the question itself and uses them when modeling the question words, alleviating the understanding bias caused by the similarity of questions. In addition, we design two models of different scales, CAAN-base and CAAN-large, to explore the effect of the field of view on interaction. Extensive experimental results show that CAAN significantly outperforms MCAN and achieves comparable or better performance than other state-of-the-art approaches, demonstrating that our method can tackle the understanding bias.
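To make the first contribution concrete, the sketch below shows one plausible way to compute absolute-position features from region bounding boxes and the image's actual size; the function name, the (x_min, y_min, x_max, y_max) box convention, and the inclusion of a relative-area term are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def absolute_position_features(boxes, image_width, image_height):
    """Normalize each region's bounding box by the actual image size.

    boxes: array of shape (num_regions, 4) holding (x_min, y_min, x_max, y_max)
    in pixel coordinates. Returns an array of shape (num_regions, 5) with the
    normalized corners plus the relative area of each region.
    """
    boxes = np.asarray(boxes, dtype=np.float32)
    x_min, y_min, x_max, y_max = boxes.T
    # Scale coordinates into [0, 1] using the image's true width and height.
    nx_min, nx_max = x_min / image_width, x_max / image_width
    ny_min, ny_max = y_min / image_height, y_max / image_height
    # Relative area adds a size cue that survives the normalization.
    area = (nx_max - nx_min) * (ny_max - ny_min)
    return np.stack([nx_min, ny_min, nx_max, ny_max, area], axis=1)


# Example: two detected regions in a 640x480 image. In a CAAN-style model,
# these features would be fused with the regions' visual features as context.
boxes = [[32, 40, 320, 240], [100, 100, 600, 460]]
pos = absolute_position_features(boxes, image_width=640, image_height=480)
print(pos.shape)  # (2, 5)
```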