Abstract

Visual question answering (VQA) is a multimodal task in which a model reasons jointly over a given image and a corresponding natural-language question to produce an answer. Traditional VQA models use region-based, top-down image feature representations. This approach detaches regional features from their global context, so the global semantic content of the visual features is underutilized. To address this problem, the relationships between regions, and between regions and the global representation, must be strengthened to obtain more accurate visual features that align better with the question text. This paper therefore proposes a multi-level visual feature enhancement method (MLVE). It consists mainly of a separated visual feature representation module (SVFR) and a joint visual feature representation module (JVFR). A graph attention network is the core component of both modules, enhancing the relationships between regions and between regions and the global representation. The two modules learn different levels of visual semantic relationships, providing richer visual feature representations. The effectiveness of this approach is verified on the VQA 2.0 dataset.
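The abstract describes enhancing region features with a graph attention network that also attends to a global visual node. The sketch below is not the authors' code; it is a minimal, assumption-laden illustration (names such as `RegionGlobalGAT`, `d_model`, and the single-head formulation are hypothetical) of how region features and one pooled global feature could be placed on a fully connected graph and refined by attention, in the spirit of the SVFR/JVFR modules.

```python
# Minimal sketch, assuming PyTorch: region nodes plus one global node refined
# by a single-head graph-attention layer. Illustrative only, not the MLVE code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionGlobalGAT(nn.Module):
    """Single-head graph attention over region nodes and one global node."""
    def __init__(self, d_model: int = 512):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model)   # shared node projection
        self.attn = nn.Linear(2 * d_model, 1)     # pairwise attention score

    def forward(self, regions: torch.Tensor, global_feat: torch.Tensor) -> torch.Tensor:
        # regions: (B, N, d) from a detector; global_feat: (B, d), e.g. pooled grid features
        nodes = torch.cat([regions, global_feat.unsqueeze(1)], dim=1)  # (B, N+1, d)
        h = self.proj(nodes)
        B, M, d = h.shape
        # Attention logits for every node pair on the fully connected graph
        hi = h.unsqueeze(2).expand(B, M, M, d)
        hj = h.unsqueeze(1).expand(B, M, M, d)
        logits = self.attn(torch.cat([hi, hj], dim=-1)).squeeze(-1)    # (B, M, M)
        alpha = F.softmax(F.leaky_relu(logits), dim=-1)
        out = torch.bmm(alpha, h)                                      # aggregate neighbors
        return F.elu(out)                                              # enhanced node features

# Usage: 36 detected regions plus one global feature per image
enhanced = RegionGlobalGAT()(torch.randn(2, 36, 512), torch.randn(2, 512))
print(enhanced.shape)  # torch.Size([2, 37, 512])
```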
