Rich Visual Knowledge-Based Augmentation Network for Visual Question Answering

Liyang Zhang,Pengpeng Zeng,Lianli Gao,Jingkuan Song,Xiangpeng Li,Donghao Liu,Shuaicheng Liu

doi:10.1109/tnnls.2020.3017530

Abstract

Visual question answering (VQA) that involves understanding an image and paired questions develops very quickly with the boost of deep learning in relevant research fields, such as natural language processing and computer vision. Existing works highly rely on the knowledge of the data set. However, some questions require more professional cues other than the data set knowledge to answer questions correctly. To address such an issue, we propose a novel framework named a knowledge-based augmentation network (KAN) for VQA. We introduce object-related open-domain knowledge to assist the question answering. Concretely, we extract more visual information from images and introduce a knowledge graph to provide the necessary common sense or experience for the reasoning process. For these two augmented inputs, we design an attention module that can adjust itself according to the specific questions, such that the importance of external knowledge against detected objects can be balanced adaptively. Extensive experiments show that our KAN achieves state-of-the-art performance on three challenging VQA data sets, i.e., VQA v2, VQA-CP v2, and FVQA. In addition, our open-domain knowledge is also beneficial to VQA baselines. Code is available at https://github.com/yyyanglz/KAN.

Full Text