Abstract

Most existing work on visual question answering (VQA) is dedicated to improving answer-prediction performance, while leaving the explanation of answers unexplored. We argue that exploiting explanations of question answering not only makes VQA explainable, but also quantitatively improves prediction performance. In this paper, we propose a novel network architecture, termed Neural Pivot Network (NPN), for simultaneously answering visual questions and generating explanations within a multi-task learning architecture. NPN is trained on both image-caption and image-question-answer pairs. In principle, CNN-based deep visual features are extracted and sent to both the VQA channel and the captioning module, the latter of which serves as a pivot bridging the source image module to the target QA predictor. This design enables us to leverage large-scale image-captioning training sets, e.g., MS-COCO Caption and Visual Genome Caption, together with cutting-edge image-captioning models, to benefit VQA learning. Quantitatively, the proposed NPN performs significantly better than alternatives and state-of-the-art schemes trained on VQA datasets alone. Moreover, by investigating the by-products of the model, in-depth explanations can be provided along with the answers.
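To make the described layout concrete, below is a minimal sketch (not the authors' code) of a multi-task model in which a shared CNN encoder feeds both a captioning "pivot" decoder and a VQA answer classifier, trained jointly from image-caption and image-question-answer pairs. All module names, dimensions, the shared word embedding, and the concatenation-based fusion are assumptions for illustration only.

```python
# Hypothetical sketch of the NPN-style multi-task layout described in the abstract.
# Assumptions: ResNet-18 backbone, LSTM caption decoder, LSTM question encoder,
# concatenation fusion, and a single shared word embedding for captions/questions.
import torch
import torch.nn as nn
import torchvision.models as models


class NeuralPivotNetSketch(nn.Module):
    def __init__(self, vocab_size=10000, num_answers=3000,
                 embed_dim=512, hidden_dim=512):
        super().__init__()
        # Shared CNN visual encoder (final classification layer removed).
        backbone = models.resnet18()
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])
        self.visual_proj = nn.Linear(512, hidden_dim)

        # Captioning "pivot" branch: LSTM decoder over caption tokens.
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        self.caption_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.caption_out = nn.Linear(hidden_dim, vocab_size)

        # VQA branch: question encoder + answer classifier over fused features.
        self.question_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.answer_head = nn.Linear(hidden_dim * 2, num_answers)

    def forward(self, images, captions, questions):
        # Shared visual features used by both branches.
        v = self.cnn(images).flatten(1)            # (B, 512)
        v = torch.relu(self.visual_proj(v))        # (B, H)

        # Captioning branch: decoder initialized with the visual feature.
        cap_emb = self.word_embed(captions)        # (B, T, E)
        h0 = v.unsqueeze(0)                        # (1, B, H)
        c0 = torch.zeros_like(h0)
        cap_hidden, _ = self.caption_lstm(cap_emb, (h0, c0))
        caption_logits = self.caption_out(cap_hidden)   # (B, T, vocab)

        # VQA branch: encode the question, fuse with visual feature, classify.
        q_emb = self.word_embed(questions)
        _, (q_h, _) = self.question_lstm(q_emb)
        answer_logits = self.answer_head(torch.cat([v, q_h.squeeze(0)], dim=1))
        return caption_logits, answer_logits
```

Under this sketch, training would sum a caption cross-entropy loss (from image-caption pairs) and an answer cross-entropy loss (from image-question-answer pairs), so that the captioning pivot regularizes the shared visual representation used by the VQA head.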
