Abstract

Most existing work on visual question answering (VQA) is dedicated to improving answer-prediction performance, while leaving the explanation of answers unexplored. We argue that exploiting explanations of question answering not only makes VQA explainable, but also quantitatively improves prediction performance. In this paper, we propose a novel network architecture, termed Neural Pivot Network (NPN), for simultaneously answering visual questions and generating explanations within a multi-task learning architecture. NPN is trained on both image-caption and image-question-answer pairs. In principle, CNN-based deep visual features are extracted and sent to both the VQA channel and the captioning module, the latter of which serves as a pivot bridging the source image module to the target QA predictor. This design enables us to leverage large-scale image-captioning training sets, e.g., MS-COCO Caption and Visual Genome Caption, together with cutting-edge image-captioning models, to benefit VQA learning. Quantitatively, the proposed NPN performs significantly better than alternatives and state-of-the-art schemes trained on VQA datasets alone. Moreover, by investigating the by-products of the model, in-depth explanations can be provided along with the answers.
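To make the described layout concrete, below is a minimal sketch (not the authors' code) of a multi-task model in which a shared CNN encoder feeds both a captioning "pivot" decoder and a VQA answer classifier, trained jointly from image-caption and image-question-answer pairs. All module names, dimensions, the shared word embedding, and the concatenation-based fusion are assumptions for illustration only.

```python
# Hypothetical sketch of the NPN-style multi-task layout described in the abstract.
# Assumptions: ResNet-18 backbone, LSTM caption decoder, LSTM question encoder,
# concatenation fusion, and a single shared word embedding for captions/questions.
import torch
import torch.nn as nn
import torchvision.models as models


class NeuralPivotNetSketch(nn.Module):
    def __init__(self, vocab_size=10000, num_answers=3000,
                 embed_dim=512, hidden_dim=512):
        super().__init__()
        # Shared CNN visual encoder (final classification layer removed).
        backbone = models.resnet18()
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])
        self.visual_proj = nn.Linear(512, hidden_dim)

        # Captioning "pivot" branch: LSTM decoder over caption tokens.
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        self.caption_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.caption_out = nn.Linear(hidden_dim, vocab_size)

        # VQA branch: question encoder + answer classifier over fused features.
        self.question_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.answer_head = nn.Linear(hidden_dim * 2, num_answers)

    def forward(self, images, captions, questions):
        # Shared visual features used by both branches.
        v = self.cnn(images).flatten(1)            # (B, 512)
        v = torch.relu(self.visual_proj(v))        # (B, H)

        # Captioning branch: decoder initialized with the visual feature.
        cap_emb = self.word_embed(captions)        # (B, T, E)
        h0 = v.unsqueeze(0)                        # (1, B, H)
        c0 = torch.zeros_like(h0)
        cap_hidden, _ = self.caption_lstm(cap_emb, (h0, c0))
        caption_logits = self.caption_out(cap_hidden)   # (B, T, vocab)

        # VQA branch: encode the question, fuse with visual feature, classify.
        q_emb = self.word_embed(questions)
        _, (q_h, _) = self.question_lstm(q_emb)
        answer_logits = self.answer_head(torch.cat([v, q_h.squeeze(0)], dim=1))
        return caption_logits, answer_logits
```

Under this sketch, training would sum a caption cross-entropy loss (from image-caption pairs) and an answer cross-entropy loss (from image-question-answer pairs), so that the captioning pivot regularizes the shared visual representation used by the VQA head.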
