Abstract

VQA (Visual Question Answering) models tend to make incorrect inferences on questions that require reasoning over world knowledge. A recent study has shown that training VQA models on questions that provide lower-level perceptual information, alongside the reasoning questions themselves, improves performance. Inspired by this, we propose a novel VQA model that generates questions to actively obtain auxiliary perceptual information useful for correct reasoning. Our model consists of a VQA model for answering questions, a Visual Question Generation (VQG) model for generating questions, and an Info-score model that estimates how much information a generated question contains that is useful for answering the original question. We train the VQG model to maximize the "informativeness" provided by the Info-score model, so that it generates questions carrying as much information as possible about the answer to the original question. Our experiments show that, by feeding the generated questions and their answers to the VQA model as additional input, it predicts answers more accurately than the baseline model.
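As a rough illustration of the pipeline the abstract describes (not the authors' code), the three components could be wired together at inference time as sketched below. All class and function names here are hypothetical placeholders introduced for exposition; the paper does not specify these interfaces.

```python
from dataclasses import dataclass

# Hypothetical interfaces for the three components described in the
# abstract. None of these names come from the paper; they only
# illustrate the data flow between the models.

@dataclass
class QAPair:
    question: str
    answer: str

class VQAModel:
    def answer(self, image, question: str, context: list[QAPair]) -> str:
        """Predict an answer given the image, the question, and any
        auxiliary QA pairs supplied as additional input."""
        raise NotImplementedError

class VQGModel:
    def generate(self, image, question: str, k: int) -> list[str]:
        """Propose k auxiliary questions about lower-level perceptual
        content that may help answer the original question."""
        raise NotImplementedError

class InfoScoreModel:
    def score(self, image, question: str, aux_question: str) -> float:
        """Estimate how much information the auxiliary question carries
        about the answer to the original question (the reward the VQG
        model is trained to maximize)."""
        raise NotImplementedError

def answer_with_auxiliary_question(vqa: VQAModel, vqg: VQGModel,
                                   info: InfoScoreModel,
                                   image, question: str, k: int = 5) -> str:
    # 1. Actively generate candidate perceptual questions.
    candidates = vqg.generate(image, question, k)
    # 2. Keep the most informative candidate per the Info-score model.
    best = max(candidates, key=lambda q: info.score(image, question, q))
    # 3. Answer the auxiliary question with the VQA model itself.
    aux = QAPair(best, vqa.answer(image, best, context=[]))
    # 4. Feed the auxiliary QA pair back in as additional input.
    return vqa.answer(image, question, context=[aux])
```

At training time, under this sketch, the Info-score model's output would serve as the reward signal for the VQG model, encouraging it to generate questions whose answers are informative about the original question's answer.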
