Abstract

The visual question generation task aims to generate meaningful questions about an image according to a target answer. Existing studies mainly focus on a single object related to the target answer when generating a question. However, a target answer is often related to multiple key objects in an image; focusing on only one object may mislead the model into generating questions that cover only a fragment of the answer. To address this problem, we propose a multi-object-aware generation model that captures all key objects related to an answer and generates the corresponding question. We first introduce a co-attention network to capture the relationship between each object in an image and the answer, and extract the key objects related to the answer. We then introduce a graph network to capture the relationships between the key objects and the other objects in the image that are not related to the answer, which helps generate questions that involve richer visual content. Finally, the information learned by the graph network is fed into a standard decoder module to produce questions. Extensive experiments on the VQA v2.0 dataset show that the proposed model outperforms state-of-the-art models.
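The abstract's pipeline (answer-conditioned co-attention over detected object features, key-object selection, and a relation graph over all objects) can be sketched roughly as below. This is a minimal illustration, not the authors' code: the module names, feature dimensions, top-k selection rule, and single-step message passing are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiObjectAwareEncoder(nn.Module):
    """Hypothetical sketch of the described pipeline:
    1. co-attention scores each detected object against the answer,
    2. the top-k objects are treated as "key objects",
    3. one graph step propagates information between key objects
       and the remaining objects in the image.
    Dimensions and the top-k rule are assumptions, not the paper's code.
    """
    def __init__(self, obj_dim=2048, ans_dim=300, hid_dim=512, k=3):
        super().__init__()
        self.obj_proj = nn.Linear(obj_dim, hid_dim)
        self.ans_proj = nn.Linear(ans_dim, hid_dim)
        self.k = k
        # one round of message passing from key objects to all objects
        self.msg = nn.Linear(2 * hid_dim, hid_dim)

    def forward(self, obj_feats, ans_emb):
        # obj_feats: (B, N, obj_dim), e.g. detected region features
        # ans_emb:   (B, ans_dim), e.g. averaged answer word embeddings
        objs = torch.tanh(self.obj_proj(obj_feats))             # (B, N, H)
        ans = torch.tanh(self.ans_proj(ans_emb)).unsqueeze(1)   # (B, 1, H)

        # co-attention: relevance of each object to the answer
        scores = (objs * ans).sum(-1)                           # (B, N)
        attn = F.softmax(scores, dim=-1)

        # pick the k most answer-relevant objects as key objects
        key_idx = attn.topk(self.k, dim=-1).indices             # (B, k)
        key_mask = torch.zeros_like(attn).scatter_(1, key_idx, 1.0)

        # graph step: every object aggregates a message from the key
        # objects, spreading answer-related content to the whole scene
        key_ctx = (objs * key_mask.unsqueeze(-1)).sum(1, keepdim=True) / self.k
        updated = objs + torch.tanh(
            self.msg(torch.cat([objs, key_ctx.expand_as(objs)], dim=-1))
        )
        return updated, attn  # `updated` would feed a question decoder


if __name__ == "__main__":
    enc = MultiObjectAwareEncoder()
    out, attn = enc(torch.randn(2, 36, 2048), torch.randn(2, 300))
    print(out.shape, attn.shape)  # (2, 36, 512) and (2, 36)
```

In the actual model, the graph network and decoder are presumably more elaborate; this sketch only illustrates how answer-conditioned attention can select multiple key objects before graph-based encoding.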
