Abstract

Visual question answering (VQA) is the task of producing correct answers to questions about images. When given a question that is irrelevant to an image, existing VQA models will still produce an answer rather than predict that the question is irrelevant. This behavior shows that current VQA models do not truly understand images and questions. Moreover, producing answers to irrelevant questions can be misleading in real-world applications. To tackle this problem, we hypothesize that the abilities required for detecting irrelevant questions are similar to those required for answering questions. Based on this hypothesis, we study what performance a state-of-the-art VQA network can achieve when trained on irrelevant question detection. We then analyze the influence of reasoning and relational modeling on the irrelevant question detection task. Our experimental results indicate that a VQA network trained on an irrelevant question detection dataset outperforms existing state-of-the-art methods by a large margin on the task of irrelevant question detection. Ablation studies show that explicit reasoning and relational modeling benefit irrelevant question detection. Finally, we investigate a straightforward approach to integrating the ability to detect irrelevant questions into VQA models: joint training with extended VQA data that contains irrelevant cases. The results suggest that joint training has a negative impact on the model's performance on the VQA task, while its accuracy on relevance detection is maintained. In summary, we claim that an efficient neural network designed for VQA can achieve high accuracy on relevance detection; however, integrating relevance detection into a VQA model through joint training degrades performance on the VQA task.
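To make the joint-training setup concrete, the sketch below is a minimal, hypothetical illustration (not the authors' implementation): it assumes a generic VQA network whose answer vocabulary is extended with one extra "irrelevant" class, so relevant and irrelevant examples are trained under a single cross-entropy objective. All names (VQAModel, training_step, feature dimensions) are placeholders introduced only for illustration.

```python
# Minimal sketch of joint training on VQA data extended with irrelevant cases.
# Names and dimensions are hypothetical placeholders, not the authors' code.
import torch
import torch.nn as nn

NUM_ANSWERS = 3000             # ordinary VQA answer vocabulary size (assumed)
IRRELEVANT_IDX = NUM_ANSWERS   # extra class meaning "question is irrelevant"

class VQAModel(nn.Module):
    """Toy stand-in for a VQA network: fuses image and question features."""
    def __init__(self, img_dim=2048, q_dim=1024, hidden=1024):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(img_dim + q_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, NUM_ANSWERS + 1))  # +1 for the irrelevant class

    def forward(self, img_feat, q_feat):
        return self.fuse(torch.cat([img_feat, q_feat], dim=-1))

model = VQAModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

def training_step(img_feat, q_feat, answer_idx):
    """One joint-training step; irrelevant examples carry answer_idx == IRRELEVANT_IDX."""
    logits = model(img_feat, q_feat)
    loss = criterion(logits, answer_idx)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example batch: two relevant questions and one irrelevant one.
img = torch.randn(3, 2048)
q = torch.randn(3, 1024)
labels = torch.tensor([42, 7, IRRELEVANT_IDX])
print(training_step(img, q, labels))
```

Under this setup, the relevance-detection ability is absorbed into the answer classifier itself, which is the configuration the abstract reports as maintaining relevance accuracy while degrading VQA accuracy.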
