Abstract

While models for Visual Question Answering (VQA) have steadily improved over the years, interacting with one quickly reveals that these models lack consistency. For instance, if a model answers “red” to “What color is the balloon?”, it might answer “no” if asked, “Is the balloon red?”. These responses violate simple notions of entailment and raise questions about how effectively VQA models ground language. In this work, we introduce a dataset, ConVQA, and metrics that enable quantitative evaluation of consistency in VQA. For a given observable fact in an image (e.g. the balloon’s color), we generate a set of logically consistent question-answer (QA) pairs (e.g. “Is the balloon red?”) and also collect a human-annotated set of commonsense-based consistent QA pairs (e.g. “Is the balloon the same color as tomato sauce?”). Further, we propose a consistency-improving data augmentation module, the Consistency Teacher Module (CTM). CTM automatically generates entailed (or similar-intent) questions for a source QA pair and fine-tunes the VQA model if the VQA model’s answer to the entailed question is consistent with the source QA pair. We demonstrate that our CTM-based training improves the consistency of VQA models on the ConVQA datasets and is a strong baseline for further research.
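The consistency metrics themselves are not spelled out in this summary. As a rough, assumption-laden illustration, one natural set-level measure treats a group of QA pairs about the same observable fact as consistent only if the model answers every question in the group correctly. In the minimal Python sketch below, `consistency_score`, `qa_groups`, and `answer_fn` are hypothetical names, not the paper's reference implementation.

```python
from typing import Callable, List, Tuple

QAPair = Tuple[str, str]  # (question, ground-truth answer)

def consistency_score(qa_groups: List[List[QAPair]],
                      answer_fn: Callable[[str], str]) -> float:
    """Fraction of QA groups answered with no internal contradiction,
    i.e. every answer in the group matches the ground truth."""
    consistent = sum(
        all(answer_fn(q).strip().lower() == a.strip().lower()
            for q, a in group)
        for group in qa_groups
    )
    return consistent / max(len(qa_groups), 1)

# One fact ("the balloon is red") expanded into related QA pairs.
groups = [[("What color is the balloon?", "red"),
           ("Is the balloon red?", "yes"),
           ("Is the balloon blue?", "no")]]

# A toy "model" that answers yes to every yes/no question contradicts
# itself on this group, so the score is 0.0.
print(consistency_score(groups,
                        lambda q: "red" if q.startswith("What") else "yes"))
```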

Highlights

  • Visual Question Answering (VQA) (Antol et al., 2015) involves answering natural language questions about images

  • To improve the consistency of VQA models, we propose a Consistency Teacher Module (CTM), which consists of a Question Generator that synthesizes entailed questions given a seed QA pair and a Consistency Checker that examines whether answers to those similar-intent questions are consistent

  • We demonstrate that our approach improves the performance of a baseline VQA model on our ConVQA testing sets in terms of both accuracy and consistency

Summary

Introduction

Visual Question Answering (VQA) (Antol et al., 2015) involves answering natural language questions about images. Consistent question-answer (QA) pairs can be derived based on simple notions of logic or by commonsense reasoning. For example, if an image contains a vegetarian pizza, a logic-based QA pair could be “Is it a vegetarian pizza? Yes,” while a commonsense-based pair could be “Does the pizza have meat on it? No.” While attempts have been made to construct logic-based consistent VQA datasets (Hudson and Manning, 2019), they still fall short on commonsense-based consistency. To improve the consistency of VQA models, we propose a Consistency Teacher Module (CTM), which consists of a Question Generator that synthesizes entailed (or similar-intent) questions given a seed QA pair and a Consistency Checker that examines whether answers to those similar-intent questions are consistent with the seed. Our datasets and models will be available at https://bit.ly/32exlM7
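As a concrete illustration of this loop, the sketch below pairs a toy rule-based entailed-question generator with a consistency-gated fine-tuning step. Here `generate_entailed`, `vqa_model`, and its `answer` and `fine_tune` methods are hypothetical placeholders; the paper's actual Question Generator is learned, not rule-based.

```python
from typing import List, Tuple

def generate_entailed(question: str, answer: str) -> List[Tuple[str, str]]:
    """Toy rule: turn an attribute question and its answer (e.g.
    "What color is the balloon?" / "red") into an entailed yes/no QA pair."""
    entailed = []
    prefix = "what color is "
    if question.lower().startswith(prefix):
        obj = question[len(prefix):].rstrip("?").strip()
        entailed.append((f"Is {obj} {answer}?", "yes"))
    return entailed

def ctm_step(vqa_model, image, seed_q: str, seed_a: str) -> None:
    """One CTM update: generate entailed questions for the seed QA pair
    and fine-tune only on those the model answers consistently."""
    for ent_q, ent_a in generate_entailed(seed_q, seed_a):
        pred = vqa_model.answer(image, ent_q)         # hypothetical API
        if pred == ent_a:                             # Consistency Checker
            vqa_model.fine_tune(image, ent_q, ent_a)  # hypothetical API

print(generate_entailed("What color is the balloon?", "red"))
# -> [('Is the balloon red?', 'yes')]
```

In this reading of the abstract, the Consistency Checker acts as a filter on the augmented training data: only entailed QA pairs whose answers agree with the source pair are fed back as fine-tuning targets.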

Related Work
ConVQA Datasets
Approach
Experiments
Results and Analysis
Conclusion and Discussion
