Abstract
While models for Visual Question Answering (VQA) have steadily improved over the years, interacting with one quickly reveals that these models lack consistency. For instance, if a model answers “red” to “What color is the balloon?”, it might answer “no” when asked “Is the balloon red?”. These responses violate simple notions of entailment and raise questions about how effectively VQA models ground language. In this work, we introduce a dataset, ConVQA, and metrics that enable quantitative evaluation of consistency in VQA. For a given observable fact in an image (e.g., the balloon’s color), we generate a set of logically consistent question-answer (QA) pairs (e.g., “Is the balloon red?”) and also collect a human-annotated set of commonsense-based consistent QA pairs (e.g., “Is the balloon the same color as tomato sauce?”). Further, we propose a consistency-improving data-augmentation module, the Consistency Teacher Module (CTM). CTM automatically generates entailed (or similar-intent) questions for a source QA pair and fine-tunes the VQA model on a generated QA pair if the model’s answer to the entailed question is consistent with the source QA pair. We demonstrate that CTM-based training improves the consistency of VQA models on the ConVQA datasets and provides a strong baseline for further research.
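The consistency metrics themselves are defined in the full paper; purely as an illustrative sketch, one natural group-level measure, the fraction of entailed QA groups that a model answers without any inconsistency, could be computed as follows (the data layout and the predict interface are hypothetical stand-ins, not the paper’s released code):

```python
from typing import Callable, List, Tuple

# A "group" bundles an image with all logically/commonsense
# consistent QA pairs derived from one observable fact in it
# (e.g., every question about the balloon's color).
QAGroup = Tuple[str, List[Tuple[str, str]]]  # (image_path, [(question, answer), ...])

def consistency_score(
    groups: List[QAGroup],
    predict: Callable[[str, str], str],  # (image_path, question) -> answer
) -> float:
    """Fraction of groups in which every entailed question is
    answered in agreement with its ground-truth answer."""
    fully_consistent = sum(
        all(predict(img, q) == a for q, a in qas)
        for img, qas in groups
    )
    return fully_consistent / len(groups)
```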
Highlights
Visual Question Answering (VQA) (Antol et al., 2015) involves answering natural language questions about images.
To improve the consistency of VQA models, we propose a Consistency Teacher Module (CTM), which consists of a Question Generator that synthesizes entailed (or similar-intent) questions given a seed QA pair and a Consistency Checker that examines whether the model’s answers to those similar-intent questions are consistent.
We demonstrate that our approach improves the performance of a baseline VQA model on our ConVQA test sets in terms of both accuracy and consistency.
Summary
Visual Question Answering (VQA) (Antol et al., 2015) involves answering natural language questions about images. Consistent question-answer (QA) pairs can be derived based on simple notions of logic or by commonsense reasoning. For instance, if an image contains a “vegetarian pizza”, a consistent QA pair can be “is it a vegetarian pizza?” – “yes”. While attempts have been made to construct logic-based consistent VQA datasets (Hudson and Manning, 2019), they still fall short on commonsense-based consistency. To improve the consistency of VQA models, we propose a Consistency Teacher Module (CTM), which consists of a Question Generator that synthesizes entailed (or similar-intent) questions given a seed QA pair and a Consistency Checker that examines whether the model’s answers to those similar-intent questions are consistent. Our datasets and models will be available at https://bit.ly/32exlM7.
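To make the CTM pipeline concrete, a minimal sketch of the data-augmentation loop described above might look like the following; the QuestionGenerator and ConsistencyChecker interfaces and all names here are hypothetical stand-ins under our reading of the abstract, not the paper’s released implementation:

```python
def ctm_augment(vqa_model, question_generator, consistency_checker, seed_qas):
    """Sketch of one CTM pass: generate entailed questions for each
    source QA pair, keep those the model answers consistently with
    the source, and fine-tune the VQA model on the retained pairs."""
    augmented = []
    for image, question, answer in seed_qas:
        for gen_q, gen_a in question_generator.generate(question, answer):
            pred = vqa_model.answer(image, gen_q)
            # Retain the generated pair only when the model's answer
            # is consistent with the source QA pair.
            if consistency_checker.is_consistent((question, answer), (gen_q, pred)):
                augmented.append((image, gen_q, gen_a))
    vqa_model.fine_tune(augmented)
    return augmented
```

Filtering on the model’s own consistent answers keeps the augmentation automatic: no human annotation is needed beyond the seed QA pairs.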