Abstract

Existing Visual Question Answering (VQA) models suffer from language priors: their answers rely excessively on correlations between questions and answers while ignoring the actual visual information, which leads to a significant performance drop on out-of-distribution datasets. To eliminate such language bias, prevalent approaches mainly weaken the language prior with an auxiliary question-only branch, yet they model the statistical distribution of question type–answer pairs rather than that of question–answer pairs. In addition, most models produce answers with improper visual grounding. This paper proposes a model-agnostic framework that addresses these drawbacks through question-conditioned debiasing with focal visual context fusion. First, instead of question type-conditioned correlations, we overcome the language distribution shortcut from the perspective of question-conditioned correlations by removing the shortcut between a question and its most frequent answer. Second, we use the deviation between the predicted answer distribution and the ground truth as a pseudo target, preventing the model from falling back on the distribution bias of other frequent answers. Further, we highlight the imbalance between the number of images and questions, which poses higher requirements on a proper visual context. We improve the model's correct visual utilization through contrastive sampling and design a focal visual context fusion module that incorporates the critical object words, extracted from the question by Part-Of-Speech tagging, into the visual features to augment salient visual information without human annotations. Extensive experiments on three public benchmark datasets, i.e., VQA v2, VQA-CP v2, and VQA-CP v1, demonstrate the effectiveness of our model.
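
The following is a minimal sketch, not the authors' implementation, of the object-word extraction step described above, using NLTK for Part-Of-Speech tagging. The `focal_fusion` function, its `region_feats`, `region_labels`, and `boost` arguments are hypothetical placeholders, since the abstract does not specify the exact fusion rule used to emphasize the salient visual regions.

```python
import nltk
import numpy as np

# Fetch the tokenizer and tagger models on first run.
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)


def extract_object_words(question):
    """Return the noun tokens of a question via Part-Of-Speech tagging.

    These nouns approximate the critical object words that a focal
    visual context fusion module could attend to.
    """
    tokens = nltk.word_tokenize(question)
    tagged = nltk.pos_tag(tokens)
    return [word for word, tag in tagged if tag.startswith("NN")]


def focal_fusion(region_feats, region_labels, object_words, boost=2.0):
    """Hypothetical fusion rule: up-weight detected region features whose
    predicted labels match the question's object words, then renormalize."""
    weights = np.array(
        [boost if label in object_words else 1.0 for label in region_labels],
        dtype=np.float32,
    )
    weights /= weights.sum()
    # Emphasized visual context: (num_regions, feat_dim) scaled per region.
    return weights[:, None] * region_feats


# Example usage with a toy question and random region features.
question = "What color is the dog on the sofa?"
object_words = extract_object_words(question)  # typically ['color', 'dog', 'sofa']
region_feats = np.random.rand(4, 8).astype(np.float32)
region_labels = ["dog", "sofa", "lamp", "rug"]
fused = focal_fusion(region_feats, region_labels, object_words)
print(object_words, fused.shape)
```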
