Abstract

Existing Visual Question Answering (VQA) models suffer from language priors: their answers rely excessively on correlations between questions and answers while ignoring the actual visual information, which leads to a significant performance drop on out-of-distribution datasets. To eliminate such language bias, prevalent approaches mainly weaken the language prior with an auxiliary question-only branch, yet they target the statistical distribution of question-type–answer pairs rather than that of question–answer pairs. Moreover, most models produce answers with improper visual grounding. This paper proposes a model-agnostic framework that addresses these drawbacks through question-conditioned debiasing with focal visual context fusion. First, instead of question-type-conditioned correlations, we overcome the language distribution shortcut at the level of question-conditioned correlations by removing the shortcut between each question and its most frequent answer. Second, we use the deviation between the predicted answer distribution and the ground truth as a pseudo target, preventing the model from falling into the distribution bias of other frequent answers. Further, we highlight that the imbalance between the number of images and questions poses higher requirements on proper visual context: we improve correct visual utilization through contrastive sampling and design a focal visual context fusion module that incorporates the critical object word, extracted from the question via Part-of-Speech tagging, into the visual features to augment salient visual information without human annotations. Extensive experiments on three public benchmark datasets, i.e., VQA v2, VQA-CP v2, and VQA-CP v1, demonstrate the effectiveness of our model.
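To make the pseudo-target idea concrete, the sketch below shows one plausible reading of "using the deviation between the predicted answer distribution and the ground truth as a pseudo target": probability mass the model over-assigns to frequent (biased) answers is discarded, and the remaining under-predicted mass is renormalized into a new training target. The function name and the exact formula are illustrative assumptions, not the paper's formulation.

```python
import numpy as np

def pseudo_target(pred: np.ndarray, gt: np.ndarray) -> np.ndarray:
    """Illustrative sketch (not the paper's exact formulation).

    pred: predicted answer distribution (sums to 1).
    gt:   (soft) ground-truth answer scores.
    Keeps only the mass the model under-predicts relative to the ground
    truth, so over-predicted frequent answers contribute nothing, then
    renormalizes the result into a distribution.
    """
    deviation = np.clip(gt - pred, 0.0, None)  # under-predicted mass only
    total = deviation.sum()
    if total == 0.0:
        return gt / gt.sum()  # prediction already covers the ground truth
    return deviation / total

# Toy example with 4 candidate answers; answer 0 is a frequent biased answer.
pred = np.array([0.6, 0.1, 0.2, 0.1])   # biased toward answer 0
gt   = np.array([0.0, 0.3, 0.7, 0.0])   # soft ground-truth scores
target = pseudo_target(pred, gt)
```

Here the over-predicted answer 0 receives zero weight in the pseudo target, while the under-predicted ground-truth answers 1 and 2 share the renormalized mass, so training no longer reinforces the frequent-answer bias.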
