Abstract

Existing Visual Question Answering (VQA) models suffer from language priors: their answers rely excessively on correlations between questions and answers while ignoring the actual visual information, which leads to a significant performance drop on out-of-distribution datasets. To eliminate such language bias, prevalent approaches mainly weaken the language prior with an auxiliary question-only branch, yet they model the statistical distribution of question type–answer pairs rather than that of question–answer pairs. In addition, most models produce answers with improper visual grounding. This paper proposes a model-agnostic framework that addresses these drawbacks through question-conditioned debiasing with focal visual context fusion. First, instead of question type-conditioned correlations, we overcome the language distribution shortcut from the perspective of question-conditioned correlations by removing the shortcut between a question and its most frequent answer. Second, we use the deviation between the predicted answer distribution and the ground truth as a pseudo target, preventing the model from falling back on the distribution bias of other frequent answers. Further, we highlight the imbalance between the number of images and questions, which poses higher requirements on a proper visual context. We improve the model's correct visual utilization through contrastive sampling and design a focal visual context fusion module that incorporates the critical object words, extracted from the question by Part-Of-Speech tagging, into the visual features to augment salient visual information without human annotations. Extensive experiments on three public benchmark datasets, i.e., VQA v2, VQA-CP v2, and VQA-CP v1, demonstrate the effectiveness of our model.
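
The following is a minimal sketch, not the authors' implementation, of the object-word extraction step described above, using NLTK for Part-Of-Speech tagging. The `focal_fusion` function, its `region_feats`, `region_labels`, and `boost` arguments are hypothetical placeholders, since the abstract does not specify the exact fusion rule used to emphasize the salient visual regions.

```python
import nltk
import numpy as np

# Fetch the tokenizer and tagger models on first run.
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)


def extract_object_words(question):
    """Return the noun tokens of a question via Part-Of-Speech tagging.

    These nouns approximate the critical object words that a focal
    visual context fusion module could attend to.
    """
    tokens = nltk.word_tokenize(question)
    tagged = nltk.pos_tag(tokens)
    return [word for word, tag in tagged if tag.startswith("NN")]


def focal_fusion(region_feats, region_labels, object_words, boost=2.0):
    """Hypothetical fusion rule: up-weight detected region features whose
    predicted labels match the question's object words, then renormalize."""
    weights = np.array(
        [boost if label in object_words else 1.0 for label in region_labels],
        dtype=np.float32,
    )
    weights /= weights.sum()
    # Emphasized visual context: (num_regions, feat_dim) scaled per region.
    return weights[:, None] * region_feats


# Example usage with a toy question and random region features.
question = "What color is the dog on the sofa?"
object_words = extract_object_words(question)  # typically ['color', 'dog', 'sofa']
region_feats = np.random.rand(4, 8).astype(np.float32)
region_labels = ["dog", "sofa", "lamp", "rug"]
fused = focal_fusion(region_feats, region_labels, object_words)
print(object_words, fused.shape)
```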
