Abstract

Recent observations have revealed that Visual Question Answering models are susceptible to learning the spurious correlations formed by dataset biases, i.e., the language priors, instead of the intended solution. For instance, given a question and a relative image, some VQA systems are prone to provide the frequently occurring answer in the dataset while disregarding the image content. Such a preferred tendency has caused them to be brittle in real-world settings, harming the robustness of VQA models. We experimentally found that conventional VQA methods often confuse negative samples that with identical questions but different images, which results in the generation of linguistic bias. In this paper, we propose a simple contrastive learning scheme, namely SCLSM, to mitigate the above issues in a self-supervised manner. We construct several special negative samples and introduce a debiasing-aware contrastive learning approach to help the model learn more discriminative multimodal features, thus improving the ability of debiasing. The SCLSM is compatible with numerous VQA baselines. Experimental results on the widely-used public datasets VQA-CP v2 and VQA v2 validate the effectiveness of our proposed model.

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.