Abstract

Recent observations have revealed that Visual Question Answering (VQA) models are susceptible to learning the spurious correlations formed by dataset biases, i.e., the language priors, instead of the intended solution. For instance, given a question and a related image, some VQA systems are prone to produce the answer that occurs most frequently in the dataset while disregarding the image content. This tendency makes them brittle in real-world settings and harms the robustness of VQA models. We experimentally found that conventional VQA methods often confuse negative samples that share identical questions but have different images, which results in linguistic bias. In this paper, we propose a simple contrastive learning scheme, namely SCLSM, to mitigate the above issues in a self-supervised manner. We construct several special negative samples and introduce a debiasing-aware contrastive learning approach to help the model learn more discriminative multimodal features, thus improving its debiasing ability. SCLSM is compatible with numerous VQA baselines. Experimental results on the widely used public datasets VQA-CP v2 and VQA v2 validate the effectiveness of our proposed model.
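To make the idea concrete, the sketch below shows a generic InfoNCE-style contrastive loss in which the negatives pair the same question with different images. This is an illustrative assumption, not the paper's exact SCLSM objective; the function name, embedding dimensions, and temperature value are hypothetical.

```python
import numpy as np

def info_nce_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style contrastive loss (illustrative, not the exact SCLSM objective).

    anchor, positive: (d,) fused question-image embeddings
    negatives: (k, d) embeddings of the same question fused with k different images
    """
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    # similarity of the anchor to the positive (index 0) and to each negative
    logits = np.array([cos(anchor, positive)] +
                      [cos(anchor, n) for n in negatives]) / temperature
    # softmax cross-entropy with the positive at index 0
    logits -= logits.max()
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])

rng = np.random.default_rng(0)
d, k = 8, 4
anchor = rng.normal(size=d)
positive = anchor + 0.01 * rng.normal(size=d)   # augmented view of the same (Q, I) pair
negatives = rng.normal(size=(k, d))             # same question, different images
loss = info_nce_loss(anchor, positive, negatives)
```

Minimizing this loss pulls the anchor toward the matching question-image pair and pushes it away from same-question/different-image negatives, encouraging the model to attend to image content rather than the language prior.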
