Abstract

Visual Question Answering (VQA) aims to answer questions about a given image. However, current VQA models tend to rely solely on textual information from the questions while ignoring the visual information in the images, a behavior caused by biases learned during the training phase. Previous studies have shown that bias in VQA stems mainly from the text modality, and our analysis suggests that question type is a crucial factor in bias formation. To address this bias, we propose a self-supervised method comprising an Against Biased Samples (ABS) module, which performs targeted debiasing by selecting samples that are prone to bias, and a Shuffle Question Types (SQT) module, which constructs negative samples by randomly replacing the question types of the samples selected by ABS, thereby interrupting the shortcut from question type to answer. Our approach mitigates the question-to-answer bias without using external annotations, alleviating the language prior problem. Additionally, we design a new objective function for the negative samples. Experimental results show that our method outperforms both self-supervised and supervised state-of-the-art approaches, achieving 70.36% accuracy on the VQA-CP v2 dataset.
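To make the question-type shuffling concrete, the following is a minimal sketch of how a negative sample might be built by swapping a question's type prefix. The `QUESTION_TYPES` list and the `replace_question_type` helper are hypothetical illustrations under the assumption that question types are identified by their leading prefix, as in the VQA-CP v2 annotations; this is not the paper's actual implementation.

```python
import random

# Hypothetical subset of question-type prefixes; VQA-CP v2 defines
# the full set in its annotations.
QUESTION_TYPES = [
    "what color is the",
    "how many",
    "is there a",
    "what is the",
    "does the",
]

def replace_question_type(question: str, rng: random.Random) -> str:
    """Build a negative sample by swapping the question-type prefix.

    Assumes the question starts with one of the known type prefixes;
    the prefix is replaced with a randomly chosen *different* type so
    that the shortcut from question type to answer no longer holds.
    """
    for qtype in QUESTION_TYPES:
        if question.lower().startswith(qtype):
            # Choose a replacement type that differs from the original.
            candidates = [t for t in QUESTION_TYPES if t != qtype]
            new_type = rng.choice(candidates)
            return new_type + question[len(qtype):]
    return question  # left unchanged if no known prefix matches

# Usage: shuffle the type of a bias-prone question selected by ABS.
rng = random.Random(0)
print(replace_question_type("What color is the umbrella?", rng))
```

The resulting question (e.g., "how many umbrella?") is semantically mismatched with the original answer, which is what lets the training objective penalize answers inferred from the question type alone.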
