Abstract

Visual Question Answering (VQA) aims to evaluate the reasoning abilities of an intelligent agent using visual and textual information. However, recent research indicates that many VQA models rely primarily on learning the correlation between questions and answers in the training dataset rather than demonstrating actual reasoning ability. To address this limitation, we propose a novel training approach called Enhancing Robust VQA via Contrastive and Self-supervised Learning (CSL-VQA) to construct a more robust VQA model. Our approach involves generating two types of negative samples to balance the biased data, using self-supervised auxiliary tasks to help the base VQA model overcome language priors, and filtering out biased training samples. In addition, we construct positive samples by removing spurious correlations in biased samples and perform auxiliary training through contrastive learning. Our approach does not require additional annotations and is compatible with different VQA backbones. Experimental results demonstrate that CSL-VQA significantly outperforms current state-of-the-art approaches, achieving an accuracy of 62.30% on the VQA-CP v2 dataset, while maintaining robust performance on the in-distribution VQA v2 dataset. Moreover, our method shows superior generalization capabilities on challenging datasets such as GQA-OOD and VQA-CE, proving its effectiveness in reducing language bias and enhancing the overall robustness of VQA models.

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.