The Visual Question Answering (VQA) task is an important research direction in the field of artificial intelligence, which requires a model that can simultaneously understand visual images and natural language questions, and answer questions related to images. Recent studies have shown that many Visual Question Answering models rely on statistically regular correlations between questions and answers, which in turn weakens the correlation between visual content and textual information. In this work, we propose an unbiased Visual Question Answering method to solve language priors from the perspective of strengthening the contrast between the correct answer and the positive and negative predictions. We design a new model consisting of two modules with different roles. We input the image and the question corresponding to it into the Answer Visual Attention Modules to generate positive prediction output, and then use a Dual Channels Joint Module to generate negative prediction output with great linguistic prior knowledge. Finally, we input the positive and negative predictions together with the correct answer to our newly designed loss function for training. Our method achieves high performance (61.24%) on the VQA-CP v2 dataset. In addition, most existing debiasing methods improve performance on VQA-CP v2 dataset at the cost of reducing performance on VQA v2 dataset, while our method not only does not reduce the accuracy on VQA v2 dataset. Instead, it improves performance on both datasets mentioned above.
Read full abstract