Abstract

Visual Question Answering (VQA) is the comprehensive task of answering questions about the visual content of an image. However, a number of studies have pointed out that VQA models rely heavily on superficial correlations between questions and answers, predicting answers from textual statistical regularities without truly understanding the visual content. To address this issue, we propose an answer re-ranking VQA model, called RankVQA, in which the role of the input image is re-examined to select the most relevant answer from a set of candidate answers generated by a typical VQA model. Specifically, we rank the candidate answers according to their relevance to the visual content of the input image and to question-related image captions, respectively. Extensive experiments on two datasets, VQA v2 and VQA-CP v2, demonstrate the effectiveness of the proposed model, which achieves state-of-the-art performance on both.
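To make the re-ranking idea concrete, the sketch below shows one way candidate answers could be re-scored by their relevance to the image and to question-related captions. This is only an illustrative sketch, not the paper's actual architecture: the function name `rerank_answers`, the cosine-similarity relevance scores, and the weighting parameter `alpha` are assumptions introduced here for exposition.

```python
import torch
import torch.nn.functional as F

def rerank_answers(image_features, caption_features, answer_features,
                   base_scores, alpha=0.5):
    """Illustrative re-ranking of candidate answers (hypothetical sketch).

    image_features:   (D,)   pooled visual features of the input image
    caption_features: (D,)   pooled features of question-related captions
    answer_features:  (K, D) embeddings of the K candidate answers
    base_scores:      (K,)   confidence scores from the base VQA model
    """
    # Relevance of each candidate answer to the visual content
    visual_rel = F.cosine_similarity(answer_features,
                                     image_features.unsqueeze(0), dim=-1)
    # Relevance of each candidate answer to the question-related captions
    caption_rel = F.cosine_similarity(answer_features,
                                      caption_features.unsqueeze(0), dim=-1)
    # Combine both relevance signals with the base model's scores (assumed
    # additive fusion; the paper's actual fusion may differ)
    rerank_scores = base_scores + alpha * (visual_rel + caption_rel)
    # Return candidate indices ordered from most to least relevant
    return rerank_scores.argsort(descending=True)
```

In this sketch, the answer whose embedding best agrees with both the image and the captions is promoted even if the base VQA model ranked it lower, which captures the intuition of re-examining the visual evidence before committing to an answer.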
