Abstract

Visual Question Answering (VQA) is a challenging task that requires understanding visual images and natural language questions simultaneously. In the open-ended VQA task, most previous solutions focus on understanding the question and image contents, as well as their correlations. However, they mostly reason about answers in a one-stage manner, so the semantics of the generated answers are largely ignored. In this paper, we propose a novel approach, termed Cascaded-Answering Model~(CAM), which extends the conventional one-stage VQA model into a two-stage model. The proposed model can thus fully exploit the semantics embedded in the predicted answers. Specifically, our CAM is composed of two cascaded answering modules: a Candidate Answer Generation~(CAG) module and a Final Answer Prediction~(FAP) module. In the CAG module, we select multiple relevant candidates from the answers generated by a typical VQA approach with Co-Attention. In the FAP module, we integrate the question and image information with the semantics of the selected candidate answers to predict the final answer. Experimental results demonstrate that our proposed model produces high-quality candidate answers and achieves state-of-the-art performance on three large benchmark datasets: VQA-1.0, VQA-2.0, and VQA-CP v2.
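
The following is a minimal sketch of the two-stage cascade described in the abstract, written in PyTorch. The module names, dimensions, the mean-pooling of candidate embeddings, and the top-k selection are illustrative assumptions for exposition, not the authors' implementation.

```python
# Hypothetical sketch of the CAG -> FAP cascade (assumed design, not the paper's code).
import torch
import torch.nn as nn


class CandidateAnswerGeneration(nn.Module):
    """Stage 1 (CAG): a conventional co-attention VQA head that scores the
    answer vocabulary and keeps the top-k answers as candidates."""

    def __init__(self, fused_dim, num_answers, k=5):
        super().__init__()
        self.classifier = nn.Linear(fused_dim, num_answers)
        self.k = k

    def forward(self, fused_qv):
        # fused_qv: joint question-image feature from a co-attention encoder (assumed given)
        scores = self.classifier(fused_qv)                    # (B, num_answers)
        topk_scores, topk_ids = scores.topk(self.k, dim=-1)   # (B, k)
        return topk_ids, topk_scores


class FinalAnswerPrediction(nn.Module):
    """Stage 2 (FAP): embed the candidate answers and fuse their semantics
    with the question-image feature before re-scoring."""

    def __init__(self, fused_dim, num_answers, ans_dim=300):
        super().__init__()
        self.answer_embed = nn.Embedding(num_answers, ans_dim)
        self.fuse = nn.Linear(fused_dim + ans_dim, fused_dim)
        self.classifier = nn.Linear(fused_dim, num_answers)

    def forward(self, fused_qv, candidate_ids):
        # Pool the candidate-answer embeddings into one semantic vector (assumption).
        cand_sem = self.answer_embed(candidate_ids).mean(dim=1)            # (B, ans_dim)
        joint = torch.relu(self.fuse(torch.cat([fused_qv, cand_sem], -1)))  # (B, fused_dim)
        return self.classifier(joint)                                       # (B, num_answers)


if __name__ == "__main__":
    B, fused_dim, num_answers = 2, 512, 3000
    fused_qv = torch.randn(B, fused_dim)  # stands in for the co-attention output
    cag = CandidateAnswerGeneration(fused_dim, num_answers)
    fap = FinalAnswerPrediction(fused_dim, num_answers)
    cand_ids, _ = cag(fused_qv)
    final_logits = fap(fused_qv, cand_ids)
    print(final_logits.shape)  # torch.Size([2, 3000])
```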
