Abstract

Visual question answering (VQA) aims to predict an answer to a natural language question associated with an image. This work focuses on two important issues in VQA, a complex multimodal AI task: first, predicting answers over a large output answer space, and second, obtaining enriched representations through cross-modality interactions. This work addresses these two issues by proposing a dual attention (DA) and question categorization (QC)-based visual question answering model (DAQC-VQA). DAQC-VQA has three main network modules: first, a novel dual attention mechanism that produces an enriched cross-domain representation of the two modalities; second, a question classifier subsystem that identifies the category of the input (natural language) question and thereby reduces the answer search space; and third, a subsystem that predicts the answer conditioned on the question category. All component networks of DAQC-VQA are trained end-to-end with a joint loss function. The performance of DAQC-VQA is evaluated on two widely used VQA datasets, viz., TDIUC and VQA2.0. Experimental results demonstrate competitive performance of DAQC-VQA against recent state-of-the-art VQA models. An ablation analysis indicates that the enriched representation obtained with the proposed dual attention mechanism helps improve performance.
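To make the three-module pipeline concrete, the sketch below shows one plausible way the abstract's description could be organized in PyTorch: a dual (bidirectional) cross-modal attention block, a question categorizer, and per-category answer heads trained with a joint loss. All layer sizes, module names, the attention formulation, and the number of categories are assumptions for illustration; the paper's actual architecture may differ.

```python
import torch
import torch.nn as nn

class DAQCVQASketch(nn.Module):
    """Hypothetical skeleton of a DAQC-VQA-style model (not the authors' code)."""

    def __init__(self, img_dim=2048, ques_dim=768, hidden=512,
                 num_categories=12, answers_per_category=None):
        super().__init__()
        answers_per_category = answers_per_category or [100] * num_categories
        self.img_proj = nn.Linear(img_dim, hidden)
        self.ques_proj = nn.Linear(ques_dim, hidden)
        # Dual attention: question attends to image regions, image attends to question tokens.
        self.ques_to_img_attn = nn.MultiheadAttention(hidden, 8, batch_first=True)
        self.img_to_ques_attn = nn.MultiheadAttention(hidden, 8, batch_first=True)
        # Question categorizer: restricts the answer search space to one category.
        self.categorizer = nn.Linear(2 * hidden, num_categories)
        # One answer head per question category.
        self.answer_heads = nn.ModuleList(
            nn.Linear(2 * hidden, n) for n in answers_per_category)

    def forward(self, img_feats, ques_feats):
        v = self.img_proj(img_feats)    # (B, regions, hidden)
        q = self.ques_proj(ques_feats)  # (B, tokens, hidden)
        # Cross-modal attention in both directions ("dual attention").
        v_att, _ = self.ques_to_img_attn(q, v, v)  # question-guided image features
        q_att, _ = self.img_to_ques_attn(v, q, q)  # image-guided question features
        fused = torch.cat([v_att.mean(dim=1), q_att.mean(dim=1)], dim=-1)
        cat_logits = self.categorizer(fused)
        answer_logits = [head(fused) for head in self.answer_heads]
        return cat_logits, answer_logits
```

Under this reading, the "joint loss" would combine a cross-entropy term on the category logits with a cross-entropy term on the answer logits of the ground-truth category, so the categorizer and the answer heads are optimized together end-to-end; how the two terms are weighted is not specified in the abstract.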
