Abstract

Visual question answering (VQA) has received increasing attention as a multimodal task whose goal is to answer natural-language questions by reasoning about a given image. Current mainstream state-of-the-art VQA models often exploit language bias in the training data to predict answers, so their predictions frequently lack grounding in the image. Based on the attention mechanism, this paper designs a method that uses image descriptions to assist the training of a visual question-answering model. The model is question-guided and encodes images and image descriptions with a co-attention mechanism, which enhances its feature representations and its ability to learn image information, making it more robust and generalizable. Experimental results show that the accuracy of this method on the VQA v2 dataset improves by 3.89% over the baseline, demonstrating that the method is effective.
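The abstract does not give implementation details; the sketch below is only an illustration, under assumed choices, of how question-guided co-attention over image-region features and caption (image-description) features might be wired. All module and variable names (e.g., QuestionGuidedCoAttention, feature dimension d) are hypothetical and not taken from the paper.

```python
# Minimal sketch (not the authors' code) of question-guided co-attention over
# pre-extracted image-region features and caption-token features.
import torch
import torch.nn as nn

class QuestionGuidedCoAttention(nn.Module):
    def __init__(self, d=512, heads=8):
        super().__init__()
        # The question attends separately to image regions and to caption tokens.
        self.img_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.cap_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.fuse = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU())

    def forward(self, q, img, cap):
        # q:   (B, Lq, d) encoded question tokens
        # img: (B, R,  d) image-region features
        # cap: (B, Lc, d) encoded caption (image-description) tokens
        img_ctx, _ = self.img_attn(q, img, img)   # question-attended image features
        cap_ctx, _ = self.cap_attn(q, cap, cap)   # question-attended caption features
        fused = self.fuse(torch.cat([img_ctx, cap_ctx], dim=-1))
        return fused.mean(dim=1)                  # pooled multimodal representation

# Example usage with random features
B, Lq, R, Lc, d = 2, 14, 36, 20, 512
model = QuestionGuidedCoAttention(d)
out = model(torch.randn(B, Lq, d), torch.randn(B, R, d), torch.randn(B, Lc, d))
print(out.shape)  # torch.Size([2, 512])
```

A classifier over the pooled representation would then predict the answer; the point of the caption stream is to give the question an additional, language-level view of the image content.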
