Abstract

Visual question answering (VQA) has received increasing attention as a multimodal task whose goal is to answer natural-language questions by reasoning about a given image. Current mainstream state-of-the-art VQA models often exploit language bias in the training data to predict answers, so their predictions frequently lack grounding in the image. Based on the attention mechanism, this paper designs a method that uses image descriptions to assist the training of a visual question-answering model. The model is question-guided and encodes images and image descriptions with a co-attention mechanism, which enhances its feature representations and its ability to learn image information, making it more robust and generalizable. Experimental results show that the accuracy of this method on the VQA v2 dataset improves by 3.89% over the baseline, demonstrating that the method is effective.
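The abstract does not give implementation details; the sketch below is only an illustration, under assumed choices, of how question-guided co-attention over image-region features and caption (image-description) features might be wired. All module and variable names (e.g., QuestionGuidedCoAttention, feature dimension d) are hypothetical and not taken from the paper.

```python
# Minimal sketch (not the authors' code) of question-guided co-attention over
# pre-extracted image-region features and caption-token features.
import torch
import torch.nn as nn

class QuestionGuidedCoAttention(nn.Module):
    def __init__(self, d=512, heads=8):
        super().__init__()
        # The question attends separately to image regions and to caption tokens.
        self.img_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.cap_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.fuse = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU())

    def forward(self, q, img, cap):
        # q:   (B, Lq, d) encoded question tokens
        # img: (B, R,  d) image-region features
        # cap: (B, Lc, d) encoded caption (image-description) tokens
        img_ctx, _ = self.img_attn(q, img, img)   # question-attended image features
        cap_ctx, _ = self.cap_attn(q, cap, cap)   # question-attended caption features
        fused = self.fuse(torch.cat([img_ctx, cap_ctx], dim=-1))
        return fused.mean(dim=1)                  # pooled multimodal representation

# Example usage with random features
B, Lq, R, Lc, d = 2, 14, 36, 20, 512
model = QuestionGuidedCoAttention(d)
out = model(torch.randn(B, Lq, d), torch.randn(B, R, d), torch.randn(B, Lc, d))
print(out.shape)  # torch.Size([2, 512])
```

A classifier over the pooled representation would then predict the answer; the point of the caption stream is to give the question an additional, language-level view of the image content.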
