Abstract

Visual Question Answering (VQA) is a fast-developing field spanning multiple disciplines, and it is continually being challenged with more complex tasks. The classic CNN+LSTM combination can effectively extract image and language representations for the VQA task, but several problems remain, such as the handling of excessively long sequences. In recent years, the BERT model, with its strong learning ability, has expanded rapidly from natural language processing to the broader multi-modal field. In this paper, we propose a novel way to apply the BERT model to VQA. We use descriptive paragraph generation to transform each image into a textual paragraph description, and we fuse the question and image information within the BERT model. Our model achieves excellent performance on the VQA2.0 dataset, with an overall accuracy 5% higher than previous models.
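A minimal sketch (not the authors' released code) of the pipeline the abstract describes: an image is first converted into a paragraph description by a captioning model (stubbed out below as a hypothetical `describe_image` helper), and the question and paragraph are then jointly encoded by BERT and classified over a fixed answer vocabulary. The model names, the classifier head, and the answer-vocabulary size are all assumptions for illustration.

```python
import torch
import torch.nn as nn
from transformers import BertTokenizer, BertModel

NUM_ANSWERS = 3129  # a commonly used answer-vocabulary size for VQA2.0 (assumption)

class ParagraphBertVQA(nn.Module):
    """BERT encodes the (question, image paragraph) pair; a linear head
    predicts the answer. This head is an assumed design, not the paper's."""
    def __init__(self):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.classifier = nn.Linear(self.bert.config.hidden_size, NUM_ANSWERS)

    def forward(self, input_ids, attention_mask, token_type_ids):
        out = self.bert(input_ids=input_ids,
                        attention_mask=attention_mask,
                        token_type_ids=token_type_ids)
        return self.classifier(out.pooler_output)  # logits over answer vocabulary

def describe_image(image_path: str) -> str:
    """Hypothetical placeholder for a descriptive-paragraph generation model;
    the paper converts each image into a multi-sentence text description."""
    return ("A man in a red jacket stands on a snowy slope. "
            "He is holding a pair of skis and smiling at the camera.")

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
question = "What is the man holding?"
paragraph = describe_image("example.jpg")

# Question and image description are fused as BERT's two-segment input:
# [CLS] question [SEP] paragraph [SEP]
enc = tokenizer(question, paragraph, return_tensors="pt",
                truncation=True, max_length=256)
model = ParagraphBertVQA()
logits = model(enc["input_ids"], enc["attention_mask"], enc["token_type_ids"])
answer_id = logits.argmax(dim=-1)  # index of the predicted answer
```

Packing the question and the generated paragraph into BERT's standard sentence-pair format lets the model's existing cross-segment attention integrate the two modalities without any architectural changes to BERT itself.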
