Abstract

Visual Question Answering (VQA) is a fast-developing field spanning multiple disciplines, and it is continually being challenged with more complex tasks. The classic CNN+LSTM combination can effectively extract image and language representations for the VQA task, but several problems remain, such as the handling of excessively long sequences. In recent years, the BERT model, with its strong learning ability, has expanded rapidly from natural language processing to the broader multi-modal field. In this paper, we propose a novel way to apply the BERT model to VQA. We use descriptive paragraph generation to transform each image into a textual paragraph description, and we fuse the question and image information within the BERT model. Our model achieves excellent performance on the VQA2.0 dataset, with an overall accuracy 5% higher than previous models.
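A minimal sketch (not the authors' released code) of the pipeline the abstract describes: an image is first converted into a paragraph description by a captioning model (stubbed out below as a hypothetical `describe_image` helper), and the question and paragraph are then jointly encoded by BERT and classified over a fixed answer vocabulary. The model names, the classifier head, and the answer-vocabulary size are all assumptions for illustration.

```python
import torch
import torch.nn as nn
from transformers import BertTokenizer, BertModel

NUM_ANSWERS = 3129  # a commonly used answer-vocabulary size for VQA2.0 (assumption)

class ParagraphBertVQA(nn.Module):
    """BERT encodes the (question, image paragraph) pair; a linear head
    predicts the answer. This head is an assumed design, not the paper's."""
    def __init__(self):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.classifier = nn.Linear(self.bert.config.hidden_size, NUM_ANSWERS)

    def forward(self, input_ids, attention_mask, token_type_ids):
        out = self.bert(input_ids=input_ids,
                        attention_mask=attention_mask,
                        token_type_ids=token_type_ids)
        return self.classifier(out.pooler_output)  # logits over answer vocabulary

def describe_image(image_path: str) -> str:
    """Hypothetical placeholder for a descriptive-paragraph generation model;
    the paper converts each image into a multi-sentence text description."""
    return ("A man in a red jacket stands on a snowy slope. "
            "He is holding a pair of skis and smiling at the camera.")

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
question = "What is the man holding?"
paragraph = describe_image("example.jpg")

# Question and image description are fused as BERT's two-segment input:
# [CLS] question [SEP] paragraph [SEP]
enc = tokenizer(question, paragraph, return_tensors="pt",
                truncation=True, max_length=256)
model = ParagraphBertVQA()
logits = model(enc["input_ids"], enc["attention_mask"], enc["token_type_ids"])
answer_id = logits.argmax(dim=-1)  # index of the predicted answer
```

Packing the question and the generated paragraph into BERT's standard sentence-pair format lets the model's existing cross-segment attention integrate the two modalities without any architectural changes to BERT itself.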
