Abstract

Visual Question Answering (VQA), a task at the intersection of Computer Vision (CV) and Natural Language Processing (NLP), derives answers from the features of both questions and images. Current VQA approaches rely on combinations of convolutional and recurrent networks, which leads to a large number of parameters in the learning phase. Building on the success of pre-trained models, we integrate BERT [1] for embedding text and two models, ResNet [2] and VGG [3], for embedding images. In addition, we take advantage of fine-tuning techniques and a stacked attention mechanism to combine textual and visual features in the learning phase, which reduces the size of the model. To demonstrate our model's performance, we conduct experiments on the VizWiz VQA Challenge 2020. The experimental results show that the proposed approach outperforms existing methods on Yes/No questions of the VizWiz VQA dataset.
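
As a rough illustration of the pipeline described above, and not the authors' exact implementation, the sketch below combines BERT question features with ResNet image features through a stacked attention step. The module names, hidden dimensions, number of attention layers, and answer-vocabulary size are assumptions for illustration only.

```python
# Hypothetical sketch: BERT question encoder, ResNet image encoder,
# and stacked attention fusing textual and visual features.
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import BertModel, BertTokenizer
from torchvision.models import resnet152


class StackedAttention(nn.Module):
    """One layer of a SAN-style stacked attention mechanism.

    The question vector attends over image region features; the attended
    visual summary is added back to the question vector as a refined query.
    """
    def __init__(self, q_dim, v_dim, hidden_dim=512):
        super().__init__()
        self.q_proj = nn.Linear(q_dim, hidden_dim)
        self.v_proj = nn.Linear(v_dim, hidden_dim)
        self.attn = nn.Linear(hidden_dim, 1)
        self.v_to_q = nn.Linear(v_dim, q_dim)

    def forward(self, q, v):
        # q: (batch, q_dim), v: (batch, regions, v_dim)
        h = torch.tanh(self.v_proj(v) + self.q_proj(q).unsqueeze(1))
        weights = F.softmax(self.attn(h), dim=1)      # (batch, regions, 1)
        attended = (weights * v).sum(dim=1)           # (batch, v_dim)
        return q + self.v_to_q(attended)              # refined query


class VQAModel(nn.Module):
    def __init__(self, num_answers, num_stacks=2):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        cnn = resnet152(weights="IMAGENET1K_V1")
        # Drop the average-pool and fc layers to keep a 7x7 grid of regions.
        self.cnn = nn.Sequential(*list(cnn.children())[:-2])
        self.attn_layers = nn.ModuleList(
            [StackedAttention(q_dim=768, v_dim=2048) for _ in range(num_stacks)]
        )
        self.classifier = nn.Linear(768, num_answers)

    def forward(self, input_ids, attention_mask, images):
        # Question embedding: the [CLS] token representation from BERT.
        q = self.bert(input_ids=input_ids,
                      attention_mask=attention_mask).last_hidden_state[:, 0]
        # Image embedding: 2048-d features over a 7x7 spatial grid.
        v = self.cnn(images)                          # (batch, 2048, 7, 7)
        v = v.flatten(2).transpose(1, 2)              # (batch, 49, 2048)
        for layer in self.attn_layers:
            q = layer(q, v)
        return self.classifier(q)


# Example usage with a hypothetical answer vocabulary of 3000 classes.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = VQAModel(num_answers=3000)
batch = tokenizer(["Is the light on?"], return_tensors="pt")
logits = model(batch["input_ids"], batch["attention_mask"],
               torch.randn(1, 3, 224, 224))
```

In this reading of the abstract, the pre-trained BERT and CNN backbones supply the embeddings, and only the lightweight attention and classifier layers (plus optional fine-tuning of the backbones) are trained, which is what keeps the learned parameter count small.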
