Abstract

Visual Question Answering (VQA), a task at the intersection of Computer Vision (CV) and Natural Language Processing (NLP), derives answers from the features of both questions and images. Current VQA approaches rely on combinations of convolutional and recurrent networks, which leads to a large number of parameters in the learning phase. Building on the success of pre-trained models, we integrate BERT [1] for embedding text and two models, ResNet [2] and VGG [3], for embedding images. In addition, we take advantage of fine-tuning techniques and a stacked attention mechanism to combine textual and visual features in the learning phase, which reduces the size of the model. To demonstrate our model's performance, we conduct experiments on the VizWiz VQA Challenge 2020. The experimental results show that the proposed approach outperforms existing methods on Yes/No questions of the VizWiz VQA dataset.
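
As a rough illustration of the pipeline described above, and not the authors' exact implementation, the sketch below combines BERT question features with ResNet image features through a stacked attention step. The module names, hidden dimensions, number of attention layers, and answer-vocabulary size are assumptions for illustration only.

```python
# Hypothetical sketch: BERT question encoder, ResNet image encoder,
# and stacked attention fusing textual and visual features.
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import BertModel, BertTokenizer
from torchvision.models import resnet152


class StackedAttention(nn.Module):
    """One layer of a SAN-style stacked attention mechanism.

    The question vector attends over image region features; the attended
    visual summary is added back to the question vector as a refined query.
    """
    def __init__(self, q_dim, v_dim, hidden_dim=512):
        super().__init__()
        self.q_proj = nn.Linear(q_dim, hidden_dim)
        self.v_proj = nn.Linear(v_dim, hidden_dim)
        self.attn = nn.Linear(hidden_dim, 1)
        self.v_to_q = nn.Linear(v_dim, q_dim)

    def forward(self, q, v):
        # q: (batch, q_dim), v: (batch, regions, v_dim)
        h = torch.tanh(self.v_proj(v) + self.q_proj(q).unsqueeze(1))
        weights = F.softmax(self.attn(h), dim=1)      # (batch, regions, 1)
        attended = (weights * v).sum(dim=1)           # (batch, v_dim)
        return q + self.v_to_q(attended)              # refined query


class VQAModel(nn.Module):
    def __init__(self, num_answers, num_stacks=2):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        cnn = resnet152(weights="IMAGENET1K_V1")
        # Drop the average-pool and fc layers to keep a 7x7 grid of regions.
        self.cnn = nn.Sequential(*list(cnn.children())[:-2])
        self.attn_layers = nn.ModuleList(
            [StackedAttention(q_dim=768, v_dim=2048) for _ in range(num_stacks)]
        )
        self.classifier = nn.Linear(768, num_answers)

    def forward(self, input_ids, attention_mask, images):
        # Question embedding: the [CLS] token representation from BERT.
        q = self.bert(input_ids=input_ids,
                      attention_mask=attention_mask).last_hidden_state[:, 0]
        # Image embedding: 2048-d features over a 7x7 spatial grid.
        v = self.cnn(images)                          # (batch, 2048, 7, 7)
        v = v.flatten(2).transpose(1, 2)              # (batch, 49, 2048)
        for layer in self.attn_layers:
            q = layer(q, v)
        return self.classifier(q)


# Example usage with a hypothetical answer vocabulary of 3000 classes.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = VQAModel(num_answers=3000)
batch = tokenizer(["Is the light on?"], return_tensors="pt")
logits = model(batch["input_ids"], batch["attention_mask"],
               torch.randn(1, 3, 224, 224))
```

In this reading of the abstract, the pre-trained BERT and CNN backbones supply the embeddings, and only the lightweight attention and classifier layers (plus optional fine-tuning of the backbones) are trained, which is what keeps the learned parameter count small.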
