Abstract

Visual impairment community, especially blind people have a thirst for assistance from advanced technologies for understanding and answering the image. Through the development and intersection between vision and language, Visual Question Answering (VQA) is to predict an answer from a textual question on an image. It is essential and ideal to help blind people with capturing the image and answering their questions automatically. Traditional approaches often utilize the strength of convolution and recurrent networks, which requires a great effort for learning and optimizing. A key challenge in VQA is finding an effective way to extract and combine textual and visual features. To take advantage of previous knowledge in different domains, we propose BERT-RG, the delicate integration of pre-trained models into feature extractors, which relies on the interaction between residual and global features in the image and linguistic features in the question. Moreover, our architecture integrates a stacked attention mechanism that exploits the relationship between textual and visual objects. Specifically, the partial regions of images interact with partial keywords in question to enhance the text-vision representation. Besides, we also propose a novel perspective by considering a specific question type in VQA. Our proposal is significantly meaningful enough to develop a specialized system instead of putting forth the effort to dig for unlimited and unrealistic approaches. Experiments on VizWiz-VQA, a practical benchmark dataset, show that our proposed model outperforms existing models on the VizWiz VQA dataset in the Yes/No question type.

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.