Multi visual and textual embedding on visual question answering for blind people

Tung Le,Huy Tien Nguyen,Minh Le Nguyen

doi:10.1016/j.neucom.2021.08.117

Abstract

The visually impaired community, and blind people in particular, need assistance from advanced technologies to understand images and answer questions about them. Lying at the intersection of vision and language, Visual Question Answering (VQA) is the task of predicting an answer to a textual question about an image. It is well suited to helping blind people by capturing an image and answering their questions automatically. Traditional approaches often rely on convolutional and recurrent networks, which require great effort to train and optimize. A key challenge in VQA is finding an effective way to extract and combine textual and visual features. To take advantage of prior knowledge from different domains, we propose BERT-RG, which integrates pre-trained models into the feature extractors and relies on the interaction between residual and global features of the image and linguistic features of the question. Moreover, our architecture integrates a stacked attention mechanism that exploits the relationship between textual and visual objects. Specifically, regions of the image interact with keywords in the question to enhance the text-vision representation. We also propose a novel perspective by focusing on a specific question type in VQA. This focus makes it practical to develop a specialized system rather than pursuing unbounded and unrealistic general approaches. Experiments on VizWiz-VQA, a practical benchmark dataset, show that our proposed model outperforms existing models on the Yes/No question type.
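
The stacked attention idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' BERT-RG implementation: it follows the generic stacked-attention formulation, and it assumes a single pooled question vector (rather than per-keyword features), image region features of the same dimension as the question vector, and randomly initialized weights. All function and parameter names (`attention_hop`, `stacked_attention`, `random_layer`) are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def attention_hop(regions, query, w_v, w_q, w_p):
    """One hop: score each image region against the query, then pool."""
    q_proj = matvec(w_q, query)
    scores = []
    for region in regions:
        r_proj = matvec(w_v, region)
        hidden = [math.tanh(r + q) for r, q in zip(r_proj, q_proj)]
        scores.append(sum(w * h for w, h in zip(w_p, hidden)))
    weights = softmax(scores)  # one weight per region, summing to 1
    attended = [
        sum(p * region[j] for p, region in zip(weights, regions))
        for j in range(len(query))
    ]
    # Refine the query with the attended visual vector (additive update).
    refined = [a + q for a, q in zip(attended, query)]
    return refined, weights


def stacked_attention(regions, question, layers):
    """Apply several hops in sequence; each hop uses the refined query."""
    query = question
    all_weights = []
    for w_v, w_q, w_p in layers:
        query, weights = attention_hop(regions, query, w_v, w_q, w_p)
        all_weights.append(weights)
    return query, all_weights


def random_layer(dim, hidden, rng):
    """Assumed placeholder weights; a real model would learn these."""
    w_v = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(hidden)]
    w_q = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(hidden)]
    w_p = [rng.uniform(-1, 1) for _ in range(hidden)]
    return w_v, w_q, w_p
```

With four toy region vectors of dimension 3 and two layers built by `random_layer(3, 5, random.Random(0))`, `stacked_attention` returns a 3-dimensional joint vector and two attention distributions over the four regions. In a full system the joint vector would feed an answer classifier, which this sketch omits.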

Journal: Neurocomputing
Publication Date: Sep 3, 2021
Citations: 7

Similar Papers

Integrating Transformer into Global and Residual Image Feature Extractor in Visual Question Answering for Blind People
Tung Le ... Nguyen Le Minh
12 Nov 2020

Multimodal feature fusion by relational reasoning and attention for visual question answering
Weifeng Zhang ... Zengchang Qin
Information Fusion | VOL. 55
19 Aug 2019

Accuracy vs. complexity: A trade-off in visual question answering models
Moshiur Farazi ... Nick Barnes
Pattern Recognition | VOL. 120
12 Jun 2021

Enhancing visual question answering with a two‐way co‐attention mechanism and integrated multimodal features
Mayank Agrawal ... Anand Singh Jalal
Computational Intelligence | VOL. 40
21 Dec 2023
