Abstract

Visual question answering (VQA) spans many types of images and textual questions, which makes inferring the correct answer a challenging task. Traditional methods rely on relevant cross-modal objects and seldom exploit the cooperation between visual appearance and textual understanding. Here we present LRBNet, a model-based approach that treats VQA as a division-of-labor strategy. Using a dictionary and pre-trained GloVe vectors, we embed region captions and questions with an LSTM. We then model image region captions and region features as graphs and feed them into two GNN-based networks to capture the semantic and visual relations. Finally, we modulate the vertex features of the multimodal graphs, and the question embedding and vertex features are fed into a multi-level answer predictor to produce the results. Our experiments show that LRBNet is an effective framework for visual–textual understanding and a more challenging route to better VQA, since success requires genuinely understanding the image. Our study provides complementary prediction through hierarchical representation built on the interactive understanding of the textual sequence and the image, and the experimental results show that LRBNet outperforms other leading models in most cases.
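
To make the described pipeline concrete, the sketch below illustrates one plausible reading of the architecture: LSTM encoders for the question and region captions, two simple GNN branches over a semantic (caption) graph and a visual (region) graph, question-conditioned modulation of vertex features, and an answer classifier. This is a minimal illustration, not the authors' implementation; all module names, dimensions, the single-example input shapes, and the specific message-passing and gating scheme are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleGNNLayer(nn.Module):
    """One round of message passing: aggregate neighbour features through a
    row-normalised adjacency matrix, then apply a linear transform. (Assumed
    stand-in for the paper's GNN-based relation modules.)"""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, adj):
        # x: (R, dim) vertex features, adj: (R, R) row-normalised adjacency
        return F.relu(self.proj(adj @ x))


class LRBNetSketch(nn.Module):
    """Illustrative LRBNet-style pipeline (hypothetical structure)."""
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=512,
                 visual_dim=2048, num_answers=3000):
        super().__init__()
        # Word embeddings; in practice these would be initialised from GloVe
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        self.question_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.caption_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

        # Two GNN branches: semantic (captions) and visual (region features)
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)
        self.semantic_gnn = SimpleGNNLayer(hidden_dim)
        self.visual_gnn = SimpleGNNLayer(hidden_dim)

        # Question-conditioned modulation of vertex features
        self.modulator = nn.Linear(hidden_dim, hidden_dim)

        # Answer predictor over the fused question/semantic/visual features
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim * 3, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_answers),
        )

    def forward(self, question_ids, caption_ids, region_feats,
                caption_adj, region_adj):
        # question_ids: (1, Lq); caption_ids: (R, Lc); region_feats: (R, visual_dim)
        # caption_adj / region_adj: (R, R) row-normalised adjacency matrices
        _, (q_h, _) = self.question_lstm(self.word_embed(question_ids))
        q = q_h[-1].squeeze(0)                       # (hidden_dim,)

        _, (c_h, _) = self.caption_lstm(self.word_embed(caption_ids))
        captions = c_h[-1]                           # (R, hidden_dim)
        regions = F.relu(self.visual_proj(region_feats))

        # Message passing captures semantic and visual relations
        sem = self.semantic_gnn(captions, caption_adj)
        vis = self.visual_gnn(regions, region_adj)

        # Modulate vertex features with the question embedding, then pool
        gate = torch.sigmoid(self.modulator(q))      # (hidden_dim,)
        sem = (gate * sem).mean(dim=0)
        vis = (gate * vis).mean(dim=0)

        # Fuse question and graph summaries for answer prediction
        return self.classifier(torch.cat([q, sem, vis], dim=-1))
```

A multi-level predictor as described in the abstract would likely attach classifiers at several stages of this pipeline rather than only at the end; the single classifier here is kept for brevity.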
