Abstract

Visual question answering (VQA) spans many types of images and textual questions, which makes inferring the correct answer a challenging task. Traditional methods rely on relevant cross-modal objects and seldom exploit the cooperation between visual appearance and textual understanding. Here we present LRBNet, a model-based approach that treats VQA as a division-of-labor strategy. Using a dictionary and pre-trained GloVe vectors, we embed region captions and questions with an LSTM. We then model image region captions and region features as graphs and feed them into two GNN-based networks to capture the semantic and visual relations. Finally, we modulate the vertex features of the multimodal graphs, and the question embedding and vertex features are fed into a multi-level answer predictor to produce the results. Our experiments show that LRBNet is an effective framework for visual–textual understanding and a more challenging route to better VQA, since success requires genuinely understanding the image. Our study provides complementary prediction through hierarchical representation built on the interactive understanding of the textual sequence and the image, and the experimental results show that LRBNet outperforms other leading models in most cases.
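
To make the described pipeline concrete, the sketch below illustrates one plausible reading of the architecture: LSTM encoders for the question and region captions, two simple GNN branches over a semantic (caption) graph and a visual (region) graph, question-conditioned modulation of vertex features, and an answer classifier. This is a minimal illustration, not the authors' implementation; all module names, dimensions, the single-example input shapes, and the specific message-passing and gating scheme are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleGNNLayer(nn.Module):
    """One round of message passing: aggregate neighbour features through a
    row-normalised adjacency matrix, then apply a linear transform. (Assumed
    stand-in for the paper's GNN-based relation modules.)"""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, adj):
        # x: (R, dim) vertex features, adj: (R, R) row-normalised adjacency
        return F.relu(self.proj(adj @ x))


class LRBNetSketch(nn.Module):
    """Illustrative LRBNet-style pipeline (hypothetical structure)."""
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=512,
                 visual_dim=2048, num_answers=3000):
        super().__init__()
        # Word embeddings; in practice these would be initialised from GloVe
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        self.question_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.caption_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

        # Two GNN branches: semantic (captions) and visual (region features)
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)
        self.semantic_gnn = SimpleGNNLayer(hidden_dim)
        self.visual_gnn = SimpleGNNLayer(hidden_dim)

        # Question-conditioned modulation of vertex features
        self.modulator = nn.Linear(hidden_dim, hidden_dim)

        # Answer predictor over the fused question/semantic/visual features
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim * 3, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_answers),
        )

    def forward(self, question_ids, caption_ids, region_feats,
                caption_adj, region_adj):
        # question_ids: (1, Lq); caption_ids: (R, Lc); region_feats: (R, visual_dim)
        # caption_adj / region_adj: (R, R) row-normalised adjacency matrices
        _, (q_h, _) = self.question_lstm(self.word_embed(question_ids))
        q = q_h[-1].squeeze(0)                       # (hidden_dim,)

        _, (c_h, _) = self.caption_lstm(self.word_embed(caption_ids))
        captions = c_h[-1]                           # (R, hidden_dim)
        regions = F.relu(self.visual_proj(region_feats))

        # Message passing captures semantic and visual relations
        sem = self.semantic_gnn(captions, caption_adj)
        vis = self.visual_gnn(regions, region_adj)

        # Modulate vertex features with the question embedding, then pool
        gate = torch.sigmoid(self.modulator(q))      # (hidden_dim,)
        sem = (gate * sem).mean(dim=0)
        vis = (gate * vis).mean(dim=0)

        # Fuse question and graph summaries for answer prediction
        return self.classifier(torch.cat([q, sem, vis], dim=-1))
```

A multi-level predictor as described in the abstract would likely attach classifiers at several stages of this pipeline rather than only at the end; the single classifier here is kept for brevity.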
