Abstract

We introduce a new neural network architecture, Multimodal Neural Graph Memory Networks (MN-GMN), for visual question answering. Our approach uses a graph structure, with features from different image regions as node attributes, and applies a recently proposed, powerful graph neural network model, the Graph Network (GN), to reason about objects and their interactions in the scene context. The input module of the MN-GMN generates a set of visual features plus a set of region-grounded captions (RGCs) for the image. The RGCs capture object attributes and their relationships. Two GNs are constructed from the input module, one over the visual features and one over the RGCs. Each node of the GNs iteratively computes a question-guided, contextualized representation of the visual/textual information assigned to it. To combine the information from both GNs, each node writes its updated representation to an external spatial memory. The final states of the memory cells are fed into an answer module to predict an answer. Experiments show that MN-GMN rivals the state-of-the-art models on the Visual7W, VQA-v2.0, and CLEVR datasets.
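To make the described pipeline concrete, here is a minimal sketch of the data flow in PyTorch. The class and parameter names (MNGMNSketch, memory_write, a 49-cell memory grid), the simplified question-conditioned node update, and the GRU-style memory write are illustrative assumptions for exposition, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MNGMNSketch(nn.Module):
    """High-level data flow only: two graphs (visual features, RGC embeddings),
    node states written into a grid of spatial memory cells, then an answer head."""

    def __init__(self, dim=512, n_answers=3000, grid_cells=49):
        super().__init__()
        # Stand-ins for the per-graph node updaters; the paper uses Graph Networks here.
        self.visual_gn = nn.Linear(2 * dim, dim)
        self.caption_gn = nn.Linear(2 * dim, dim)
        self.memory_write = nn.GRUCell(dim, dim)   # external spatial memory cells
        self.answer_head = nn.Linear(dim, n_answers)
        self.grid_cells = grid_cells

    def forward(self, vis_nodes, rgc_nodes, question, vis_cell_ids, rgc_cell_ids):
        # vis_nodes: (Nv, dim) region features; rgc_nodes: (Nt, dim) caption embeddings;
        # question: (dim,); *_cell_ids: index of the memory cell covering each node's region.
        memory = torch.zeros(self.grid_cells, vis_nodes.size(1))
        q = question.unsqueeze(0)
        # Question-guided node updates (stand-in for full GN message passing).
        vis = torch.relu(self.visual_gn(torch.cat([vis_nodes, q.expand_as(vis_nodes)], -1)))
        txt = torch.relu(self.caption_gn(torch.cat([rgc_nodes, q.expand_as(rgc_nodes)], -1)))
        # Each node writes its updated state into the memory cell covering its region.
        for states, cells in ((vis, vis_cell_ids), (txt, rgc_cell_ids)):
            for h, c in zip(states, cells):
                memory[c] = self.memory_write(h.unsqueeze(0), memory[c].unsqueeze(0))[0]
        # The final memory states feed the answer module.
        return self.answer_head(memory.mean(dim=0))
```

Routing both graphs through a shared spatial memory is what lets the visual and textual streams be combined by location before the answer module reads the result.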

Highlights

  • Visual question answering (VQA) has recently been introduced as a grand challenge for AI

  • This paper proposes a new neural network architecture for VQA based on the recently proposed Graph Network (GN) (Battaglia et al., 2018)

  • We introduce a new memory network architecture, based on graph neural networks, which can reason about complex arrangements of objects in a scene to answer visual questions


Summary

Introduction

Visual question answering (VQA) has recently been introduced as a grand challenge for AI. Answering questions about objects and their interactions in the scene context requires modeling the pairwise interactions between various regions of an image, as well as spatial context in both horizontal and vertical directions. Our new architecture (see Figure 2), the Multimodal Neural Graph Memory Network (MN-GMN), uses a graph structure to represent pairwise interactions between visual/textual features (nodes) from different regions of an image. GNs provide a context-aware neural mechanism for computing a feature for each node that captures its complex interactions with other nodes. This enables the MN-GMN to answer questions that require reasoning about complex arrangements of objects in a scene.
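As an illustration of that context-aware mechanism, the sketch below shows one round of question-guided message passing over a fully connected graph of region nodes, roughly in the spirit of a Graph Network edge/node update. The layer structure, MLP sizes, and mean aggregation are assumptions made for this example, not the exact formulation in the paper.

```python
import torch
import torch.nn as nn

class QuestionGuidedGNLayer(nn.Module):
    """One round of message passing over a fully connected graph of regions:
    every pair of nodes exchanges a message conditioned on the question, and
    each node aggregates its incoming messages into a context-aware feature."""

    def __init__(self, dim):
        super().__init__()
        self.edge_mlp = nn.Sequential(nn.Linear(3 * dim, dim), nn.ReLU())
        self.node_mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())

    def forward(self, nodes, question):
        # nodes: (N, dim) region features; question: (dim,) question encoding.
        n, d = nodes.shape
        q = question.view(1, 1, d).expand(n, n, d)
        receivers = nodes.view(n, 1, d).expand(n, n, d)
        senders = nodes.view(1, n, d).expand(n, n, d)
        # Edge update: message from sender j to receiver i, conditioned on the question.
        messages = self.edge_mlp(torch.cat([receivers, senders, q], dim=-1))
        # Node update: combine each node with the mean of its incoming messages.
        incoming = messages.mean(dim=1)
        return self.node_mlp(torch.cat([nodes, incoming], dim=-1))

# Example: 5 region nodes with 128-d features, updated for 3 rounds.
layer = QuestionGuidedGNLayer(128)
nodes, question = torch.randn(5, 128), torch.randn(128)
for _ in range(3):
    nodes = layer(nodes, question)
```

Repeating the update for several rounds lets information propagate beyond immediate neighbors, which is what supports reasoning over multi-object arrangements.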


