Abstract

Visual question answering (VQA) takes an image and a related natural language question as input and produces an answer as output. A successful VQA algorithm requires two key components: obtaining a structured representation of the image and processing the natural language question over that representation. While traditional VQA tasks operate on raw images or image segmentations, recent VQA datasets such as CLEVR and GQA provide scene graphs that capture the objects in an image and the relationships among them. However, even when the ground-truth scene graph is given, producing the right answer to a natural language question is non-trivial, since it requires a sophisticated algorithm that processes the scene graph and the question together. We propose to encode the scene graph and the question using a Graph Network (GN), and then feed the encoded graph, together with the question, to the Memory, Attention, and Composition (MAC) model to classify the answer. By including the question as a global vector in the GN, we achieve an accuracy of 96.3% on GQA, surpassing the 83.5% of the baseline method reported by the authors of GQA, which also used MAC to classify the answer. Our work suggests that context-based encoding of the scene graph is crucial for graph-based reasoning tasks such as question answering over graphs.
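To make the question-as-global-vector conditioning concrete, below is a minimal sketch of one question-conditioned Graph Network block in PyTorch. It is an illustration under stated assumptions, not the paper's implementation: the class and tensor names (GNBlock, senders, receivers), the MLP sizes, and the sum aggregation are all hypothetical choices, and the interface to the downstream MAC model is omitted.

```python
import torch
import torch.nn as nn

class GNBlock(nn.Module):
    """One Graph Network block with a global conditioning vector.

    Follows the general GN scheme (edge update, then node update), with
    the question embedding injected as the global vector in both updates.
    All hyperparameters here are illustrative assumptions.
    """

    def __init__(self, node_dim, edge_dim, global_dim, hidden_dim=128):
        super().__init__()
        # Edge update: conditioned on the two endpoint nodes, the edge
        # itself, and the global (question) vector.
        self.edge_mlp = nn.Sequential(
            nn.Linear(2 * node_dim + edge_dim + global_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, edge_dim),
        )
        # Node update: conditioned on the aggregated incoming edge
        # messages, the node itself, and the global (question) vector.
        self.node_mlp = nn.Sequential(
            nn.Linear(node_dim + edge_dim + global_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, node_dim),
        )

    def forward(self, nodes, edges, senders, receivers, question):
        # nodes: (N, node_dim), edges: (E, edge_dim)
        # senders, receivers: (E,) long tensors of node indices
        # question: (global_dim,) embedding of the question
        E = edges.size(0)
        q_e = question.unsqueeze(0).expand(E, -1)
        edge_in = torch.cat(
            [nodes[senders], nodes[receivers], edges, q_e], dim=-1
        )
        edges = self.edge_mlp(edge_in)

        # Aggregate updated edge messages at their receiver nodes
        # (sum aggregation, one common GN choice).
        agg = torch.zeros(
            nodes.size(0), edges.size(-1),
            dtype=edges.dtype, device=edges.device,
        )
        agg.index_add_(0, receivers, edges)

        q_n = question.unsqueeze(0).expand(nodes.size(0), -1)
        nodes = self.node_mlp(torch.cat([nodes, agg, q_n], dim=-1))
        return nodes, edges

# Toy scene graph: 4 objects, 3 relations, random features.
nodes = torch.randn(4, 64)
edges = torch.randn(3, 32)
senders = torch.tensor([0, 1, 2])
receivers = torch.tensor([1, 2, 3])
question = torch.randn(96)

block = GNBlock(node_dim=64, edge_dim=32, global_dim=96)
new_nodes, new_edges = block(nodes, edges, senders, receivers, question)
```

In a setup like the one the abstract describes, the question-conditioned node representations produced by such a block would serve as the knowledge base that MAC attends over when classifying the answer.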
