Abstract

In the visual question answering (VQA) task, it is vital to learn the semantic interactions between the question and the target objects in the input image. Existing scene-graph-based methods generally extract global features from the image and then fuse them with the question representation. However, the scene graph constructed by these methods captures only abstract semantic features of the image and does not consider the influence of positional words and semantic information in the question. In this paper, we propose a Question-aware Dynamic Scene Graph (QDSG) method. First, we build an initial scene graph from the local attribute features of the image targets. Then, starting from this initial graph, we construct a dynamic scene graph that adapts to different questions, using a word-level co-attention mechanism to refine node features and edge features. Finally, we perform iterative reasoning on the refined scene graph and predict the correct answer with a graph attention network. The proposed method learns semantic local features to generate an interactive scene graph between the image and the question, which benefits logical reasoning through adaptive graph refinement. It outperforms state-of-the-art models on the GQA dataset and on its semantic- and structural-type subsets.
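To make the two core operations named above concrete, here is a minimal NumPy sketch, not the authors' implementation: a word-level co-attention step that refines scene-graph node features with question context, followed by one round of attention-based message passing over the graph. All function names, dimensions, and the residual-update choice are illustrative assumptions.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def co_attention_refine(node_feats, word_feats):
    """Refine node features with word-level co-attention (illustrative).

    node_feats: (N, d) scene-graph node features
    word_feats: (T, d) question word embeddings
    """
    affinity = node_feats @ word_feats.T           # (N, T) node-word scores
    attn = softmax(affinity, axis=1)               # each node attends over words
    question_ctx = attn @ word_feats               # (N, d) question context per node
    return node_feats + question_ctx               # residual refinement (assumed)


def graph_attention_step(node_feats, adj):
    """One attention-weighted message-passing round over the scene graph.

    adj: (N, N) binary adjacency matrix (1 = edge, incl. self-loops)
    """
    scores = node_feats @ node_feats.T             # pairwise compatibility
    scores = np.where(adj > 0, scores, -1e9)       # mask out non-edges
    attn = softmax(scores, axis=1)                 # normalize over neighbors
    return attn @ node_feats                       # aggregate neighbor features


# Toy example: 5 nodes, 7 question words, feature dim 8, fully connected graph.
rng = np.random.default_rng(0)
nodes = rng.normal(size=(5, 8))
words = rng.normal(size=(7, 8))
adj = np.ones((5, 5))

refined = co_attention_refine(nodes, words)
updated = graph_attention_step(refined, adj)
```

In the paper's method this refinement also applies to edge features and the message passing is iterated; the sketch shows only a single node-level round of each step.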
