Abstract
In the visual question answering task, it is vital to learn the semantic interactions between the question and the target objects in the input image. Existing scene graph-based methods generally extract global features from the image and then fuse them with the question representation. However, the scene graphs constructed by these methods capture only abstract semantic features from the image and do not consider the influence of positional words and semantic information in the question. In this paper, we propose a Question-aware Dynamic Scene Graph (QDSG) method. First, we construct an initial scene graph based on the local attribute features of the image targets. Then, starting from this initial graph, we build a dynamic scene graph that adapts to different questions, using a word-level co-attention mechanism to refine node and edge features. Finally, iterative reasoning is performed on the refined scene graph, and the correct answer is predicted with a graph attention network. The proposed method learns semantic local features to generate an interactive scene graph between the image and the question, which benefits logical reasoning through adaptive graph refinement. The proposed method outperforms state-of-the-art models on the GQA dataset and on its semantic- and structural-type subsets.
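To make the two stages described above concrete, the following is a minimal numpy sketch of (a) word-level co-attention that refines scene-graph node features with question context, and (b) a single graph-attention message-passing step over the refined graph. This is an illustrative assumption of how such components are commonly implemented, not the authors' actual QDSG code; the function names (`refine_nodes`, `gat_step`) and the additive refinement are hypothetical, and learned projection weights are omitted for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def refine_nodes(node_feats, word_embs):
    """Word-level co-attention (illustrative): each scene-graph node
    attends over the question words and is refined with the attended
    question context."""
    scores = node_feats @ word_embs.T        # (N, T) node-word affinities
    attn = softmax(scores, axis=-1)          # attention over question words
    q_ctx = attn @ word_embs                 # (N, d) per-node question context
    return node_feats + q_ctx                # question-aware node features

def gat_step(node_feats, adj):
    """One graph-attention message-passing step (illustrative):
    attention logits come from feature similarity, non-edges are masked."""
    scores = node_feats @ node_feats.T       # (N, N) pairwise logits
    scores = np.where(adj > 0, scores, -1e9) # mask pairs with no edge
    attn = softmax(scores, axis=-1)          # attention over neighbours
    return attn @ node_feats                 # aggregated neighbour messages
```

Iterative reasoning, as the abstract describes it, would correspond to applying `gat_step` repeatedly on the refined graph before a final answer classifier.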