Abstract
Collaborative reasoning for knowledge-based visual question answering is challenging but vital for understanding the features of images and questions. Previous methods either jointly fuse all kinds of features with an attention mechanism or use handcrafted rules to generate a layout for compositional reasoning; both approaches lack an explicit visual reasoning process and introduce a large number of parameters for predicting the correct answer. To conduct visual reasoning on arbitrary image–question pairs, in this paper we propose a novel reasoning model, a question-guided tree structure with a knowledge base (QGTSKB), to address these problems. Our model consists of four neural module networks: an attention model that locates attended regions from image features and question embeddings via an attention mechanism; a gated reasoning model that forgets and updates the fused features; a fusion reasoning model that mines high-level semantics from the attended visual features and the knowledge base; and a knowledge-based fact model that compensates for missing visual and textual information with external knowledge. Our model thus performs visual analysis and reasoning based on tree structures, a knowledge base, and the four neural module networks. Experimental results show that our model achieves superior performance over existing methods on the VQA v2.0 and CLEVR datasets, and visual reasoning experiments demonstrate the interpretability of the model.
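To make the composition of the four modules concrete, here is a minimal PyTorch sketch of bottom-up reasoning over a question-parse tree. The class names, dimensions, and the GRU-style gate are our illustrative assumptions, not the paper's actual implementation; the knowledge-base fact is stubbed as a pre-retrieved vector.

```python
import torch
import torch.nn as nn

class TreeNode:
    """A node in the question-guided tree (hypothetical structure)."""
    def __init__(self, word_emb, children=None):
        self.word_emb = word_emb          # embedding of the word at this node
        self.children = children or []

class QGTSKBSketch(nn.Module):
    """Bottom-up reasoning over a question-parse tree (illustrative only)."""
    def __init__(self, d_img=512, d_txt=300, d_hid=512, d_kb=300, n_ans=1000):
        super().__init__()
        # attention model: scores image regions against the node's word
        self.att = nn.Sequential(nn.Linear(d_img + d_txt, d_hid),
                                 nn.ReLU(), nn.Linear(d_hid, 1))
        # gated reasoning model: GRU-style forget/update of the running state
        self.gate = nn.GRUCell(d_img, d_hid)
        # fusion reasoning model: mixes attended visual features with a KB fact
        self.fuse = nn.Linear(d_img + d_kb, d_img)
        self.classifier = nn.Linear(d_hid, n_ans)

    def forward(self, node, img_feats, kb_fact, h=None):
        """img_feats: (R, d_img) region features; kb_fact: (d_kb,) retrieved fact."""
        if h is None:
            h = torch.zeros(1, self.gate.hidden_size)
        # recurse into children first (bottom-up over the tree)
        for child in node.children:
            h = self.forward(child, img_feats, kb_fact, h)
        # attention model: attend over regions using this node's word encoding
        q = node.word_emb.expand(img_feats.size(0), -1)
        w = torch.softmax(self.att(torch.cat([img_feats, q], -1)), dim=0)
        v = (w * img_feats).sum(0)                        # attended visual feature
        # fusion reasoning model: inject external knowledge
        v = torch.relu(self.fuse(torch.cat([v, kb_fact], -1)))
        # gated reasoning model: forget/update the fused features
        return self.gate(v.unsqueeze(0), h)

# toy usage with random features (36 regions, as in common bottom-up detectors)
root = TreeNode(torch.randn(300), [TreeNode(torch.randn(300))])
model = QGTSKBSketch()
h = model(root, torch.randn(36, 512), torch.randn(300))
logits = model.classifier(h)   # distribution over the answer vocabulary
```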
Highlights
Visual question answering (VQA) is a field at the intersection of computer vision and natural language processing that has emerged only in recent years
In conventional approaches, image features extracted by a convolutional neural network (CNN) are fused with encoded text features, and the fused features are fed to an artificial neural network (see the sketch after this list)
We propose a reasoning model based on a question-guided tree structure with a knowledge base (QGTSKB) to address these problems
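The conventional fusion pipeline referenced above can be summarized in a few lines. The following is a minimal sketch assuming ResNet-18 image features and a GRU question encoder; both encoders and all dimensions are placeholders, not necessarily those used in prior methods.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class JointFusionVQA(nn.Module):
    """Conventional baseline: CNN image features and an RNN question encoding
    are concatenated and classified over a fixed answer vocabulary."""
    def __init__(self, vocab=10000, d_txt=512, n_ans=1000):
        super().__init__()
        cnn = models.resnet18(weights=None)
        self.cnn = nn.Sequential(*list(cnn.children())[:-1])  # drop the FC head
        self.emb = nn.Embedding(vocab, 300)
        self.rnn = nn.GRU(300, d_txt, batch_first=True)
        self.classifier = nn.Sequential(
            nn.Linear(512 + d_txt, 1024), nn.ReLU(), nn.Linear(1024, n_ans))

    def forward(self, image, question_ids):
        v = self.cnn(image).flatten(1)               # (B, 512) image features
        _, q = self.rnn(self.emb(question_ids))      # final hidden: (1, B, d_txt)
        fused = torch.cat([v, q.squeeze(0)], -1)     # joint feature fusion
        return self.classifier(fused)                # answer as classification

# toy usage: batch of 2 images and 12-token questions
model = JointFusionVQA()
logits = model(torch.randn(2, 3, 224, 224), torch.randint(0, 10000, (2, 12)))
```

This is precisely the multi-label classification framing the Summary below criticizes: the fused feature is mapped to an answer distribution in one shot, with no intermediate reasoning steps.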
Summary
Visual question answering (VQA) is a field at the intersection of computer vision and natural language processing that has emerged only in recent years. In conventional approaches, image features extracted by a convolutional neural network (CNN) are fused with encoded text features, and the fused features are fed to an artificial neural network. These methods turn visual question answering into a multi-label classification task. We propose a reasoning model based on a question-guided tree structure with a knowledge base (QGTSKB) to address these problems. The attention model uses the word encodings of the current tree node and fuses the attention map of the child node with the relationship between words from the knowledge base to extract local visual evidence for explicit reasoning. The attention map of each node serves as a qualitative experimental result in the process of explicit visual reasoning, which further shows that our model is interpretable and adapts well to different tasks.
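A minimal sketch of the per-node attention step described above, assuming precomputed region features and a scalar word-relatedness score retrieved from the knowledge base. The name `NodeAttention`, the `kb_rel` score, and the additive fusion of the child's attention map are hypothetical choices for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class NodeAttention(nn.Module):
    """Scores image regions with the current node's word encoding, then fuses
    the child node's attention map, weighted by a KB word-relation score."""
    def __init__(self, d_img=512, d_txt=300, d_hid=256):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(d_img + d_txt, d_hid),
                                   nn.ReLU(), nn.Linear(d_hid, 1))

    def forward(self, regions, word_enc, child_map=None, kb_rel=0.0):
        """regions: (R, d_img); word_enc: (d_txt,);
        child_map: (R, 1) attention logits from a child node;
        kb_rel: scalar relatedness of the two words, retrieved from the KB."""
        q = word_enc.expand(regions.size(0), -1)
        logits = self.score(torch.cat([regions, q], -1))   # (R, 1)
        if child_map is not None:
            logits = logits + kb_rel * child_map           # fuse child evidence
        att = torch.softmax(logits, dim=0)
        # return the attention map (local visual evidence) and attended feature
        return att, (att * regions).sum(0)
```

Because each node emits its own attention map, those maps can be visualized along the tree, which is the source of the interpretability claim above.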