Abstract
Visual Question Answering (VQA) is a challenging task that has received increasing attention from both the computer vision and the natural language processing communities. A VQA model combines visual and textual features in order to answer questions grounded in an image. Current works in VQA focus on questions which are answerable by direct analysis of the question and image alone. We present a concept-aware algorithm, ConceptBert, for questions which require common sense or basic factual knowledge from external structured content. Given an image and a question in natural language, ConceptBert uses visual elements of the image and a Knowledge Graph (KG) to infer the correct answer. We introduce a multi-modal representation which learns a joint Concept-Vision-Language embedding inspired by the popular BERT architecture. We use the ConceptNet KG to encode common-sense knowledge and evaluate our methodology on the Outside Knowledge-VQA (OK-VQA) and VQA datasets.
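As a rough illustration of what combining the three modalities can look like, the sketch below projects pre-extracted image-region features, BERT question embeddings, and ConceptNet concept vectors into a shared space, pools them, and feeds the result to an answer classifier. All module names, dimensions, and the simple late-fusion strategy are assumptions made for this sketch; ConceptBert itself learns a joint embedding with BERT-style transformer modules rather than plain concatenation.

```python
import torch
import torch.nn as nn

class ConceptVisionLanguageFusion(nn.Module):
    """Minimal tri-modal fusion sketch (not the paper's architecture):
    image-region features, BERT token embeddings for the question, and
    ConceptNet entity vectors are projected to a shared space, pooled,
    and concatenated before the answer classifier."""

    def __init__(self, vis_dim=2048, txt_dim=768, kg_dim=300,
                 hidden_dim=1024, num_answers=3129):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, hidden_dim)  # region features (e.g. Faster R-CNN)
        self.txt_proj = nn.Linear(txt_dim, hidden_dim)  # BERT question embeddings
        self.kg_proj = nn.Linear(kg_dim, hidden_dim)    # ConceptNet concept vectors
        self.classifier = nn.Sequential(
            nn.Linear(3 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_answers),
        )

    def forward(self, vis_feats, txt_feats, kg_feats):
        # Mean-pool each modality over its sequence dimension, then concatenate.
        v = self.vis_proj(vis_feats).mean(dim=1)
        t = self.txt_proj(txt_feats).mean(dim=1)
        k = self.kg_proj(kg_feats).mean(dim=1)
        return self.classifier(torch.cat([v, t, k], dim=-1))

# Random tensors stand in for pre-extracted features.
model = ConceptVisionLanguageFusion()
logits = model(torch.randn(2, 36, 2048),  # 36 image regions
               torch.randn(2, 20, 768),   # 20 question tokens
               torch.randn(2, 20, 300))   # ConceptNet vectors per token
print(logits.shape)  # torch.Size([2, 3129])
```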
Highlights
Visual Question Answering (VQA) was first introduced to bridge the gap between natural language processing and image understanding in the joint space of vision and language (Malinowski and Fritz, 2014). Most VQA benchmarks compute a question representation using word-embedding techniques and Recurrent Neural Networks (RNNs), together with a set of object descriptors comprising bounding-box coordinates and image feature vectors
Adding the Knowledge Graph (KG) embeddings to the model leads to a gain of 11.56% and 7.19% on the VQA and Outside Knowledge-VQA (OK-VQA) datasets, respectively
Since we report our results on the validation set, we removed it from the training phase so that the model relies only on the training set
Summary
Visual Question Answering (VQA) was first introduced to bridge the gap between natural language processing and image understanding in the joint space of vision and language (Malinowski and Fritz, 2014). Most VQA benchmarks compute a question representation using word-embedding techniques and Recurrent Neural Networks (RNNs), together with a set of object descriptors comprising bounding-box coordinates and image feature vectors. Word and image representations are fused and fed to a network to train a VQA model. These approaches are practical when no knowledge beyond the visual content is required. Incorporating external knowledge introduces several advantages. External knowledge and supporting facts can improve the relational representation between objects detected in the image, or between entities in the question and objects in the image. It also provides information on how the answer can be derived from the question. Finally, the complexity of the questions can be increased based on the supporting knowledge base
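For concreteness, a minimal version of the generic pipeline described above (an RNN question encoder, pooled object descriptors built from region features and bounding boxes, and a simple fusion feeding an answer classifier) might look like the following sketch. The vocabulary size, feature dimensions, and element-wise fusion are illustrative assumptions, not the exact settings of any particular benchmark model.

```python
import torch
import torch.nn as nn

class SimpleVQABaseline(nn.Module):
    """Illustrative VQA baseline: a GRU over word embeddings encodes the
    question, pooled object descriptors (region features + bounding boxes)
    encode the image, and the two are fused by element-wise product before
    answer classification. Sizes are assumed for the sketch."""

    def __init__(self, vocab_size=10000, emb_dim=300, hidden_dim=512,
                 region_dim=2048 + 4, num_answers=3129):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.img_proj = nn.Linear(region_dim, hidden_dim)
        self.classifier = nn.Linear(hidden_dim, num_answers)

    def forward(self, question_ids, region_feats, boxes):
        _, h = self.rnn(self.embed(question_ids))       # question encoding
        q = h.squeeze(0)
        v = self.img_proj(torch.cat([region_feats, boxes], dim=-1)).mean(dim=1)
        return self.classifier(q * v)                   # element-wise fusion

model = SimpleVQABaseline()
logits = model(torch.randint(0, 10000, (2, 14)),  # tokenized question
               torch.randn(2, 36, 2048),          # 36 region features
               torch.randn(2, 36, 4))             # bounding-box coordinates
print(logits.shape)  # torch.Size([2, 3129])
```

Such purely visual-textual baselines are exactly the case the summary contrasts with: they have no channel for the external facts that a knowledge graph supplies.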