Abstract

Interacting with an image through dialogue is one of the challenging applications of vision-language models. Image question answering allows us to interact with an image in the form of questions and answers: ask any question about the image and the machine generates an answer in natural language. Not all questions are image-dependent; some require external knowledge. Integrating external knowledge into an image question-answering model remains an open research area. A novel knowledge-incorporated image question-answering model based on a transformer with deep co-attention is proposed. The model leverages the structured knowledge present in ConceptNet. Important objects are extracted from the image and important keywords from the question, and these extracted objects and keywords are used to retrieve related concepts from ConceptNet. The top five most related concepts are retained for further processing. A novel transformer-based attention mechanism is introduced to combine this external knowledge with the visual question answering (VQA) model. The proposed model is evaluated on the VQA 2.0 dataset. The experimental results show that incorporating the external knowledge base into the VQA model allows it to answer more complex open-domain questions, achieving an accuracy of 67.97% on the VQA validation set.
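To make the knowledge-retrieval step concrete, the sketch below shows one plausible way to fetch the top five most related concepts for an extracted keyword or detected object, assuming the public ConceptNet 5 web API at api.conceptnet.io. This is an illustrative sketch, not the authors' implementation; the function name, the weight-based ranking, and the example terms are assumptions introduced here.

```python
# Illustrative sketch (not the authors' code): query ConceptNet's public
# /related endpoint for a term and keep the top-k concepts by relatedness weight.
# Assumes the public API at http://api.conceptnet.io is reachable.
import requests

API = "http://api.conceptnet.io"

def top_related_concepts(term: str, k: int = 5, lang: str = "en") -> list[tuple[str, float]]:
    """Return up to k (concept_label, weight) pairs most related to `term`."""
    url = f"{API}/related/c/{lang}/{term.replace(' ', '_')}"
    resp = requests.get(url, params={"filter": f"/c/{lang}"}, timeout=10)
    resp.raise_for_status()
    related = resp.json().get("related", [])
    # Each entry looks like {"@id": "/c/en/dog", "weight": 0.7}; sort by weight and keep labels.
    ranked = sorted(related, key=lambda r: r.get("weight", 0.0), reverse=True)
    return [(r["@id"].split("/")[-1], r.get("weight", 0.0)) for r in ranked[:k]]

if __name__ == "__main__":
    # Hypothetical keywords from a question and objects detected in an image.
    for term in ["umbrella", "rain"]:
        print(term, "->", top_related_concepts(term))
```

In the proposed pipeline, concepts retrieved in this way would then be embedded and fused with the image and question features through the transformer-based deep co-attention mechanism described above.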

