Abstract

Interacting with an image through dialog is one of the more challenging applications of vision-language models. Image question answering allows us to interact with an image in the form of questions and answers: ask any question about the image, and the machine generates an answer in natural language. Not all questions are image-dependent; some require external knowledge, and integrating such knowledge into an image question-answering model remains an open research area. A novel knowledge-incorporated image question-answering model based on a transformer with deep co-attention is proposed. The model leverages the structured knowledge in ConceptNet. Important objects are extracted from the image and important keywords from the question; using these extracted objects and text keywords, related concepts are retrieved from ConceptNet, and the top five most related concepts are retained for further processing. A novel transformer-based attention mechanism is introduced to combine this external knowledge with the visual question answering (VQA) model. The proposed model is evaluated on the VQA 2.0 dataset. The experimental results show that incorporating the external knowledge base allows the model to answer more complex open-domain questions, achieving an accuracy of 67.97% on the VQA validation set.
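The abstract does not give implementation details for the knowledge-retrieval step. As a minimal sketch, assuming the public ConceptNet REST API (api.conceptnet.io) and its /related endpoint, fetching the top five concepts related to an extracted object or keyword might look like the following; the function name top_related_concepts and the example terms are hypothetical, not from the paper.

```python
import requests

CONCEPTNET_URL = "http://api.conceptnet.io/related/c/en/{term}"

def top_related_concepts(term, k=5):
    """Query ConceptNet's /related endpoint for the k concepts most
    related to `term`, restricted to English, sorted by relatedness weight."""
    # ConceptNet URIs use lowercase with underscores for multiword terms.
    uri_term = term.lower().replace(" ", "_")
    resp = requests.get(CONCEPTNET_URL.format(term=uri_term),
                        params={"filter": "/c/en"})
    resp.raise_for_status()
    related = resp.json().get("related", [])
    # Each entry looks like {"@id": "/c/en/dog", "weight": 0.97};
    # the API returns them in descending order of weight.
    return [(entry["@id"].split("/")[-1], entry["weight"])
            for entry in related[:k]]

# Hypothetical usage: one object detected in the image ("umbrella")
# and one keyword extracted from the question ("rain").
for term in ["umbrella", "rain"]:
    print(term, "->", top_related_concepts(term))
```

In the proposed pipeline, the terms passed to such a retrieval step would come from the image's detected objects and the question's extracted keywords, and the retrieved concepts would then be fed, alongside the visual and textual features, into the transformer's deep co-attention fusion.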
