Abstract

A Visual Question Answering System (VQAS) is an intelligent system that answers user queries about images. Such a system can help visually impaired users learn about their surroundings, or about images, verbally by asking specific questions. Processing large collections of images and quickly retrieving the required information from them is also a challenging task. Most image retrieval systems rely on captions assigned to the images and match them against user query terms to fetch relevant results; retrieval based on content analysis of natural-language user queries remains a challenging and active area of research. In this paper, we present a natural language query-based Visual Question Answering (VQA) system that retrieves images whose content matches the user query and answers questions about those images. The presented model uses the YOLO (You Only Look Once) object detection model to identify the objects present in an image and to determine its context. The model is tested on a varied set of user queries about different images, and the results show its effectiveness in answering them: it achieves an average accuracy of 96% on the considered test image data set and user queries.
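To make the described pipeline concrete, the sketch below shows one plausible way to wire YOLO detections to simple presence/count questions. This is a minimal illustration, not the authors' implementation: the `ultralytics` package, the `yolov8n.pt` checkpoint, and the query-matching rules are all assumptions, since the abstract does not specify the YOLO version or the matching logic.

```python
# Hypothetical sketch of the abstract's pipeline: detect objects with YOLO,
# then answer presence/count queries by matching query terms against the
# detected class labels. The ultralytics API and weights are assumptions.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")  # assumed pretrained checkpoint

def detect_objects(image_path: str) -> list[str]:
    """Run YOLO on one image and return the detected class labels."""
    result = model(image_path)[0]
    return [result.names[int(cls)] for cls in result.boxes.cls]

def answer_query(image_path: str, query: str) -> str:
    """Answer simple 'is there a ...' / 'how many ...' questions."""
    labels = detect_objects(image_path)
    query = query.lower()
    for label in set(labels):
        if label in query:  # naive term match between query and labels
            if "how many" in query:
                return str(labels.count(label))
            return "yes"
    return "no"

# Example usage (hypothetical image and question):
# print(answer_query("street.jpg", "How many cars are in the image?"))
```

The same detected-label lists could serve the retrieval side of the system: an image is returned when its labels overlap the query terms, which mirrors the content-match behavior the abstract describes.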
