Abstract

We propose the task of free-form and open-ended Visual Question Answering (VQA). Given an image and a natural language question about the image, the task is to provide an accurate natural language answer. Mirroring real-world scenarios, such as helping the visually impaired, both the questions and answers are open-ended. Visual questions selectively target different areas of an image, including background details and underlying context. As a result, a system that succeeds at VQA typically needs a more detailed understanding of the image and more complex reasoning than a system producing generic image captions. Moreover, VQA is amenable to automatic evaluation, since many open-ended answers contain only a few words, or a closed set of answers can be provided in a multiple-choice format. We provide a dataset containing ~0.25M images, ~0.76M questions, and ~10M answers and discuss the information it provides. Numerous baselines for VQA are provided and compared with human performance. In addition, this model introduces a voice interface: spoken questions are converted to text with speech recognition, and answers are spoken back using Google Text-to-Speech with audio playback handled by the pygame module.
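The abstract names the SpeechRecognition, gTTS, and pygame packages for the voice interface but gives no implementation details. The following is a minimal sketch, assuming those packages are installed; the function names, the placeholder answer, and the temporary MP3 path are illustrative and not taken from the original system.

```python
# Minimal sketch of a voice interface for VQA: microphone question -> text,
# and spoken answer via Google Text-to-Speech played back with pygame.
import speech_recognition as sr
from gtts import gTTS
import pygame


def transcribe_question() -> str:
    """Capture a spoken question from the microphone and convert it to text."""
    recognizer = sr.Recognizer()
    with sr.Microphone() as source:
        recognizer.adjust_for_ambient_noise(source)
        audio = recognizer.listen(source)
    # Google's free web speech API; raises sr.UnknownValueError if unclear.
    return recognizer.recognize_google(audio)


def speak_answer(answer: str, mp3_path: str = "answer.mp3") -> None:
    """Synthesize the answer with gTTS and play it through pygame's mixer."""
    gTTS(text=answer, lang="en").save(mp3_path)
    pygame.mixer.init()
    pygame.mixer.music.load(mp3_path)
    pygame.mixer.music.play()
    while pygame.mixer.music.get_busy():
        pygame.time.Clock().tick(10)  # wait for playback to finish


if __name__ == "__main__":
    question = transcribe_question()
    print("Question:", question)
    # The actual VQA model would produce the answer here; we echo a placeholder.
    speak_answer("This is a placeholder answer to: " + question)
```

In a full system, the transcribed question and an image would be passed to the VQA model, and its predicted answer would replace the placeholder string above.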
