Abstract

This paper aims to assist visually impaired people through Deep Learning (DL) by providing a system that can describe the user's surroundings and answer questions about them. The system consists mainly of two models: an Image Captioning (IC) model and a Visual Question Answering (VQA) model. The IC model is a Convolutional Neural Network and Recurrent Neural Network based architecture that incorporates a form of attention while captioning. For the VQA task, this paper proposes two models, one Multi-Layer Perceptron based and one Long Short-Term Memory (LSTM) based, that answer questions related to the input image. The IC model achieves an average BLEU-1 score of 0.46, and the LSTM-based VQA model gives an overall accuracy of 47 percent. These two models are integrated with Speech-to-Text and Text-to-Speech components to form a single system that works in real time.
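As a rough illustration of the integration described above, the sketch below wires hypothetical wrappers for the two models (CaptioningModel and VQAModel, stand-ins for the CNN-RNN attention captioner and the LSTM-based VQA model; their internals are not shown) together with off-the-shelf Speech-to-Text and Text-to-Speech libraries (speech_recognition and pyttsx3, assumed choices not taken from the paper) into a single assist loop. It is a minimal sketch of the pipeline shape, not the paper's implementation.

```python
"""Minimal sketch of the assistive pipeline: describe the scene aloud,
then listen for a spoken question and answer it. Model classes are
hypothetical placeholders for the paper's IC and VQA networks."""

import pyttsx3                    # Text-to-Speech engine (assumed library choice)
import speech_recognition as sr   # Speech-to-Text (assumed library choice)


class CaptioningModel:
    """Placeholder for the CNN-RNN image captioning model with attention."""
    def describe(self, image) -> str:
        return "a person walking across a street"  # stub caption


class VQAModel:
    """Placeholder for the LSTM-based visual question answering model."""
    def answer(self, image, question: str) -> str:
        return "yes"  # stub answer


def assist(image) -> None:
    captioner, vqa = CaptioningModel(), VQAModel()
    tts = pyttsx3.init()
    recognizer = sr.Recognizer()

    # 1. Describe the surroundings and speak the caption aloud.
    tts.say(captioner.describe(image))
    tts.runAndWait()

    # 2. Listen for a spoken question, transcribe it, and speak the answer.
    with sr.Microphone() as source:
        audio = recognizer.listen(source)
    question = recognizer.recognize_google(audio)
    tts.say(vqa.answer(image, question))
    tts.runAndWait()
```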
