Abstract

Visual Question Answering (VQA) is a stimulating process in the field of Natural Language Processing (NLP) and Computer Vision (CV). In this process machine can find an answer to a natural language question which is related to an image. Question can be open-ended or multiple choice. Datasets of VQA contain mainly three components; questions, images and answers. Researchers overcome the VQA problem with deep learning based architecture that jointly combines both of two networks i.e. Convolution Neural Network (CNN) for visual (image) representation and Recurrent Neural Network (RNN) with Long Short Time Memory (LSTM) for textual (question) representation and trained the combined network end to end to generate the answer. Those models are able to answer the common and simple questions that are directly related to the image’s content. But different types of questions need different level of understanding to produce correct answers. To solve this problem, we use faster Region based-CNN (R-CNN) for extracting image features with an extra fully connected layer whose weights are dynamically obtained by LSTMs cell according to the question. We claim in this paper that a single R-CNN architecture can solve the problems related to VQA by modifying weights in the parameter prediction layer. Authors trained the network end to end by Stochastic Gradient Descent (SGD) using pretrained faster R-CNN and LSTM and tested it on benchmark datasets of VQA.

Highlights

  • UNDERSTANDING an image by the help of computer vision or image processing technique is a complex procedure studied in the two last eras

  • Deep learning architectures constructed by knowledge artificial neural networks have enhanced visual image understanding [1, 2, 3].Object recognition from an image is done by Convolutional Neural Network (CNN)

  • This paper focuses on a deep learning based model for both open ended and multiple choice questions

Read more

Summary

Introduction

UNDERSTANDING an image by the help of computer vision or image processing technique is a complex procedure studied in the two last eras. The researchers all over the world applied the process of image understanding to solve the problem of Visual Question Answering by the machine learning. Deep learning architectures constructed by knowledge artificial neural networks have enhanced visual image understanding [1, 2, 3].Object recognition from an image is done by Convolutional Neural Network (CNN). CNN performs feature representation of the image and LSTMs process the representation of question and answer. The researchers directly combined both networks and trained end to end to generate the answer [20, 21]. This kind of approach is able to answer the common and simple questions that are related to the image’s content i.e. This kind of approach is able to answer the common and simple questions that are related to the image’s content i.e. ‘What is the Regular Issue

Methods
Results
Conclusion
Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.