Abstract

In the visual question answering (VQA) task, a convolutional neural network (CNN) is typically used to extract image features and a recurrent neural network (RNN) to represent the question. Instead of an RNN, we use a CNN for question representation: a CNN is more effective at capturing interactions between the image and the question words that an RNN does not express. In this paper, we therefore use three CNNs for the VQA task: one extracts visual features, a second extracts question features, and a third combines the two extracted feature vectors. A softmax layer then generates the answer to a given question. The proposed VQA model is evaluated on the DAQUAR, COCO-QA, and VQA2.0 datasets.

Keywords: CNN, RNN, DAQUAR, COCO-QA, VQA2.0
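The three-branch pipeline described above can be sketched minimally in numpy. This is an illustrative sketch, not the paper's implementation: all dimensions are hypothetical, the image branch is replaced by a pre-pooled feature vector standing in for a pretrained CNN, and the fusion "CNN" is simplified to a single dense layer over the concatenated image and question features.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d_relu(x, kernels):
    """Valid 1-D convolution over a (seq_len, emb) question matrix with
    kernels of shape (n_filters, k, emb); returns (seq_len-k+1, n_filters)."""
    n_f, k, emb = kernels.shape
    seq_len = x.shape[0]
    out = np.empty((seq_len - k + 1, n_f))
    for i in range(seq_len - k + 1):
        window = x[i:i + k]                                   # (k, emb)
        out[i] = np.tensordot(kernels, window, axes=([1, 2], [0, 1]))
    return np.maximum(out, 0.0)                               # ReLU

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical dimensions, not taken from the paper.
emb_dim, seq_len, n_filters, n_answers = 16, 8, 12, 10

# Question branch: 1-D CNN over word embeddings, max-pooled over time.
question = rng.normal(size=(seq_len, emb_dim))
q_kernels = rng.normal(size=(n_filters, 3, emb_dim)) * 0.1
q_feat = conv1d_relu(question, q_kernels).max(axis=0)         # (n_filters,)

# Image branch: stand-in for a pretrained image CNN's pooled features.
img_feat = rng.normal(size=(n_filters,))

# Fusion layer: a dense layer over the concatenated features stands in
# for the third (multimodal) CNN, followed by the softmax answer layer.
W = rng.normal(size=(n_answers, 2 * n_filters)) * 0.1
answer_probs = softmax(W @ np.concatenate([img_feat, q_feat]))
print(answer_probs.shape)
```

The answer is then read off as the class with the highest probability; in the actual model all three CNNs would be trained jointly on the QA pairs.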
