Abstract

In the visual question answering (VQA) task, a convolutional neural network (CNN) is typically used to extract image features and a recurrent neural network (RNN) to represent the question. Instead of an RNN, we use a CNN for question representation: a CNN is more effective at capturing interactions between the image and the question words that an RNN does not express. In this paper, we therefore use three CNNs for the VQA task: one extracts visual features, a second extracts question features, and a third combines the two extracted feature vectors. A softmax layer then generates the answer to a given question. The proposed VQA model is evaluated on the DAQUAR, COCO-QA, and VQA2.0 datasets.

Keywords: CNN, RNN, DAQUAR, COCO-QA, VQA2.0
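The three-branch pipeline described above can be sketched minimally in numpy. This is an illustrative sketch, not the paper's implementation: all dimensions are hypothetical, the image branch is replaced by a pre-pooled feature vector standing in for a pretrained CNN, and the fusion "CNN" is simplified to a single dense layer over the concatenated image and question features.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d_relu(x, kernels):
    """Valid 1-D convolution over a (seq_len, emb) question matrix with
    kernels of shape (n_filters, k, emb); returns (seq_len-k+1, n_filters)."""
    n_f, k, emb = kernels.shape
    seq_len = x.shape[0]
    out = np.empty((seq_len - k + 1, n_f))
    for i in range(seq_len - k + 1):
        window = x[i:i + k]                                   # (k, emb)
        out[i] = np.tensordot(kernels, window, axes=([1, 2], [0, 1]))
    return np.maximum(out, 0.0)                               # ReLU

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical dimensions, not taken from the paper.
emb_dim, seq_len, n_filters, n_answers = 16, 8, 12, 10

# Question branch: 1-D CNN over word embeddings, max-pooled over time.
question = rng.normal(size=(seq_len, emb_dim))
q_kernels = rng.normal(size=(n_filters, 3, emb_dim)) * 0.1
q_feat = conv1d_relu(question, q_kernels).max(axis=0)         # (n_filters,)

# Image branch: stand-in for a pretrained image CNN's pooled features.
img_feat = rng.normal(size=(n_filters,))

# Fusion layer: a dense layer over the concatenated features stands in
# for the third (multimodal) CNN, followed by the softmax answer layer.
W = rng.normal(size=(n_answers, 2 * n_filters)) * 0.1
answer_probs = softmax(W @ np.concatenate([img_feat, q_feat]))
print(answer_probs.shape)
```

The answer is then read off as the class with the highest probability; in the actual model all three CNNs would be trained jointly on the QA pairs.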
