Abstract

Visual question answering combines the fields of computer vision and natural language processing and has received much attention in recent years. Image question answering (Image QA) aims to automatically answer questions about the visual content of an image. Unlike Image QA, video question answering (Video QA) must reason over a sequence of images to answer the question, and it is difficult to focus on the local region features relevant to the question across that sequence. In this paper, we propose a forget memory network (FMN) for Video QA to address this problem. When the forget memory network embeds the video frame features, it selects the local region features that are related to the question and forgets the features that are irrelevant to it. We then use the embedded video and question features to predict the answer from a set of multiple-choice candidates. Our proposed approach achieves good performance on the MovieQA [21] and TACoS [28] datasets.
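
The following is a minimal sketch, assuming a gating-style formulation, of how a question could modulate per-frame region features so that irrelevant regions are suppressed before pooling; all module names, dimensions, and the scoring step are illustrative assumptions and not the authors' implementation.

```python
import torch
import torch.nn as nn


class ForgetMemorySketch(nn.Module):
    """Illustrative question-guided "forget" gate over frame region features.

    Assumed shapes: regions (frames, regions_per_frame, region_dim),
    question (question_dim,). Layer names and sizes are hypothetical.
    """

    def __init__(self, region_dim=2048, question_dim=512, hidden_dim=512):
        super().__init__()
        self.region_proj = nn.Linear(region_dim, hidden_dim)
        self.question_proj = nn.Linear(question_dim, hidden_dim)
        self.forget_gate = nn.Linear(hidden_dim * 2, 1)

    def forward(self, regions, question):
        r = torch.tanh(self.region_proj(regions))      # embed region features
        q = torch.tanh(self.question_proj(question))   # embed the question
        q_exp = q.expand(r.size(0), r.size(1), -1)     # broadcast over frames/regions
        # Gate close to 0 "forgets" a region, close to 1 keeps it.
        gate = torch.sigmoid(self.forget_gate(torch.cat([r, q_exp], dim=-1)))
        kept = gate * r
        video_emb = kept.sum(dim=1).mean(dim=0)        # pool regions, then frames
        return video_emb, q


# Usage sketch: score multiple-choice answers by similarity to the fused
# video + question representation (scoring scheme is also an assumption).
model = ForgetMemorySketch()
regions = torch.randn(16, 36, 2048)        # 16 frames, 36 regions each
question = torch.randn(512)
answers = torch.randn(5, 512)               # 5 candidate answer embeddings
video_emb, q_emb = model(regions, question)
scores = answers @ (video_emb + q_emb)      # pick the highest-scoring candidate
prediction = scores.argmax().item()
```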
