A comparative study of language transformers for video question answering

Zekun Yang, Noa Garcia, Chenhui Chu, Mayu Otani, Yuta Nakashima, Haruo Takemura

doi:10.1016/j.neucom.2021.02.092


Open Access

https://doi.org/10.1016/j.neucom.2021.02.092


Journal: Neurocomputing	Publication Date: Mar 10, 2021
Citations: 12	License type: elsevier-specific: oa user license

Affiliation: Osaka University, CyberAgent (Japan)

Abstract

With the goal of correctly answering questions about images or videos, visual question answering (VQA) has developed quickly in recent years. However, current VQA systems mainly focus on answering questions about a single image and face many challenges when answering video-based questions. Video question answering requires not only understanding how content evolves across video frames but also understanding the corresponding subtitles. In this paper, we propose a language Transformer-based video question answering model that encodes the complex semantics of video clips. Unlike previous models, which represent visual features with recurrent neural networks, our model encodes visual concept sequences with a pre-trained language Transformer. We evaluate our model with four language Transformers on two datasets. The results show substantial improvements over previous work.
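
As a rough illustration of what "encoding visual concept sequences with a language Transformer" could look like at the input level, the sketch below flattens per-frame visual concepts into a word sequence and joins it with the subtitles and the question. This is not the authors' implementation: the function name `build_transformer_input`, the segment order, and the BERT-style `[CLS]`/`[SEP]` markers are all assumptions made for illustration, and the abstract does not specify any of them.

```python
from typing import List


def build_transformer_input(
    frame_concepts: List[List[str]],
    subtitles: List[str],
    question: str,
    cls_token: str = "[CLS]",  # assumed BERT-style marker, not from the paper
    sep_token: str = "[SEP]",  # assumed BERT-style marker, not from the paper
) -> str:
    """Join visual concepts, subtitles, and a question into one text sequence.

    Hypothetical helper: the concepts detected in each frame are written out
    as plain words in temporal order, so that a pre-trained language
    Transformer can read them like ordinary text.
    """
    # Flatten the per-frame concept lists while preserving frame order.
    visual_words = [concept for concepts in frame_concepts for concept in concepts]

    segments = [
        " ".join(visual_words),  # visual concept sequence
        " ".join(subtitles),     # subtitle text
        question,                # question text
    ]
    # Prefix with the classification marker and close every segment
    # with a separator marker.
    return f"{cls_token} " + f" {sep_token} ".join(segments) + f" {sep_token}"


if __name__ == "__main__":
    example = build_transformer_input(
        frame_concepts=[["person", "sofa"], ["person", "guitar"]],
        subtitles=["A: Nice song."],
        question="What is the person holding?",
    )
    print(example)
```

The resulting string would then be tokenized and passed to whichever pre-trained language Transformer is under study; that step is omitted here because it depends on the specific model.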

Full Text