Abstract
A video question answering task essentially boils down to fusing textual and visual information effectively to predict an answer. Most works employ a Transformer encoder as a cross-modal encoder that fuses both modalities through the full self-attention mechanism. Due to the high computational cost of self-attention and the high dimensionality of video data, they must settle for either (1) training only the cross-modal encoder on offline-extracted video and text features or (2) training the cross-modal encoder jointly with the video and text feature extractors, but only on sparsely sampled video frames. Training only from offline-extracted features suffers from a disconnect between the extracted features and the data of the downstream task, because the video and text feature extractors are trained independently on different domains, e.g., action recognition for the video feature extractor and semantic classification for the text feature extractor. Training on sparsely sampled video frames may suffer from information loss if the video is information-rich or has many frames. To alleviate these issues, we propose the Lightweight Recurrent Cross-modal Encoder (LRCE), which replaces the self-attention operation with a single learnable special token that summarizes the text and video features. As a result, our model incurs a significantly lower computational cost. Additionally, we perform a novel multi-segment sampling that sparsely samples video frames from different segments of the video to provide more fine-grained information. Through extensive experiments on three VideoQA datasets, we demonstrate that LRCE achieves significant performance gains compared to previous works. The code of our proposed method is available at https://github.com/Sejong-VLI/VQA-LRCE-KBS-2023.
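To make the two ideas in the abstract more concrete, the sketch below illustrates (1) multi-segment frame sampling and (2) summarizing a feature sequence with a single learnable token, whose attention cost grows linearly rather than quadratically with sequence length. This is a minimal, hypothetical illustration, not the authors' implementation (which is available at the repository linked above); the names multi_segment_sample and TokenSummarizer, and all hyperparameters, are assumptions made for this example.

```python
import torch
import torch.nn as nn


def multi_segment_sample(num_frames: int, num_segments: int, frames_per_segment: int) -> torch.Tensor:
    """Sparsely sample frame indices from equally sized segments of a video.

    Hypothetical helper: splits the video into `num_segments` segments and picks
    evenly spaced frames inside each, so all parts of the video are covered.
    """
    boundaries = torch.linspace(0, num_frames, num_segments + 1).long()
    indices = []
    for start, end in zip(boundaries[:-1], boundaries[1:]):
        seg = torch.linspace(start.item(), max(end.item() - 1, start.item()), frames_per_segment).long()
        indices.append(seg)
    return torch.cat(indices)


class TokenSummarizer(nn.Module):
    """Summarize a feature sequence with one learnable query token.

    A single query attending over L keys/values costs O(L), versus the O(L^2)
    cost of full self-attention over the concatenated video-text sequence.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.summary_token = nn.Parameter(torch.randn(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, seq_len, dim) -> summary: (batch, dim)
        query = self.summary_token.expand(features.size(0), -1, -1)
        summary, _ = self.attn(query, features, features)
        return summary.squeeze(1)


if __name__ == "__main__":
    idx = multi_segment_sample(num_frames=300, num_segments=4, frames_per_segment=4)
    feats = torch.randn(2, 20, 512)        # e.g., fused video and text token features
    summary = TokenSummarizer(512)(feats)  # shape: (2, 512)
    print(idx.shape, summary.shape)
```

In this toy setup the summary vector would then be fed to an answer classifier; the actual LRCE architecture, recurrence, and training procedure are described in the paper and repository.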