Abstract
A video question answering task essentially boils down to fusing textual and visual information effectively to predict an answer. Most works employ a Transformer encoder as a cross-modal encoder that fuses both modalities through the full self-attention mechanism. Due to the high computational cost of self-attention and the high dimensionality of video data, they must settle for either (1) training only the cross-modal encoder on offline-extracted video and text features or (2) training the cross-modal encoder jointly with the video and text feature extractors, but only on sparsely sampled video frames. Training only from offline-extracted features suffers from a disconnect between the extracted features and the data of the downstream task, because the video and text feature extractors are trained independently on different domains, e.g., action recognition for the video feature extractor and semantic classification for the text feature extractor. Training on sparsely sampled video frames may suffer from information loss if the video is information-rich or has many frames. To alleviate these issues, we propose the Lightweight Recurrent Cross-modal Encoder (LRCE), which replaces the self-attention operation with a single learnable special token that summarizes the text and video features. As a result, our model incurs a significantly lower computational cost. Additionally, we perform a novel multi-segment sampling that sparsely samples video frames from different segments of the video to provide more fine-grained information. Through extensive experiments on three VideoQA datasets, we demonstrate that LRCE achieves significant performance gains compared to previous works. The code of our proposed method is available at https://github.com/Sejong-VLI/VQA-LRCE-KBS-2023.
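To make the two ideas in the abstract more concrete, the sketch below illustrates (1) multi-segment frame sampling and (2) summarizing a feature sequence with a single learnable token, whose attention cost grows linearly rather than quadratically with sequence length. This is a minimal, hypothetical illustration, not the authors' implementation (which is available at the repository linked above); the names multi_segment_sample and TokenSummarizer, and all hyperparameters, are assumptions made for this example.

```python
import torch
import torch.nn as nn


def multi_segment_sample(num_frames: int, num_segments: int, frames_per_segment: int) -> torch.Tensor:
    """Sparsely sample frame indices from equally sized segments of a video.

    Hypothetical helper: splits the video into `num_segments` segments and picks
    evenly spaced frames inside each, so all parts of the video are covered.
    """
    boundaries = torch.linspace(0, num_frames, num_segments + 1).long()
    indices = []
    for start, end in zip(boundaries[:-1], boundaries[1:]):
        seg = torch.linspace(start.item(), max(end.item() - 1, start.item()), frames_per_segment).long()
        indices.append(seg)
    return torch.cat(indices)


class TokenSummarizer(nn.Module):
    """Summarize a feature sequence with one learnable query token.

    A single query attending over L keys/values costs O(L), versus the O(L^2)
    cost of full self-attention over the concatenated video-text sequence.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.summary_token = nn.Parameter(torch.randn(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, seq_len, dim) -> summary: (batch, dim)
        query = self.summary_token.expand(features.size(0), -1, -1)
        summary, _ = self.attn(query, features, features)
        return summary.squeeze(1)


if __name__ == "__main__":
    idx = multi_segment_sample(num_frames=300, num_segments=4, frames_per_segment=4)
    feats = torch.randn(2, 20, 512)        # e.g., fused video and text token features
    summary = TokenSummarizer(512)(feats)  # shape: (2, 512)
    print(idx.shape, summary.shape)
```

In this toy setup the summary vector would then be fed to an answer classifier; the actual LRCE architecture, recurrence, and training procedure are described in the paper and repository.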