Efficient End-to-End Video Question Answering with Pyramidal Multimodal Transformer

Min Peng, Yu Shi, Chongyang Wang, Xiang-Dong Zhou

doi:10.1609/aaai.v37i2.25296

Abstract

This paper presents a new method for end-to-end Video Question Answering (VideoQA) that departs from the currently popular approach of large-scale pre-training with huge feature extractors. We achieve this with a pyramidal multimodal transformer (PMT) model, which comprises only a learnable word embedding layer and a few convolutional and transformer layers. We use an anisotropic pyramid to carry out video-language interactions across different spatio-temporal scales. In addition to the canonical pyramid, which includes both bottom-up and top-down pathways with lateral connections, we propose novel strategies that decompose the visual feature stream into spatial and temporal sub-streams at different scales and implement their interactions with the linguistic semantics while preserving the integrity of local and global semantics. We demonstrate better or on-par performance with high computational efficiency against state-of-the-art methods on five VideoQA benchmarks. Our ablation study shows the scalability of our model, which achieves competitive results for text-to-video retrieval by leveraging feature extractors with reusable pre-trained weights, and also demonstrates the effectiveness of the pyramid. Code is available at: https://github.com/Trunpm/PMT-AAAI23.

Journal: Proceedings of the AAAI Conference on Artificial Intelligence
Publication Date: Jun 26, 2023
Citations: 1

Similar Papers

Video question answering via grounded cross-attention network learning
Yunan Ye ... Jun Xiao
Information Processing & Management | VOL. 57
16 Apr 2020

Video Question Answering via Attribute-Augmented Attention Network Learning
Yunan Ye ... Zhou Zhao
07 Aug 2017

Open-Ended Video Question Answering via Multi-Modal Conditional Adversarial Networks
Zhou Zhao ... Chujie Lu
IEEE Transactions on Image Processing | VOL. 29
01 Jan 2020

Video Question Answering via Knowledge-based Progressive Spatial-Temporal Attention Network
ACM Transactions on Multimedia Computing, Communications, and Applications | VOL. 15
30 Apr 2019
