STOA-VLP: Spatial-Temporal Modeling of Object and Action for Video-Language Pre-training

Weihong Zhong, Heng Gong, Mao Zheng, Xuan Luo, Duyu Tang, Bing Qin, Xiaocheng Feng

doi:10.1609/aaai.v37i3.25483

Abstract

Large-scale video-language pre-training models, which usually build a global alignment between the video and the text, have achieved remarkable progress on various downstream tasks, but the use of fine-grained information during the pre-training stage has not been well explored. In this work, we propose STOA-VLP, a pre-training framework that jointly models object and action information across spatial and temporal dimensions. More specifically, the model treats object trajectories across frames and multiple action features from the video as fine-grained features. In addition, we design two auxiliary tasks to better incorporate both kinds of information into the pre-training process of the video-language model. The first is the dynamic object-text alignment task, which builds a better connection between object trajectories and the relevant noun tokens. The second is the spatial-temporal action set prediction task, which guides the model to generate consistent action features by predicting the actions found in the text. Extensive experiments on three downstream tasks (video captioning, text-video retrieval, and video question answering) demonstrate the effectiveness of STOA-VLP: compared to previous approaches, it achieves, for example, a 3.7 ROUGE-L improvement on the MSR-VTT video captioning benchmark and a 2.9% accuracy improvement on the MSVD video question answering benchmark.
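
To make the second auxiliary task more concrete, the following is a minimal sketch of "predicting actions found in the text" as a multi-label objective. It is not the authors' implementation: the abstract does not specify the loss, the action vocabulary, or how actions are extracted from captions. The names `ACTION_VOCAB`, `action_targets`, and `set_prediction_loss` are hypothetical, and plain binary cross-entropy is used here only as a simple stand-in for whatever set prediction loss the paper uses.

```python
import math

# Hypothetical action vocabulary; the abstract does not list one.
ACTION_VOCAB = ["run", "jump", "cook", "sing", "play"]


def action_targets(caption, vocab=ACTION_VOCAB):
    """Build a multi-hot target: 1.0 for each vocabulary action present in the caption."""
    words = set(caption.lower().split())
    return [1.0 if action in words else 0.0 for action in vocab]


def set_prediction_loss(probs, targets, eps=1e-7):
    """Mean binary cross-entropy between predicted action probabilities and targets.

    Assumption: a stand-in loss for illustration, not the loss from the paper.
    """
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
    return total / len(targets)


# Actions mentioned in the caption become the prediction targets.
targets = action_targets("a man will cook and sing in the kitchen")

# Predictions that agree with the caption should score a lower loss
# than predictions that contradict it.
good = set_prediction_loss([0.1, 0.1, 0.9, 0.9, 0.1], targets)
bad = set_prediction_loss([0.9, 0.9, 0.1, 0.1, 0.9], targets)
```

In this toy setup the target depends only on the caption, so the video-side action features are pushed toward whatever actions the text mentions, which is the intuition the abstract describes.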


Journal: Proceedings of the AAAI Conference on Artificial Intelligence
Publication Date: Jun 26, 2023
Citations: 1

