Abstract

Video-language pre-training models have recently achieved remarkable results on various multi-modal downstream tasks. However, most of these models rely on contrastive learning or masked modeling to align global features across modalities, neglecting the local associations between video frames and text tokens. This limits the models' ability to perform fine-grained matching and generalization, especially for tasks that require selecting segments from long videos based on query texts. To address this issue, we propose a novel stitching-and-matching pretext task for video-language pre-training that encourages fine-grained interactions between modalities. The task stitches video frames or sentences into longer sequences and predicts the positions of cross-modal queries within the stitched sequences. Individual frame and sentence representations are thus aligned through this stitching-and-matching strategy, promoting fine-grained interactions between videos and texts. We conduct extensive experiments on benchmarks covering text-to-video retrieval, video question answering, video captioning, and moment retrieval. The results demonstrate that the proposed method significantly improves the generalization capacity of video-text pre-training models.
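
To make the stitching-and-matching idea concrete, the following is a minimal sketch of one plausible form of the pretext objective: frames from several videos are concatenated into one long sequence, and the model must predict which positions in the stitched sequence match a sentence query. All function names, dimensions, and the per-position binary scoring head are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def stitch_and_match_loss(frame_emb_list, sent_emb, target_video_idx):
    """
    frame_emb_list:   list of (num_frames_i, dim) frame embeddings, one per video
    sent_emb:         (dim,) embedding of the sentence query
    target_video_idx: index of the video that the sentence actually describes
    """
    # Stitch frames from several videos into one long sequence.
    stitched = torch.cat(frame_emb_list, dim=0)            # (total_frames, dim)

    # Per-position labels: 1 for frames of the queried video, 0 otherwise.
    labels = torch.cat([
        torch.full((f.shape[0],), float(i == target_video_idx))
        for i, f in enumerate(frame_emb_list)
    ])                                                      # (total_frames,)

    # Score every position in the stitched sequence against the query.
    logits = stitched @ sent_emb                            # (total_frames,)

    # Predict which positions belong to the queried video.
    return F.binary_cross_entropy_with_logits(logits, labels)

# Usage with random embeddings (dim = 256, three stitched videos):
frames = [torch.randn(8, 256), torch.randn(12, 256), torch.randn(6, 256)]
query = torch.randn(256)
loss = stitch_and_match_loss(frames, query, target_video_idx=1)
```

The symmetric direction, stitching sentences and locating the one that matches a frame query, would follow the same pattern with the roles of the two modalities swapped.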
