End-to-end Pre-training with Hierarchical Matching and Momentum Contrast for Text-Video Retrieval.

Wenxue Shen, Heng Tao Shen, Xiaosu Zhu, Jingkuan Song, Gongfu Li

doi:10.1109/tip.2023.3275071

Abstract

Video-language pre-training and text-video retrieval have recently attracted significant attention with the explosion of multimedia data on the Internet. However, existing approaches to video-language pre-training typically make limited use of the hierarchical semantic information in videos, such as frame-level semantic information and global video-level semantic information. In this work, we present HMMC, an end-to-end pre-training network with Hierarchical Matching and Momentum Contrast. The key idea is to explore the hierarchical semantic information in videos via multilevel semantic matching between videos and texts. This design is motivated by the observation that if a video semantically matches a text (which can be a title, tag, or caption), the frames in this video usually have semantic connections with the text and show higher similarity to it than frames in other videos. Hierarchical matching is mainly realized by two proxy tasks: Video-Text Matching (VTM) and Frame-Text Matching (FTM). A third proxy task, Frame Adjacency Matching (FAM), is proposed to enhance the single-modality visual representations when training from scratch. Furthermore, we introduce the momentum contrast framework into HMMC to form a multimodal momentum contrast framework, which enables HMMC to incorporate more negative samples for contrastive learning and thereby contributes to the generalization of the representations. We also collected a large-scale Chinese video-language dataset (over 763k unique videos) named CHVTT to explore the multilevel semantic connections between videos and texts. Experimental results on two major text-video retrieval benchmark datasets demonstrate the advantages of our methods. We release our code at https://github.com/cheetah003/HMMC.
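The abstract does not give the loss functions, so the following is only a rough sketch of how video-level and frame-level matching could be combined with a momentum-contrast queue of negatives. It is not the authors' implementation (see the linked repository for that). The InfoNCE loss, the mean pooling of frames, the temperature value, the momentum coefficient, and all function names here are assumptions made for illustration.

```python
import math
from collections import deque


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(dot(v, v)) or 1.0
    return [x / norm for x in v]


def mean_pool(frames):
    """Average frame embeddings into one video-level embedding (assumed pooling)."""
    dim = len(frames[0])
    return [sum(f[i] for f in frames) / len(frames) for i in range(dim)]


def info_nce(query, positive, negatives, temperature=0.07):
    """InfoNCE loss: cross-entropy with the positive pair at index 0 (assumed loss)."""
    q = normalize(query)
    logits = [dot(q, normalize(positive)) / temperature]
    logits += [dot(q, normalize(n)) / temperature for n in negatives]
    peak = max(logits)  # subtract the max for numerical stability
    log_sum = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_sum - logits[0]


def momentum_update(query_params, key_params, m=0.999):
    """MoCo-style update: the key encoder is an exponential moving average
    of the query encoder, so queued keys stay comparable over time."""
    return [m * k + (1.0 - m) * q for q, k in zip(query_params, key_params)]


def hierarchical_loss(text, frames, queue):
    """Sum of a video-text term and an averaged frame-text term (hypothetical weighting)."""
    negatives = list(queue)
    video_term = info_nce(text, mean_pool(frames), negatives)
    frame_term = sum(info_nce(text, f, negatives) for f in frames) / len(frames)
    return video_term + frame_term


# A bounded FIFO queue holds key embeddings from earlier batches as negatives.
queue = deque(maxlen=4096)
queue.append([0.0, 1.0])
loss = hierarchical_loss([1.0, 0.0], [[1.0, 0.0], [0.9, 0.1]], queue)
```

The bounded queue is what lets a momentum-contrast setup use many more negatives than fit in one batch: old keys are dropped automatically once `maxlen` is reached, and the slowly updated key encoder keeps the stored keys consistent with newer ones.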

Journal: IEEE Transactions on Image Processing
Publication Date: Jan 1, 2023
Citations: 2

Similar Papers

Traditional Chinese medicine symptom normalization approach leveraging hierarchical semantic information and text matching with attention mechanism
Qi Jia ... Yonghong Xie
Journal of Biomedical Informatics | VOL. 116 | 22 Feb 2021

Understanding Chinese Video and Language via Contrastive Multimodal Pre-Training
Chenyi Lei ... Yong Liu
17 Oct 2021

SASRT: Semantic-Aware Super-Resolution Transmission for Adaptive Video Streaming over Wireless Multimedia Sensor Networks
Jia Guo ... Xiangyang Gong
Sensors | VOL. 19 | 15 Jul 2019

Towards comprehensive expert finding with a hierarchical matching network
Qiyao Peng ... Minglai Shao
Knowledge-Based Systems | VOL. 257 | 30 Sep 2022
