DI-VTR: Dual inter-modal interaction model for video-text retrieval

Jie Guo, Mengying Wang, Wenwei Wang, Yan Zhou, Bin Song

doi:10.1016/j.jiixd.2024.03.003

Abstract

Video-text retrieval is a challenging task in multimodal information processing because of the semantic gap between modalities. Most existing methods do not fully exploit intra-modal interactions, such as the temporal correlation between video frames, which results in poor matching performance. In addition, the imbalance in semantic information between videos and texts makes the two modalities difficult to align. To address these issues, we propose DI-VTR, a dual inter-modal interaction network for video-text retrieval. To learn the intra-modal interactions among video frames, we design a contextual-related video encoder that produces more fine-grained, content-oriented video representations. We also propose a dual inter-modal interaction module that aligns the video and text modalities more accurately by introducing multilingual text to strengthen the representation ability of the text semantic features. Extensive experiments on commonly used video-text retrieval datasets, including MSR-VTT, MSVD, and VATEX, show that the proposed method achieves significantly better performance than state-of-the-art methods.
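
For readers unfamiliar with the task, the retrieval step itself can be illustrated with a minimal sketch. It is not the DI-VTR model: mean pooling stands in for a learned video encoder, the feature vectors are hand-made toy values, and the names `mean_pool`, `cosine`, and `rank_videos` are hypothetical helpers introduced only for this illustration. It shows only the general idea of ranking videos by the similarity between a text embedding and pooled frame embeddings.

```python
import math


def mean_pool(frames):
    """Average per-frame vectors into one video vector.

    Assumption: a simple stand-in for a learned video encoder,
    not the encoder described in the abstract.
    """
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_videos(text_vec, videos):
    """Return video ids sorted by descending similarity to the text vector."""
    scores = {vid: cosine(text_vec, mean_pool(frames))
              for vid, frames in videos.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy features (hypothetical values, not taken from any dataset).
videos = {
    "v_cooking": [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]],
    "v_soccer": [[0.0, 1.0, 0.0], [0.0, 0.9, 0.1]],
}
query = [0.9, 0.1, 0.0]  # embedding of a text query, assumed given
ranking = rank_videos(query, videos)
print(ranking)
```

Mean pooling ignores frame order, which is precisely the kind of temporal information the abstract says existing methods leave underused; a learned encoder would replace `mean_pool` in practice.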

Journal: Journal of Information and Intelligence
Publication Date: Mar 1, 2024
License type: CC BY-NC-ND

Similar Papers

Towards Developing a Multi-Modal Video Recommendation System
Sriram Pingali ... Prabir Mondal
18 Jul 2022

Multi-Modal Fake News Detection on Social Media with Dual Attention Fusion Networks
Haitian Yang ... Xuan Zhao
05 Sep 2021

HGAN: Hierarchical Graph Alignment Network for Image-Text Retrieval
Jie Guo ... Meiting Wang
IEEE Transactions on Multimedia | VOL. 25
01 Jan 2023

DLI-Net: Dual Local Interaction Network for Fine-Grained Sketch-Based Image Retrieval
Haifeng Sun ... Jianxin Liao
IEEE Transactions on Circuits and Systems for Video Technology | VOL. 32
01 Oct 2022
