Abstract

Cross-modal retrieval between text and video has received consistent research interest in the multimedia community. Existing studies follow the trend of learning a joint embedding space to measure the distance between text and video representations. In common practice, the video representation is constructed by feeding clips into a 3D convolutional neural network for coarse-grained global visual feature extraction. In addition, several studies have attempted to align local objects in the video with the text. However, these representations share a common drawback: they neglect the rich fine-grained relation features that capture spatial-temporal object interactions, which benefit the mapping of textual entities in real-world retrieval systems. To tackle this problem, we propose the adversarial multi-grained embedding network (AME-Net), a novel cross-modal retrieval framework that adopts both fine-grained local relation features and coarse-grained global features to bridge the text and video modalities. Building on this visual representation, we further integrate an adversarial learning strategy into AME-Net to narrow the domain gap between the text and video representations. In summary, we contribute AME-Net with an adversarial learning strategy for learning a better joint embedding space. Experimental results on the MSR-VTT and YouCook2 datasets demonstrate that our proposed framework consistently outperforms the state-of-the-art method.
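
The abstract describes two ideas: fusing coarse-grained global and fine-grained relation features into a single video embedding, and adversarially aligning the text and video embedding distributions. The following is a minimal PyTorch sketch of that combination, not the authors' implementation; the layer sizes, the gradient-reversal discriminator, and names such as `JointEmbedding` and `GradReverse` are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips gradients in the backward pass so the
    encoders learn to fool the modality discriminator (adversarial alignment)."""

    @staticmethod
    def forward(ctx, x):
        return x

    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output


class JointEmbedding(nn.Module):
    # All dimensions below are illustrative assumptions, not values from the paper.
    def __init__(self, text_dim=768, global_dim=2048, relation_dim=512, joint_dim=256):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, joint_dim)
        # Fuse coarse-grained global (e.g. 3D-CNN clip) features with
        # fine-grained relation features before projecting the video side.
        self.video_proj = nn.Linear(global_dim + relation_dim, joint_dim)
        # Discriminator tries to tell text embeddings from video embeddings.
        self.discriminator = nn.Sequential(
            nn.Linear(joint_dim, 128), nn.ReLU(), nn.Linear(128, 2))

    def forward(self, text_feat, global_feat, relation_feat):
        t = F.normalize(self.text_proj(text_feat), dim=-1)
        v = F.normalize(
            self.video_proj(torch.cat([global_feat, relation_feat], dim=-1)), dim=-1)
        # Adversarial branch: gradient reversal pushes both modalities toward
        # an indistinguishable shared distribution in the joint space.
        domain_logits = self.discriminator(GradReverse.apply(torch.cat([t, v], dim=0)))
        return t, v, domain_logits


if __name__ == "__main__":
    model = JointEmbedding()
    text = torch.randn(8, 768)          # e.g. sentence encoder output
    global_feat = torch.randn(8, 2048)  # e.g. 3D-CNN clip feature
    relation = torch.randn(8, 512)      # e.g. pooled object-relation feature
    t, v, logits = model(text, global_feat, relation)
    # A retrieval loss (e.g. triplet ranking on t, v) would be combined with a
    # cross-entropy domain loss on logits against modality labels (0=text, 1=video).
    labels = torch.cat([torch.zeros(8, dtype=torch.long), torch.ones(8, dtype=torch.long)])
    domain_loss = F.cross_entropy(logits, labels)
    print(t.shape, v.shape, domain_loss.item())
```

In this sketch the discriminator is trained to classify modality while the reversed gradients train the projections to make the two modalities indistinguishable, which is one common way to realize the "narrow the domain gap" objective stated above.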
