Abstract

Partially Relevant Video Retrieval (PRVR) aims to retrieve, from a large collection of unlabeled and untrimmed videos, those that are partially relevant to a given query, and is typically formulated as a multiple instance learning problem. PRVR is challenging because it operates on untrimmed videos, a setting much closer to real-world conditions. Existing methods exploit video-text semantic consistency insufficiently and lack the capacity to highlight the semantics of key representations. To tackle these issues, we propose a transferable dual multi-granularity semantic excavating network, called T-D3N, which focuses on enhancing the learning of dual-modal representations. Specifically, we first introduce a novel transferable textual semantic learning strategy by designing an Adaptive Multi-scale Semantic Mining (AMSM) component that excavates salient textual semantics from multiple perspectives. Second, T-D3N distinguishes feature differences at the frame level to better perform contrastive learning between positive and negative samples in the video feature domain, which further separates positive from negative samples and increases the probability that positive samples are retrieved by the query. Finally, our model constructs multi-grained video temporal dependencies and performs cross-grained core feature perception, enabling richer multimodal interactions. Extensive experiments on three benchmarks, i.e., ActivityNet Captions, Charades-STA, and TVR, show that T-D3N achieves state-of-the-art results. Furthermore, we confirm that our model transfers to a broad range of multimodal tasks such as T2VR, VMR, and MMSum.
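As a rough illustration of the frame-wise contrastive objective described above, the sketch below scores each video by its best-matching frame for a query and applies a symmetric InfoNCE loss over a batch. It is a minimal sketch under stated assumptions: the names (frame_feats, query_feats, temperature) and the max-over-frames aggregation are illustrative choices, not T-D3N's actual implementation.

```python
# Illustrative sketch of a frame-wise query-video contrastive objective.
# Assumptions (not from the paper): cosine similarity, max-over-frames
# aggregation, symmetric InfoNCE with in-batch negatives.
import torch
import torch.nn.functional as F


def framewise_contrastive_loss(frame_feats: torch.Tensor,
                               query_feats: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """frame_feats: (B, T, D) frame embeddings; query_feats: (B, D) query embeddings.
    The i-th query is the positive for the i-th video; all other videos in
    the batch serve as negatives."""
    frame_feats = F.normalize(frame_feats, dim=-1)
    query_feats = F.normalize(query_feats, dim=-1)

    # Cosine similarity between every query and every frame of every video:
    # shape (B_query, B_video, T).
    sim = torch.einsum('qd,btd->qbt', query_feats, frame_feats)

    # A partially relevant video is scored by its best-matching frame.
    video_scores = sim.max(dim=-1).values / temperature  # (B_query, B_video)

    targets = torch.arange(video_scores.size(0), device=video_scores.device)
    # Symmetric InfoNCE: query-to-video and video-to-query directions.
    loss_q2v = F.cross_entropy(video_scores, targets)
    loss_v2q = F.cross_entropy(video_scores.t(), targets)
    return 0.5 * (loss_q2v + loss_v2q)
```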
