Text-video retrieval, a fundamental task for associating textual descriptions with video content, has become increasingly important in the video domain. Most existing methods focus on single-modality features, considering only the knowledge within the individual video or text modality and often neglecting cross-modal interactions. However, a text description corresponds to specific spatio-temporal content within a video, involving a certain segment of the frame sequence and distinct sub-regions within those frames. Therefore, we focus on text-conditioned video features to bridge the modality gap. In this paper, we propose Spatio-Temporal Attention for video-text retrieval, termed STAttn, which utilizes textual information to attend to the relevant spatio-temporal video content. The final text-conditioned video features are generated from the text-related video frames and the text-related regions within these frames. First, we propose the Spatial Text-Attention Module (STAM) to learn spatial information within video frames; STAM introduces text-related salient patches to capture more fine-grained details. Second, we propose the Temporal Text-Attention Module (TTAM) to learn temporal relationships between video frames; a Temporal Triplet loss is introduced in TTAM to strengthen attention to text-related frames. Together, the two modules capture text-related spatio-temporal content from both intra-frame and inter-frame perspectives. Extensive experiments on three benchmark datasets, MSRVTT, ActivityNet, and DiDeMo, demonstrate that our STAttn outperforms state-of-the-art methods.
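For intuition only, the sketch below shows one plausible form of the idea described above: a text embedding conditions an attention distribution over frame features, and a triplet-style margin loss pushes attention toward text-related frames. This is a minimal PyTorch sketch under our own assumptions (the class name, tensor shapes, and the frame-relevance mask are hypothetical), not the authors' STAM/TTAM implementation.

```python
# Minimal sketch (assumptions, not the paper's code): text-conditioned
# temporal attention over frame features, plus a triplet-style margin loss
# that encourages higher attention on text-related frames than on others.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextConditionedTemporalAttention(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)   # project the sentence embedding to a query
        self.k_proj = nn.Linear(dim, dim)   # project frame features to keys
        self.scale = dim ** -0.5

    def forward(self, text_emb, frame_feats):
        # text_emb:    (B, D)    sentence-level text feature
        # frame_feats: (B, T, D) per-frame video features
        q = self.q_proj(text_emb).unsqueeze(1)                               # (B, 1, D)
        k = self.k_proj(frame_feats)                                         # (B, T, D)
        attn = torch.softmax((q @ k.transpose(1, 2)) * self.scale, dim=-1)   # (B, 1, T)
        video_feat = (attn @ frame_feats).squeeze(1)                         # (B, D) text-conditioned video feature
        return video_feat, attn.squeeze(1)

def temporal_triplet_loss(attn, relevant_mask, margin: float = 0.2):
    # attn:          (B, T) attention weights over frames
    # relevant_mask: (B, T) float mask, 1.0 for text-related frames (hypothetical supervision)
    pos = (attn * relevant_mask).sum(-1) / relevant_mask.sum(-1).clamp(min=1.0)
    neg = (attn * (1 - relevant_mask)).sum(-1) / (1 - relevant_mask).sum(-1).clamp(min=1.0)
    return F.relu(neg - pos + margin).mean()
```

An analogous attention could be applied spatially over patch features within each frame; the abstract does not specify these details, so the sketch should be read only as an illustration of text-conditioned attention with a margin-based temporal objective.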