Video captioning, which aims to automatically generate accurate and comprehensive descriptions of video content, has recently seen substantial progress driven by bridging video representations and textual semantics. Inspired by the video retrieval task, which learns visual features strongly aligned with text queries, we propose to exploit visual representation learning from the video retrieval framework for video captioning and to construct multi-grained cross-modal matching while extracting visual features. However, directly applying recent video retrieval models fails to capture sufficient temporal detail, and the visual features of local patch tokens in video frames lack the semantic information essential for captioning. These deficiencies arise primarily because such models lack fine-grained interactions between video frames and provide only weak textual supervision over frame patch tokens. To strengthen attention to temporal details, we propose a learnable token shift module that flexibly captures subtle movements in local regions across the temporal sequence. Furthermore, we devise a Refineformer that learns, via a cross-attention mechanism, to integrate the local video patch tokens most relevant to the desired captions. Extensive experiments on MSVD, MSR-VTT, and VATEX demonstrate the favorable performance of our method. Code will be available at https://github.com/tiesanguaixia/IVRC.
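The abstract does not specify how the learnable token shift is parameterized; the following is a minimal PyTorch sketch of one plausible realization, in which a depthwise temporal convolution learns, per channel, how to mix each patch token with its temporal neighbors. The class name `LearnableTokenShift`, the tensor layout, and the depthwise-convolution formulation are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of a learnable temporal token shift for video patch tokens.
# Tokens of shape (batch, frames, patches, channels) are mixed across adjacent
# frames with learnable per-channel weights, so subtle local motion can
# influence each frame's representation.
import torch
import torch.nn as nn


class LearnableTokenShift(nn.Module):
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        # Depthwise 1D convolution over the temporal axis generalizes a fixed
        # "shift part of the channels" operation: each channel learns its own
        # mixing of neighboring frames.
        self.temporal_mix = nn.Conv1d(
            channels, channels, kernel_size,
            padding=kernel_size // 2, groups=channels, bias=False,
        )
        # Initialize to the identity so training starts from "no shift".
        with torch.no_grad():
            self.temporal_mix.weight.zero_()
            self.temporal_mix.weight[:, :, kernel_size // 2] = 1.0

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, N, C) -> mix along T for every patch token independently.
        b, t, n, c = x.shape
        y = x.permute(0, 2, 3, 1).reshape(b * n, c, t)    # (B*N, C, T)
        y = self.temporal_mix(y)
        return y.reshape(b, n, c, t).permute(0, 3, 1, 2)  # back to (B, T, N, C)


if __name__ == "__main__":
    tokens = torch.randn(2, 8, 49, 768)   # 2 clips, 8 frames, 7x7 patches
    shifted = LearnableTokenShift(768)(tokens)
    print(shifted.shape)                  # torch.Size([2, 8, 49, 768])
```

The Refineformer's patch-token integration could analogously be sketched with a standard cross-attention layer (e.g., `nn.MultiheadAttention` with caption-side queries attending over patch tokens), but the paper's exact design is not described in the abstract.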