Video-paragraph retrieval, the cross-modal retrieval between long videos and paragraphs, is a non-trivial task. Unlike traditional video-text retrieval, the video in video-paragraph retrieval usually contains multiple clips. Each clip corresponds to a descriptive sentence, and all the sentences together constitute the paragraph describing the video. Previous methods for video-paragraph retrieval usually encode videos and paragraphs at the segment level (clips and sentences) and the overall level (videos and paragraphs). However, each segment also contains finer-grained content about actions and objects. Hence, we propose a Local-Global Graph Pooling Network (LGGP) via Mutual Information Maximization for video-paragraph retrieval. Our model disentangles videos and paragraphs into four levels: overall-level, segment-level, motion-level, and object-level. We construct a Hierarchical Local Graph (segment-level, motion-level, and object-level) and a Hierarchical Global Graph (overall-level, segment-level, motion-level, and object-level) for semantic interaction among the different levels. Meanwhile, to obtain hierarchical pooling features with fine-grained semantic information, we design hierarchical graph pooling methods that maximize the mutual information between the pooling features and the corresponding graph nodes. We evaluate our model on two video-paragraph retrieval datasets with three different video features. The experimental results show that our model establishes state-of-the-art results for video-paragraph retrieval. Our code will be released at https://github.com/PengchengZhang1997/LGGP .
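The abstract does not spell out how the mutual information between a pooled feature and its graph nodes is maximized. Below is a minimal sketch of one common estimator for this kind of objective (a Deep-InfoMax/DGI-style binary discriminator over positive and negative pooled-node pairs), written in PyTorch. The class name MIGraphPooling, the mean readout, the bilinear discriminator, and all dimensions are assumptions for illustration only, not the authors' implementation.

```python
# Sketch (not the authors' code): maximize mutual information between a pooled
# graph-level feature and the node features it summarizes, in the spirit of the
# paper's hierarchical graph pooling. Positive pairs come from the same graph,
# negatives from another graph; a bilinear discriminator scores each pair.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MIGraphPooling(nn.Module):
    """Pool node features into one graph-level vector and score
    (pooled, node) pairs with a bilinear discriminator."""

    def __init__(self, dim: int):
        super().__init__()
        self.pool_proj = nn.Linear(dim, dim)        # readout projection
        self.discriminator = nn.Bilinear(dim, dim, 1)

    def forward(self, nodes: torch.Tensor) -> torch.Tensor:
        # nodes: (num_nodes, dim) features of one graph level
        return torch.sigmoid(self.pool_proj(nodes.mean(dim=0)))  # (dim,)

    def mi_loss(self, pooled: torch.Tensor, pos_nodes: torch.Tensor,
                neg_nodes: torch.Tensor) -> torch.Tensor:
        # Positive pairs: pooled vector with nodes of its own graph.
        # Negative pairs: pooled vector with nodes from a different graph.
        g = pooled.unsqueeze(0)
        pos_logits = self.discriminator(g.repeat(pos_nodes.size(0), 1), pos_nodes)
        neg_logits = self.discriminator(g.repeat(neg_nodes.size(0), 1), neg_nodes)
        pos_loss = F.binary_cross_entropy_with_logits(
            pos_logits, torch.ones_like(pos_logits))
        neg_loss = F.binary_cross_entropy_with_logits(
            neg_logits, torch.zeros_like(neg_logits))
        return pos_loss + neg_loss


if __name__ == "__main__":
    dim = 256
    pooler = MIGraphPooling(dim)
    own_nodes = torch.randn(12, dim)     # e.g. object/motion/segment nodes of one video
    other_nodes = torch.randn(15, dim)   # nodes from a different video or paragraph
    pooled = pooler(own_nodes)
    loss = pooler.mi_loss(pooled, own_nodes, other_nodes)
    loss.backward()
    print(float(loss))
```

Minimizing this loss drives the pooled feature to be distinguishable as belonging to its own nodes rather than to foreign ones, which is the usual proxy for maximizing mutual information between a summary vector and its constituents.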