Video paragraph captioning aims to describe a video that contains multiple events with a paragraph of coherent generated sentences. The task is challenging because it demands both strong visual-textual relevance and semantic coherence across the whole paragraph. In this work, we introduce a memory-enhanced hierarchical transformer for video paragraph captioning. Our model adopts a hierarchical structure: the outer-layer transformer extracts visual information from a global perspective and captures the relevance between event segments throughout the entire video, while the inner-layer transformer mines local details within each event segment. By exploring both global and local visual information at the video and event levels, our model provides comprehensive visual feature cues for paragraph caption generation. Additionally, we design a memory module that captures similar patterns among the event segments of a video, preserving contextual information across segments and updating its memory state accordingly. Experimental results on two popular datasets, ActivityNet Captions and YouCook2, demonstrate that our proposed model achieves superior performance, generating higher-quality captions while remaining consistent with the video content.