Abstract

Automatic video description generation (a.k.a. video captioning) is one of the ultimate goals for video understanding. Despite its wide range of applications, such as video indexing and retrieval, the video captioning task remains quite challenging due to the complexity and diversity of video content. First, open-domain videos cover a broad range of topics, which results in highly variable vocabularies and expression styles for describing the video contents. Second, videos naturally contain multiple modalities, including image, motion, and acoustic media, and the information provided by each modality varies across videos and conditions. In this paper, we propose a novel topic-guided video captioning model to address the above-mentioned challenges. Our model consists of two joint tasks, namely, latent topic generation and topic-guided caption generation. The topic generation task aims to automatically predict the latent topic of the video. Since there is no ground-truth topic information, we mine multimodal topics in an unsupervised fashion based on video contents and annotated captions, and then distill the topic distribution to a topic prediction model. In the topic-guided generation task, we employ the topic guidance for two purposes. The first is to reduce the language complexity across topics, for which we propose a topic-aware decoder that leverages the latent topics to induce topic-related language models; the decoder is also generic and can be integrated with a temporal attention mechanism. The second is to dynamically attend to important modalities by topic, for which we propose a flexible topic-guided multimodal ensemble framework and use a topic gating network to determine the attention weights. The two tasks are correlated with each other and collaborate to generate more detailed and accurate video captions. Our extensive experiments on two public benchmark datasets, MSR-VTT and Youtube2Text, demonstrate the effectiveness of the proposed topic-guided video captioning system, which achieves state-of-the-art performance on both datasets.
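To make the two topic-guided components more concrete, the following is a minimal PyTorch sketch (not the authors' released code) of a topic gating network that weights modality features by the predicted topic, and a topic-aware decoder conditioned on both the fused video feature and the topic distribution. All module and variable names (TopicGate, TopicAwareDecoder, num_topics, etc.) are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn


class TopicGate(nn.Module):
    """Topic gating network: predicts attention weights over modalities from the topic."""

    def __init__(self, num_topics: int, num_modalities: int):
        super().__init__()
        self.gate = nn.Linear(num_topics, num_modalities)

    def forward(self, topic_dist, modality_feats):
        # topic_dist:     (batch, num_topics)              latent topic distribution
        # modality_feats: (batch, num_modalities, feat_dim) image / motion / acoustic features
        weights = torch.softmax(self.gate(topic_dist), dim=-1)       # (batch, num_modalities)
        fused = (weights.unsqueeze(-1) * modality_feats).sum(dim=1)  # (batch, feat_dim)
        return fused, weights


class TopicAwareDecoder(nn.Module):
    """Decoder step whose input is conditioned on the fused video feature and the topic."""

    def __init__(self, num_topics: int, feat_dim: int, vocab_size: int, hidden_dim: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.rnn = nn.GRUCell(hidden_dim + feat_dim + num_topics, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, prev_word, fused_feat, topic_dist, hidden):
        x = torch.cat([self.embed(prev_word), fused_feat, topic_dist], dim=-1)
        hidden = self.rnn(x, hidden)
        return self.out(hidden), hidden


# Usage with random tensors standing in for real video features.
batch, num_topics, num_modalities, feat_dim, vocab_size = 2, 20, 3, 128, 10000
gate = TopicGate(num_topics, num_modalities)
decoder = TopicAwareDecoder(num_topics, feat_dim, vocab_size)

topic_dist = torch.softmax(torch.randn(batch, num_topics), dim=-1)
modality_feats = torch.randn(batch, num_modalities, feat_dim)
fused, weights = gate(topic_dist, modality_feats)

prev_word = torch.zeros(batch, dtype=torch.long)   # e.g. <BOS> token id
hidden = torch.zeros(batch, 512)
logits, hidden = decoder(prev_word, fused, topic_dist, hidden)
print(logits.shape)  # torch.Size([2, 10000])
```

In this sketch the gating weights depend only on the topic distribution; a temporal attention mechanism over frame-level features, as mentioned in the abstract, would be applied before fusion to obtain the per-modality feature vectors.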
