Abstract

Video captioning aims to automatically generate natural language sentences describing the content of a video. Although encoder-decoder-based models have achieved promising progress, it remains very challenging to effectively model the linguistic behavior of humans when generating video captions. In this paper, we propose a novel video captioning model that learns from the gLobal sEntence and looks AheaD, LEAD for short. Specifically, LEAD consists of two modules: a Vision Module (VM) and a Language Module (LM). The VM is a novel attention network that maps visual features to a high-level language space and models entire sentences explicitly. The LM not only makes effective use of the previously generated sequence when predicting the current word, but also looks ahead to the future word. Based on the VM and LM, LEAD obtains both global sentence information and future word information, making video captioning resemble a fill-in-the-blank task rather than word-by-word sentence generation. In addition, we propose an autonomous strategy and a multi-stage training scheme to optimize the model, which mitigates the problem of information leakage. Extensive experiments show that LEAD outperforms several state-of-the-art methods on MSR-VTT, MSVD, and VATEX, demonstrating the effectiveness of the proposed approach for video captioning. We also make the code of our proposed model publicly available.
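To make the high-level description above more concrete, the following is a minimal, illustrative sketch of the idea as stated in the abstract: a Vision Module that attends over frame features to produce a global sentence-level representation, and a Language Module that predicts each word from the previous words together with a look-ahead cue. All module names, layer choices, dimensions, and wiring here are assumptions introduced for illustration; the paper's actual LEAD architecture, autonomous strategy, and multi-stage training scheme are not reproduced here.

```python
# Illustrative sketch only -- not the authors' implementation.
import torch
import torch.nn as nn


class VisionModule(nn.Module):
    """Attends over frame features and produces a global sentence-level vector."""

    def __init__(self, feat_dim=2048, hid_dim=512):
        super().__init__()
        self.proj = nn.Linear(feat_dim, hid_dim)
        self.attn = nn.MultiheadAttention(hid_dim, num_heads=8, batch_first=True)
        self.query = nn.Parameter(torch.randn(1, 1, hid_dim))  # learned sentence query

    def forward(self, frame_feats):                 # (B, T, feat_dim)
        v = self.proj(frame_feats)                  # map visual features to language space
        q = self.query.expand(v.size(0), -1, -1)    # (B, 1, hid_dim)
        global_sent, _ = self.attn(q, v, v)         # (B, 1, hid_dim)
        return global_sent.squeeze(1)               # (B, hid_dim)


class LanguageModule(nn.Module):
    """Decodes words from the previous context plus a look-ahead (future-word) cue."""

    def __init__(self, vocab_size=10000, hid_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hid_dim)
        self.rnn = nn.GRU(hid_dim, hid_dim, batch_first=True)
        self.lookahead = nn.Linear(hid_dim * 2, hid_dim)   # fuses decoder state + global sentence
        self.out = nn.Linear(hid_dim * 2, vocab_size)

    def forward(self, prev_words, global_sent):     # (B, L), (B, hid_dim)
        h, _ = self.rnn(self.embed(prev_words))     # states over the previous sequence
        g = global_sent.unsqueeze(1).expand_as(h)   # broadcast global sentence information
        future = self.lookahead(torch.cat([h, g], dim=-1))  # rough estimate of the future word
        return self.out(torch.cat([h, future], dim=-1))     # (B, L, vocab_size)


if __name__ == "__main__":
    vm, lm = VisionModule(), LanguageModule()
    feats = torch.randn(2, 20, 2048)                # 2 videos, 20 frames of CNN features each
    words = torch.randint(0, 10000, (2, 12))        # partially generated captions
    logits = lm(words, vm(feats))
    print(logits.shape)                             # torch.Size([2, 12, 10000])
```

In this sketch, conditioning each prediction on both the global sentence vector and the look-ahead cue is what makes generation resemble filling in a blank rather than strictly left-to-right decoding; how LEAD actually forms and trains the look-ahead signal (and avoids information leakage) is specified in the paper, not here.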
