Abstract

As a fundamental problem in visual understanding, video captioning has attracted considerable attention from both the computer vision and natural language processing communities. Despite the recent emergence of video captioning methods, generating diverse and fine-grained video descriptions remains far from solved. To this end, this work makes the following contributions. First, a novel high-quality video captioning system featuring a hierarchical long short-term memory (LSTM) structure and a dual-stage loss is designed to translate videos into sentences. Second, we incorporate a convolutional architecture into our captioning system with the aim of generating diverse and fine-grained descriptions. Third, we propose a novel evaluation metric named LTMS to assess fine-grained captions. Experimental results on the benchmark datasets MSVD and MSR-VTT demonstrate the effectiveness of the proposed model, which achieves superior performance over state-of-the-art methods.
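The abstract gives no implementation details, but the hierarchical encoder-decoder design it names can be illustrated with a minimal PyTorch sketch. Everything below is an assumption, not the authors' implementation: the module names, the dimensions (feat_dim=2048 for CNN frame features, hidden_dim=512, chunk_len=8), and the chunking scheme are hypothetical, and the dual-stage loss is omitted because the abstract does not define it.

```python
import torch
import torch.nn as nn

class HierarchicalVideoEncoder(nn.Module):
    """Two-level LSTM encoder (hypothetical reading of the paper's
    'hierarchical LSTM structure'): a frame-level LSTM summarizes short
    chunks of CNN frame features, and a chunk-level LSTM aggregates the
    chunk summaries into a single video representation."""

    def __init__(self, feat_dim=2048, hidden_dim=512, chunk_len=8):
        super().__init__()
        self.chunk_len = chunk_len
        self.frame_lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.chunk_lstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)

    def forward(self, frames):
        # frames: (B, T, feat_dim), with T divisible by chunk_len
        B, T, D = frames.shape
        chunks = frames.view(B * (T // self.chunk_len), self.chunk_len, D)
        _, (h, _) = self.frame_lstm(chunks)           # h: (1, B*num_chunks, H)
        chunk_feats = h.squeeze(0).view(B, T // self.chunk_len, -1)
        _, state = self.chunk_lstm(chunk_feats)
        return state                                  # initial decoder state

class CaptionDecoder(nn.Module):
    """Word-level LSTM decoder conditioned on the video representation."""

    def __init__(self, vocab_size, hidden_dim=512, embed_dim=300):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens, state):
        # tokens: (B, L) ground-truth word ids (teacher forcing)
        out, _ = self.lstm(self.embed(tokens), state)
        return self.proj(out)                         # (B, L, vocab_size)

# Shape check: 2 clips of 16 frames, chunked into 2 groups of 8
enc = HierarchicalVideoEncoder()
dec = CaptionDecoder(vocab_size=10000)
feats = torch.randn(2, 16, 2048)
logits = dec(torch.zeros(2, 5, dtype=torch.long), enc(feats))
print(logits.shape)  # torch.Size([2, 5, 10000])
```

In this reading, the frame-level LSTM compresses each short run of frames into one vector and the chunk-level LSTM aggregates those vectors, so the decoder is conditioned on a multi-timescale summary of the video; the actual system may differ in how the two levels interact and in how the convolutional architecture is incorporated.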
