Abstract

As a fundamental problem in visual understanding, video captioning has attracted considerable attention from both the computer vision and natural language processing communities. Despite the recent emergence of video captioning methods, generating diverse and fine-grained video descriptions remains far from solved. To this end, this work makes the following contributions. First, a novel high-quality video captioning system featuring a hierarchical long short-term memory (LSTM) structure and a dual-stage loss is designed to translate videos into sentences. Second, we incorporate a convolutional architecture into our captioning system with the aim of generating diverse and fine-grained descriptions. Third, we propose a novel evaluation metric named LTMS to assess fine-grained captions. Experimental results on the benchmark datasets MSVD and MSR-VTT demonstrate the effectiveness of the proposed model, which achieves superior performance over state-of-the-art methods.
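The abstract gives no implementation details, but the hierarchical encoder-decoder design it names can be illustrated with a minimal PyTorch sketch. Everything below is an assumption, not the authors' implementation: the module names, the dimensions (feat_dim=2048 for CNN frame features, hidden_dim=512, chunk_len=8), and the chunking scheme are hypothetical, and the dual-stage loss is omitted because the abstract does not define it.

```python
import torch
import torch.nn as nn

class HierarchicalVideoEncoder(nn.Module):
    """Two-level LSTM encoder (hypothetical reading of the paper's
    'hierarchical LSTM structure'): a frame-level LSTM summarizes short
    chunks of CNN frame features, and a chunk-level LSTM aggregates the
    chunk summaries into a single video representation."""

    def __init__(self, feat_dim=2048, hidden_dim=512, chunk_len=8):
        super().__init__()
        self.chunk_len = chunk_len
        self.frame_lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.chunk_lstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)

    def forward(self, frames):
        # frames: (B, T, feat_dim), with T divisible by chunk_len
        B, T, D = frames.shape
        chunks = frames.view(B * (T // self.chunk_len), self.chunk_len, D)
        _, (h, _) = self.frame_lstm(chunks)           # h: (1, B*num_chunks, H)
        chunk_feats = h.squeeze(0).view(B, T // self.chunk_len, -1)
        _, state = self.chunk_lstm(chunk_feats)
        return state                                  # initial decoder state

class CaptionDecoder(nn.Module):
    """Word-level LSTM decoder conditioned on the video representation."""

    def __init__(self, vocab_size, hidden_dim=512, embed_dim=300):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens, state):
        # tokens: (B, L) ground-truth word ids (teacher forcing)
        out, _ = self.lstm(self.embed(tokens), state)
        return self.proj(out)                         # (B, L, vocab_size)

# Shape check: 2 clips of 16 frames, chunked into 2 groups of 8
enc = HierarchicalVideoEncoder()
dec = CaptionDecoder(vocab_size=10000)
feats = torch.randn(2, 16, 2048)
logits = dec(torch.zeros(2, 5, dtype=torch.long), enc(feats))
print(logits.shape)  # torch.Size([2, 5, 10000])
```

In this reading, the frame-level LSTM compresses each short run of frames into one vector and the chunk-level LSTM aggregates those vectors, so the decoder is conditioned on a multi-timescale summary of the video; the actual system may differ in how the two levels interact and in how the convolutional architecture is incorporated.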
