Abstract

Video captioning is challenging because it combines video understanding with text generation. Recent progress in video captioning has come mainly from methods for visual feature extraction and sequential learning. However, the syntactic structure and semantic consistency of generated captions have not been fully explored. In this work, we propose a novel multimodal attention-based framework with Part-of-Speech (POS) sequence guidance to generate more accurate video captions. In general, word sequence generation and POS sequence prediction are jointly modeled in a hierarchical manner within the framework. Specifically, different modalities, including visual, motion, object and syntactic features, are adaptively weighted and fused by the POS-guided attention mechanism when computing the probability distribution of each predicted word. Experimental results on two benchmark datasets, MSVD and MSR-VTT, demonstrate that the proposed method can not only fully exploit the information in the video and text content, but also focus on the decisive feature modality when generating a word of a given POS type. As a result, our approach improves video captioning performance and generates more idiomatic captions.
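As a rough illustration of the adaptive weighting described above, the following minimal sketch fuses per-modality feature vectors with softmax attention weights driven by a POS-dependent query. It is not the framework's actual implementation: the function names (`softmax`, `fuse_modalities`), the scaled dot-product scoring, the toy feature vectors, and the `verb_query` vector are all assumptions made for illustration only.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def fuse_modalities(features, pos_query):
    """Weight and fuse modality features using a POS-dependent query.

    features:  dict mapping modality name -> feature vector (all same length)
    pos_query: vector of the same length, standing in for a learned
               representation of the predicted POS tag (hypothetical)
    Returns (weights per modality, fused feature vector).
    """
    names = list(features)
    dim = len(pos_query)
    scale = math.sqrt(dim)
    # Assumed scoring function: scaled dot product between feature and query.
    scores = [dot(features[n], pos_query) / scale for n in names]
    weights = softmax(scores)
    # The fused vector is the attention-weighted sum of the modality features.
    fused = [
        sum(w * features[n][i] for w, n in zip(weights, names))
        for i in range(dim)
    ]
    return dict(zip(names, weights)), fused

# Toy example: a query that aligns most strongly with the motion feature,
# as one might expect when the next word is predicted to be a verb.
features = {
    "visual": [1.0, 0.0, 0.0],
    "motion": [0.0, 1.0, 0.0],
    "object": [0.0, 0.0, 1.0],
    "syntactic": [0.5, 0.5, 0.0],
}
verb_query = [0.0, 2.0, 0.0]
weights, fused = fuse_modalities(features, verb_query)
```

In this toy setting the motion modality receives the largest weight because its feature vector aligns best with the query. In a trained model the features, the query, and the scoring function would all be learned rather than hand-set.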
