Abstract

Video captioning aims to generate descriptive sentences about a video. The most critical problem is how to achieve accurate word prediction with a standardized and coherent syntactic structure, which requires the model to thoroughly understand the video content and precisely map it to the corresponding sentence components. Many existing methods fuse different video features into a single visual feature for generating sentences. However, they ignore the dataset-level prior information about words in the annotations (such as part-of-speech, POS), and they also ignore the association between sentence components and types of visual features. To solve these problems, we propose a POS-trends dynamic-aware model (PDA) that fully exploits the word prior information in the captions to predict POS tags, which in turn assist caption generation. We propose a POS feature extraction (PFE) module that uses different filters to extract different POS-trends features, predict POS tags, and fuse visual features. Furthermore, we propose a visual-dynamic-aware (VDA) module to dynamically adjust how words are mapped and to supplement the local features with visual information. The fused features provide directional visual information for generating correct words, and the predicted POS tags guide the decoding process toward a more standardized and coherent syntactic structure. Extensive experiments on MSVD, MSR-VTT and VATEX demonstrate that our method outperforms state-of-the-art methods in BLEU-4, ROUGE-L, METEOR and CIDEr. Code is available at: https://github.com/WangLanxiao/PDA-for-video-caption .
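As a rough illustration of the general idea of letting a predicted POS distribution steer which type of visual feature dominates the fused representation, the sketch below may help. It is not the PDA, PFE or VDA implementation from the repository above. The function names (`softmax`, `fuse_by_pos`), the three-way tag set, the pairing of nouns with appearance features and verbs with motion features, and all numeric values are assumptions made for this example only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_by_pos(pos_logits, features):
    """Weight each feature vector by the probability of its associated POS tag.

    Hypothetical helper: `pos_logits` maps a tag to an unnormalized score and
    `features` maps the same tag to the feature vector assumed to support it.
    Returns the tag probabilities and the probability-weighted sum of vectors.
    """
    tags = list(features)
    probs = softmax([pos_logits[t] for t in tags])
    dim = len(features[tags[0]])
    fused = [0.0] * dim
    for p, t in zip(probs, tags):
        for i, v in enumerate(features[t]):
            fused[i] += p * v
    return dict(zip(tags, probs)), fused


# Toy one-hot vectors standing in for real visual features (assumed pairing).
features = {
    "noun": [1.0, 0.0, 0.0],   # stand-in for object/appearance features
    "verb": [0.0, 1.0, 0.0],   # stand-in for motion features
    "other": [0.0, 0.0, 1.0],  # stand-in for global context features
}
# Made-up POS scores for the next word to be generated.
pos_logits = {"noun": 2.0, "verb": 0.5, "other": -1.0}

pos_probs, fused = fuse_by_pos(pos_logits, features)
predicted_tag = max(pos_probs, key=pos_probs.get)
```

With these toy inputs the noun score is the largest, so the appearance stand-in receives the largest weight in `fused`. A real model would learn both the POS predictor and the fusion jointly instead of using fixed vectors.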
