Abstract

Video captioning aims to describe the content of a video briefly, accurately, and fluently in natural language, and is an active research topic in multimedia processing. As a bridge between video and natural language, video captioning is a challenging task that requires a deep understanding of video content and effective use of diverse multimodal video information. Existing video captioning methods usually ignore the relative importance of different frames when aggregating frame-level video features and neglect the global semantic correlations between videos and texts when learning visual representations, which renders the learned representations less effective. To address these problems, we propose a novel framework, the Global Semantic Enhancement Network (GSEN), to generate high-quality captions for videos. Specifically, a feature aggregation module based on a lightweight attention mechanism is designed to aggregate frame-level video features, highlighting the features of informative frames in the video representation. In addition, a global semantic enhancement module is proposed to strengthen the semantic correlations between video and language representations so that semantically more accurate captions can be generated. Extensive qualitative and quantitative experiments on two public benchmark datasets, MSVD and MSR-VTT, demonstrate that the proposed GSEN outperforms state-of-the-art methods.
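To make the frame-aggregation idea concrete, the following is a minimal sketch of attention-weighted pooling of frame-level features, assuming PyTorch; the module name, feature dimensions, and single-layer scoring function are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class FrameAttentionPool(nn.Module):
    """Hypothetical sketch: aggregate frame-level features into one
    video-level feature via a lightweight attention score, so that
    informative frames receive larger weights in the aggregation."""

    def __init__(self, feat_dim: int):
        super().__init__()
        # A single linear layer produces one scalar relevance score per frame.
        self.score = nn.Linear(feat_dim, 1)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, num_frames, feat_dim)
        # Softmax over the frame axis turns scores into aggregation weights.
        weights = torch.softmax(self.score(frames), dim=1)  # (batch, num_frames, 1)
        # Weighted sum over frames yields the video representation.
        return (weights * frames).sum(dim=1)                # (batch, feat_dim)

# Usage: pool 26 frame features of dimension 2048 into one video vector.
pool = FrameAttentionPool(feat_dim=2048)
video_feat = pool(torch.randn(4, 26, 2048))
print(video_feat.shape)  # torch.Size([4, 2048])
```

In contrast to uniform mean pooling, this learned weighting lets the model down-weight redundant or uninformative frames, which is the motivation the abstract gives for attention-based aggregation.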
