Abstract
Dense video captioning is a challenging task whose goal is to localize and describe all events in an untrimmed video, taking into account both visual and textual information. Although existing methods have made progress, most of them still produce captions that miss details or are of inferior quality. Recent work has used object features to supply finer-grained information. However, because a video contains a large number of objects, the learned object representations are often noisy, which can interfere with generating correct captions. We also observe that real-world video-text data involve multiple granularity levels, such as objects/words and events/sentences. Therefore, we propose hierarchical video-text attention-based encoder-decoder networks for dense video captioning. The proposed method models the hierarchy in both video and text and exploits the most relevant visual and textual features when generating captions. Specifically, we design a hierarchical attention encoder for learning complex visual information: an object attention module that focuses on the most relevant objects and an event attention module that models long-range temporal context. A corresponding decoder translates the multi-level features into linguistic descriptions, i.e., a word attention module that exploits the most correlated textual features and a sentence attention module that leverages high-level semantic information. The proposed hierarchical attention mechanism achieves state-of-the-art performance on the ActivityNet Captions dataset.
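To make the described architecture concrete, below is a minimal PyTorch sketch of a hierarchical attention encoder-decoder of the kind the abstract outlines. It is not the paper's actual implementation: the module and parameter names (`AttentionBlock`, `HierarchicalEncoder`, `HierarchicalDecoder`, the feature dimension `D`, etc.) are hypothetical, and each level of the hierarchy is approximated with standard multi-head cross-attention.

```python
import torch
import torch.nn as nn


class AttentionBlock(nn.Module):
    """Generic multi-head cross-attention with residual connection and layer norm.

    Hypothetical building block; the paper's attention modules may differ.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # Queries attend over the context; residual keeps the original signal.
        out, _ = self.attn(query, context, context)
        return self.norm(query + out)


class HierarchicalEncoder(nn.Module):
    """Object attention (fine-grained) followed by event attention (temporal)."""

    def __init__(self, dim: int):
        super().__init__()
        self.object_attn = AttentionBlock(dim)  # frames attend to object features
        self.event_attn = AttentionBlock(dim)   # events attend across time (self-attention)

    def forward(self, frame_feats: torch.Tensor, object_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (B, T, D) frame-level features
        # object_feats: (B, N, D) noisy object-level features
        frames = self.object_attn(frame_feats, object_feats)  # select relevant objects
        events = self.event_attn(frames, frames)              # long-range temporal context
        return events


class HierarchicalDecoder(nn.Module):
    """Word attention over visual features plus sentence attention over history."""

    def __init__(self, dim: int, vocab_size: int):
        super().__init__()
        self.word_attn = AttentionBlock(dim)      # words attend to visual features
        self.sentence_attn = AttentionBlock(dim)  # words attend to prior sentence features
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, word_embeds, visual_feats, sentence_feats):
        h = self.word_attn(word_embeds, visual_feats)
        h = self.sentence_attn(h, sentence_feats)
        return self.out(h)  # (B, L, vocab_size) caption logits


# Toy usage with random tensors (shapes are illustrative only).
B, T, N, L, D, V = 2, 16, 10, 12, 256, 10000
enc = HierarchicalEncoder(D)
dec = HierarchicalDecoder(D, V)
events = enc(torch.randn(B, T, D), torch.randn(B, N, D))
logits = dec(torch.randn(B, L, D), events, torch.randn(B, 3, D))
print(logits.shape)  # torch.Size([2, 12, 10000])
```

The design choice to read from this sketch is the stacking: the encoder refines noisy object features into event-level features, and the decoder consumes those features at both the word and sentence levels, mirroring the objects/words and events/sentences granularity pairing.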