Abstract
Dense video captioning aims to localize and describe multiple events in untrimmed videos, a challenging task that has drawn increasing attention in computer vision. Although existing methods have achieved impressive performance, most of them focus only on local information of event segments or on very simple event-level context, overlooking the complexity of event-event relationships and the holistic scene. As a result, the coherence of captions within the same video can be degraded. In this article, we propose a novel event-centric hierarchical representation to alleviate this problem. We enhance the event-level representation by capturing rich relationships between events in terms of both temporal structure and semantic meaning. Then, a caption generator with late fusion is developed to generate surrounding-event-aware and topic-aware sentences, conditioned on the hierarchical representation of visual cues from the scene level, the event level, and the frame level. Furthermore, we propose a duplicate removal method, termed temporal-linguistic non-maximum suppression (TL-NMS), to distinguish redundancy in both the localization and captioning stages. Quantitative and qualitative evaluations on the ActivityNet Captions and YouCook2 datasets demonstrate that our method improves the quality of generated captions and achieves state-of-the-art performance on most metrics.
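To make the idea behind TL-NMS concrete, the sketch below shows a minimal greedy suppression loop that treats a proposal as redundant when it overlaps a higher-scoring kept proposal either temporally or linguistically. This is an illustrative assumption of how such a criterion could be combined, not the paper's actual implementation: the function names, thresholds, and the word-overlap caption similarity are all hypothetical placeholders (a real system would likely use learned sentence similarity).

```python
def temporal_iou(seg_a, seg_b):
    """Temporal IoU of two (start, end) segments in seconds."""
    inter = max(0.0, min(seg_a[1], seg_b[1]) - max(seg_a[0], seg_b[0]))
    union = (seg_a[1] - seg_a[0]) + (seg_b[1] - seg_b[0]) - inter
    return inter / union if union > 0 else 0.0

def caption_similarity(cap_a, cap_b):
    """Toy linguistic similarity: Jaccard overlap of word sets."""
    wa, wb = set(cap_a.lower().split()), set(cap_b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def tl_nms(proposals, iou_thresh=0.7, sim_thresh=0.8):
    """Greedy suppression over {segment, caption, score} proposals.

    A candidate is dropped if it overlaps a higher-scoring kept proposal
    temporally OR describes it with a near-duplicate caption.
    """
    kept = []
    for cand in sorted(proposals, key=lambda p: p["score"], reverse=True):
        redundant = any(
            temporal_iou(cand["segment"], k["segment"]) > iou_thresh
            or caption_similarity(cand["caption"], k["caption"]) > sim_thresh
            for k in kept
        )
        if not redundant:
            kept.append(cand)
    return kept

# Example with hypothetical captioned proposals
proposals = [
    {"segment": (0.0, 12.5), "caption": "a man slices vegetables", "score": 0.92},
    {"segment": (1.0, 13.0), "caption": "a man slices the vegetables", "score": 0.85},
    {"segment": (20.0, 35.0), "caption": "he stirs the pot on the stove", "score": 0.78},
]
print(tl_nms(proposals))
```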