Abstract

Dense video captioning aims to localize and describe multiple events in untrimmed videos, a challenging task that has recently drawn attention in computer vision. Although existing methods have achieved impressive performance, most of them focus only on local information of event segments or on very simple event-level context, overlooking the complexity of event-event relationships and the holistic scene. As a result, the coherence of captions within the same video can be damaged. In this article, we propose a novel event-centric hierarchical representation to alleviate this problem. We enhance the event-level representation by capturing rich relationships between events in terms of both temporal structure and semantic meaning. Then, a caption generator with late fusion is developed to generate surrounding-event-aware and topic-aware sentences, conditioned on the hierarchical representation of visual cues from the scene level, the event level, and the frame level. Furthermore, we propose a duplicate removal method, namely temporal-linguistic non-maximum suppression (TL-NMS), to distinguish redundancy in both the localization and captioning stages. Quantitative and qualitative evaluations on the ActivityNet Captions and YouCook2 datasets demonstrate that our method improves the quality of generated captions and achieves state-of-the-art performance on most metrics.
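
The abstract does not spell out how TL-NMS combines temporal and linguistic cues. As a rough illustration only, the sketch below shows one plausible greedy variant: a candidate event is suppressed when it either overlaps a higher-scored kept event temporally or its generated caption is too similar linguistically. The thresholds and the word-overlap similarity measure are illustrative assumptions, not the paper's formulation.

```python
# Hedged sketch of a TL-NMS-style duplicate filter (not the paper's exact
# formulation): greedy suppression on both temporal overlap and caption
# similarity. Thresholds and the Jaccard similarity are assumptions.

def temporal_iou(a, b):
    """IoU of two temporal segments a=(start, end), b=(start, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def caption_similarity(c1, c2):
    """Word-set Jaccard similarity -- a stand-in for a learned linguistic metric."""
    w1, w2 = set(c1.lower().split()), set(c2.lower().split())
    return len(w1 & w2) / len(w1 | w2) if (w1 | w2) else 0.0

def tl_nms(events, iou_thresh=0.7, sim_thresh=0.6):
    """events: list of dicts with keys 'segment', 'caption', 'score'.
    Returns the events kept after temporal-linguistic suppression."""
    kept = []
    for ev in sorted(events, key=lambda e: e["score"], reverse=True):
        redundant = any(
            temporal_iou(ev["segment"], k["segment"]) > iou_thresh
            or caption_similarity(ev["caption"], k["caption"]) > sim_thresh
            for k in kept
        )
        if not redundant:
            kept.append(ev)
    return kept
```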
