Abstract
Dense video captioning is a challenging task whose goal is to localize and describe all events in an untrimmed video, taking into account both visual and textual information. Although existing methods have made progress, most of them still produce captions that miss details or are of inferior quality. Recent work has used object features to supply finer-grained information. However, because a video contains a large number of objects, the learned object representations are often noisy, which can interfere with generating correct captions. We also observe that real-world video-text data involve multiple granularity levels, such as objects/words and events/sentences. Therefore, we propose hierarchical video-text attention-based encoder-decoder networks for dense video captioning. The proposed method models the hierarchy in both video and text and exploits the most relevant visual and textual features when generating captions. Specifically, we design a hierarchical attention encoder for learning complex visual information: an object attention module that focuses on the most relevant objects and an event attention module that models long-range temporal context. A corresponding decoder translates the multi-level features into linguistic descriptions, i.e., a word attention module that exploits the most correlated textual features and a sentence attention module that leverages high-level semantic information. The proposed hierarchical attention mechanism achieves state-of-the-art performance on the ActivityNet Captions dataset.
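To make the described architecture concrete, below is a minimal PyTorch sketch of a hierarchical attention encoder-decoder of the kind the abstract outlines. It is not the paper's actual implementation: the module and parameter names (`AttentionBlock`, `HierarchicalEncoder`, `HierarchicalDecoder`, the feature dimension `D`, etc.) are hypothetical, and each level of the hierarchy is approximated with standard multi-head cross-attention.

```python
import torch
import torch.nn as nn


class AttentionBlock(nn.Module):
    """Generic multi-head cross-attention with residual connection and layer norm.

    Hypothetical building block; the paper's attention modules may differ.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # Queries attend over the context; residual keeps the original signal.
        out, _ = self.attn(query, context, context)
        return self.norm(query + out)


class HierarchicalEncoder(nn.Module):
    """Object attention (fine-grained) followed by event attention (temporal)."""

    def __init__(self, dim: int):
        super().__init__()
        self.object_attn = AttentionBlock(dim)  # frames attend to object features
        self.event_attn = AttentionBlock(dim)   # events attend across time (self-attention)

    def forward(self, frame_feats: torch.Tensor, object_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (B, T, D) frame-level features
        # object_feats: (B, N, D) noisy object-level features
        frames = self.object_attn(frame_feats, object_feats)  # select relevant objects
        events = self.event_attn(frames, frames)              # long-range temporal context
        return events


class HierarchicalDecoder(nn.Module):
    """Word attention over visual features plus sentence attention over history."""

    def __init__(self, dim: int, vocab_size: int):
        super().__init__()
        self.word_attn = AttentionBlock(dim)      # words attend to visual features
        self.sentence_attn = AttentionBlock(dim)  # words attend to prior sentence features
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, word_embeds, visual_feats, sentence_feats):
        h = self.word_attn(word_embeds, visual_feats)
        h = self.sentence_attn(h, sentence_feats)
        return self.out(h)  # (B, L, vocab_size) caption logits


# Toy usage with random tensors (shapes are illustrative only).
B, T, N, L, D, V = 2, 16, 10, 12, 256, 10000
enc = HierarchicalEncoder(D)
dec = HierarchicalDecoder(D, V)
events = enc(torch.randn(B, T, D), torch.randn(B, N, D))
logits = dec(torch.randn(B, L, D), events, torch.randn(B, 3, D))
print(logits.shape)  # torch.Size([2, 12, 10000])
```

The design choice to read from this sketch is the stacking: the encoder refines noisy object features into event-level features, and the decoder consumes those features at both the word and sentence levels, mirroring the objects/words and events/sentences granularity pairing.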