Hierarchical Attention-Based Video Captioning Using Key Frames

Munusamy Hemalatha, P Karthik

doi:10.1007/978-981-16-6448-9_30

Abstract

Video captioning, used on social media platforms such as YouTube and Facebook, plays a major role in helping viewers understand a video even when the audio is unclear. In this work, we propose a key frame-based model for video captioning. Instead of using all frames in a video, only the key frames are used for video representation. The key frames are extracted by comparing frames using the structural similarity index (SSIM) to measure the difference between them, so that only informative frames are kept for captioning. We extract features of the key frames using a pre-trained convolutional neural network, and we also extract semantic features from the frames. The key frames are passed to an object detection algorithm to identify objects and extract their features. Hierarchical attention is applied to the key frame features, the semantic features, and the features of the detected objects, and the result is given as input to an LSTM to generate the caption for the video.

Keywords: Video captioning, SSIM, Language model, LSTM
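The SSIM-based key frame selection described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the SSIM window size, the similarity threshold, or the comparison strategy, so the single global window (standard SSIM is usually averaged over local windows), the threshold of 0.9, the choice to compare each frame against the last kept frame, and the names `global_ssim` and `select_key_frames` are all assumptions made for illustration. Frames are represented as flat lists of grayscale pixel values.

```python
from statistics import fmean


def global_ssim(a, b, dynamic_range=255.0):
    """SSIM computed over one global window (a simplification of windowed SSIM).

    `a` and `b` are equal-length flat lists of grayscale pixel values.
    """
    if len(a) != len(b) or not a:
        raise ValueError("frames must be non-empty and the same size")
    # Standard SSIM stabilizing constants, with K1 = 0.01 and K2 = 0.03.
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    mu_a, mu_b = fmean(a), fmean(b)
    var_a = fmean([(p - mu_a) ** 2 for p in a])
    var_b = fmean([(q - mu_b) ** 2 for q in b])
    cov = fmean([(p - mu_a) * (q - mu_b) for p, q in zip(a, b)])
    numerator = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    denominator = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return numerator / denominator


def select_key_frames(frames, threshold=0.9):
    """Return indices of frames that differ enough from the last kept frame.

    The first frame is always kept. The threshold is an assumed value,
    not one taken from the paper.
    """
    if not frames:
        return []
    kept = [0]
    for i in range(1, len(frames)):
        if global_ssim(frames[kept[-1]], frames[i]) < threshold:
            kept.append(i)
    return kept


# Tiny synthetic "video": two identical frames, then two reversed ones.
ramp = [0, 50, 100, 150]
reversed_ramp = [150, 100, 50, 0]
key_indices = select_key_frames([ramp, ramp, reversed_ramp, reversed_ramp])
```

In this toy input the second and fourth frames duplicate their predecessors, so only the first and third frames are selected. On real video, a windowed SSIM from an image processing library would be the more usual choice.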

Similar Papers

Self-Supervised Learning to Detect Key Frames in Videos
Xiang Yan ... Syed Zulqarnain Gilani
Sensors | VOL. 20
04 Dec 2020

Super-Resolution Reconstruction for Mixed Resolution Videos Using Key Frames and Adaptive Detail Warping
Yun-Jhen Chen ... Jin-Jang Leou
20 Nov 2013

Real-time and accurate object detection in compressed video by long short-term feature aggregation
Xinggang Wang ... Chang Huang
Computer Vision and Image Understanding | VOL. 206
05 Mar 2021

DevNet: A Deep Event Network for multimedia event detection and evidence recounting
Chuang Gan ... Dit-Yan Yeung
01 Jun 2015
