Abstract

Video captioning on social media platforms such as YouTube and Facebook helps viewers understand a video even when the audio is unclear. In this work, we propose a key frame-based model for video captioning. Instead of using all frames in a video, only the key frames are used for video representation. The key frames are extracted by comparing frames using the structural similarity index (SSIM) to measure the difference between them, so that only informative frames are retained for captioning. We extract features of the key frames using a pre-trained convolutional neural network, and we also extract semantic features from the frames. The key frames are passed to an object detection algorithm to identify objects and extract their features. Hierarchical attention is applied to the key frame features, the semantic features, and the features of the detected objects, and the result is given as input to an LSTM to generate the caption for the video.

Keywords: Video captioning, SSIM, Language model, LSTM
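The SSIM-based key frame selection described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It makes several assumptions not stated in the source: frames are flat lists of grayscale intensities in the range 0 to 255, SSIM is computed once over the whole frame rather than with the usual sliding window, each frame is compared against the most recently kept key frame, and the threshold of 0.8 is a hypothetical value. The function names `global_ssim` and `select_key_frames` are illustrative only.

```python
def global_ssim(x, y, dynamic_range=255.0):
    """Simplified SSIM computed over the whole frame as a single window.

    x and y are equal-length flat lists of grayscale pixel values.
    (Assumption: the standard SSIM averages this statistic over local windows.)
    """
    if len(x) != len(y) or not x:
        raise ValueError("frames must be non-empty and the same size")
    n = len(x)
    # Standard SSIM stabilising constants.
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


def select_key_frames(frames, threshold=0.8):
    """Return indices of frames that differ enough from the last kept key frame.

    A frame is kept when its SSIM to the previous key frame falls below
    `threshold` (hypothetical value; the source does not give one).
    """
    if not frames:
        return []
    kept = [0]  # always keep the first frame as a reference
    for i in range(1, len(frames)):
        if global_ssim(frames[kept[-1]], frames[i]) < threshold:
            kept.append(i)
    return kept


# Toy 4x4 "frames": two nearly identical dark frames, then two bright frames.
dark = [10] * 16
dark_noisy = [10] * 15 + [12]
bright = [200] * 16
key_indices = select_key_frames([dark, dark_noisy, bright, bright])
```

In this toy input, the near-duplicate dark frame and the repeated bright frame score high similarity against the current key frame and are skipped, so only the first frame of each visually distinct segment is kept. A practical pipeline would more likely decode frames with a video library and use a windowed SSIM implementation.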
