Abstract

Video captioning is the task of generating a natural language sentence that describes the content of an input video clip. This study proposes a deep neural network model for effective video captioning. In addition to visual features, the proposed model learns semantic features that describe the video content effectively. In our model, visual features of the input video are extracted using convolutional neural networks such as C3D and ResNet, while semantic features are obtained using recurrent neural networks such as LSTM. Our model also includes an attention-based caption generation network that produces correct natural language captions from the multimodal video feature sequences. Various experiments, conducted on two large benchmark datasets, Microsoft Video Description (MSVD) and Microsoft Research Video-to-Text (MSR-VTT), demonstrate the performance of the proposed model.
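
As a rough illustration of the visual branch described above, the sketch below shows how per-frame features can be extracted with a pretrained 2D CNN (ResNet) and summarized with an LSTM into a frame-level feature sequence. This is a minimal sketch, not the authors' code; the choice of ResNet-152, the 512-dimensional hidden size, and the tensor shapes are assumptions made for illustration.

```python
# Minimal sketch (assumed shapes and sizes): per-frame visual features from a
# pretrained ResNet, followed by an LSTM over the frame sequence.
import torch
import torch.nn as nn
from torchvision import models

class VisualFeatureExtractor(nn.Module):
    def __init__(self, hidden_size=512):
        super().__init__()
        resnet = models.resnet152(weights=models.ResNet152_Weights.DEFAULT)
        # Drop the classification head; keep the 2048-d pooled features.
        self.cnn = nn.Sequential(*list(resnet.children())[:-1])
        self.rnn = nn.LSTM(2048, hidden_size, batch_first=True)

    def forward(self, frames):                      # frames: (B, T, 3, 224, 224)
        b, t = frames.shape[:2]
        with torch.no_grad():                       # CNN used as a fixed extractor
            feats = self.cnn(frames.flatten(0, 1))  # (B*T, 2048, 1, 1)
        feats = feats.view(b, t, -1)                # (B, T, 2048)
        outputs, _ = self.rnn(feats)                # (B, T, hidden) feature sequence
        return outputs

frames = torch.randn(2, 16, 3, 224, 224)            # 2 clips, 16 sampled frames each
print(VisualFeatureExtractor()(frames).shape)       # torch.Size([2, 16, 512])
```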

Highlights

  • As video data increases, there has been a recent surge of interest in automatic video content analysis

  • Convolutional neural networks (CNNs) such as ResNet [4], VGG [5], and C3D [6] are typically selected as encoders in such frameworks, whereas recurrent neural networks (RNNs) such as long short-term memory (LSTM) [7] are chosen as decoders

  • This study proposes a deep neural network model, SeFLA (SEmantic Feature Learning and Attention-Based Caption Generation), for effective video captioning by utilizing both visual and semantic features that describe the video content


Summary

Introduction

There has been a recent surge of interest in automatic video content analysis. Convolutional neural networks (CNNs) such as ResNet [4], VGG [5], and C3D [6] are typically selected as encoders in such frameworks, whereas recurrent neural networks (RNNs) such as LSTM [7] are chosen as decoders. However, these previous models consider only the frame features of the video, without any particular focus. This study proposes a deep neural network model, SeFLA (SEmantic Feature Learning and Attention-Based Caption Generation), for effective video captioning that utilizes both visual and semantic features describing the video content. The proposed model adopts an attention mechanism that determines which semantic feature to focus on at every time step, so that correct captions can be generated effectively from the multimodal video features; a sketch of this idea follows below. Experiments are run on the Microsoft Video Description (MSVD) [14] and Microsoft Research Video-to-Text (MSR-VTT) [15] datasets, and the results are discussed.
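
The sketch below loosely illustrates such an attention-based decoder: at every decoding step it computes additive attention weights over a sequence of feature vectors and conditions the next-word prediction on the resulting context vector. It is a minimal sketch under assumed dimensions and layer choices, not the SeFLA implementation.

```python
# Minimal sketch (assumed sizes): LSTM decoder with additive attention over a
# sequence of visual/semantic feature vectors, trained with teacher forcing.
import torch
import torch.nn as nn

class AttentionCaptionDecoder(nn.Module):
    def __init__(self, vocab_size, feat_size=512, embed_size=256, hidden_size=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.attn_feat = nn.Linear(feat_size, hidden_size, bias=False)
        self.attn_hid = nn.Linear(hidden_size, hidden_size, bias=False)
        self.attn_score = nn.Linear(hidden_size, 1, bias=False)
        self.lstm = nn.LSTMCell(embed_size + feat_size, hidden_size)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, features, captions):           # features: (B, N, F), captions: (B, L)
        b = features.size(0)
        h = features.new_zeros(b, self.lstm.hidden_size)
        c = torch.zeros_like(h)
        logits = []
        for t in range(captions.size(1)):
            # Score each feature vector against the current decoder state.
            scores = self.attn_score(torch.tanh(
                self.attn_feat(features) + self.attn_hid(h).unsqueeze(1)))  # (B, N, 1)
            alpha = torch.softmax(scores, dim=1)      # attention weights over features
            context = (alpha * features).sum(dim=1)   # (B, F) attended context vector
            x = torch.cat([self.embed(captions[:, t]), context], dim=1)
            h, c = self.lstm(x, (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)             # (B, L, vocab)

decoder = AttentionCaptionDecoder(vocab_size=10000)
feats = torch.randn(2, 16, 512)                       # e.g. per-frame/semantic features
caps = torch.randint(0, 10000, (2, 12))               # token ids for teacher forcing
print(decoder(feats, caps).shape)                     # torch.Size([2, 12, 10000])
```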
