Abstract

Video captioning is a popular task that automatically generates a natural-language sentence to describe video content. Previous video captioning works mainly use the encoder–decoder framework and exploit special techniques such as attention mechanisms to improve the quality of the generated sentences. Moreover, most attention mechanisms focus on global features and spatial features, yet global features are usually fully connected features. Recurrent convolutional networks (RCNs) receive 3-dimensional feature maps as input at each time step, but the temporal structure of each channel, which provides temporal relation information for that channel, has been ignored. In this paper, a video captioning model based on channel soft attention and a semantic reconstructor is proposed, which considers the global information of each channel. In a video feature-map sequence, the same channel at every time step is generated by the same convolutional kernel. We selectively collect the features generated by each convolutional kernel and then input the weighted sum over each channel to the RCN at each time step to encode the video representation. Furthermore, a semantic reconstructor is proposed to rebuild semantic vectors during training to ensure the integrity of semantic information; it takes advantage of both the forward (semantic-to-sentence) and backward (sentence-to-semantic) flows. Experimental results on the popular MSVD and MSR-VTT datasets demonstrate the effectiveness and feasibility of our model.
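To make the channel soft attention concrete, the PyTorch module below is a minimal sketch of the mechanism as summarized above: each channel is produced by the same convolutional kernel at every time step, so attention weights are computed per channel over time steps and the per-channel weighted sum is fed to the recurrent step. The pooling choice, the additive scoring function, and all layer names and dimensions are our assumptions for illustration, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelSoftAttention(nn.Module):
    """Sketch of channel soft attention: for each channel, attend over the
    T time steps of a video feature-map sequence and return the weighted
    sum, which would serve as the RCN input at the current decoding step.
    Scoring form and pooling are illustrative assumptions."""

    def __init__(self, num_channels: int, hidden_size: int):
        super().__init__()
        # Hypothetical parameters scoring each (time step, channel) pair
        # from per-channel descriptors and the previous hidden state.
        self.w_f = nn.Linear(num_channels, num_channels, bias=False)
        self.w_h = nn.Linear(hidden_size, num_channels, bias=False)

    def forward(self, feats: torch.Tensor, hidden: torch.Tensor) -> torch.Tensor:
        # feats:  (B, T, C, H, W) CNN feature maps for T sampled frames.
        # hidden: (B, hidden_size) previous hidden state of the RCN/decoder.
        B, T, C, H, W = feats.shape
        # Summarize each channel at each time step by global average pooling.
        desc = feats.mean(dim=(3, 4))                              # (B, T, C)
        # Assumed additive scoring of every time step for every channel.
        scores = self.w_f(desc) + self.w_h(hidden).unsqueeze(1)    # (B, T, C)
        # Softmax over time: how much each time step's output of a given
        # convolutional kernel contributes.
        alpha = F.softmax(scores, dim=1)                           # (B, T, C)
        # Per-channel weighted sum over time, broadcast over spatial positions.
        attended = (alpha.unsqueeze(-1).unsqueeze(-1) * feats).sum(dim=1)
        return attended                                            # (B, C, H, W)
```

For example, with ResNet-style features of shape (B, T, 2048, 7, 7) and a decoder hidden state of size 512, the module returns one attended 3-dimensional feature map per decoding step, playing the role of the RCN input described in the abstract.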

Highlights

  • Video captioning is a popular and challenging task

  • A video captioning model based on channel soft attention and semantic reconstructor (CSA-SR) is proposed

  • Inspired by the successful application of soft attention in natural-language processing, we propose channel soft attention to exploit the temporal structure of 3-dimensional image feature maps

Introduction

Video captioning is a popular and challenging task that involves both computer vision and natural-language processing. Automatic video caption generation has many practical applications. It could help improve the quality of online video indexing and searching. As another example, in combination with speech-synthesis technology, describing videos in natural language could help the visually impaired understand video content. The most important part of video captioning is to extract a precise video representation.
