Abstract

Video description plays an important role in the field of intelligent imaging technology. Attention mechanisms are extensively applied in deep-learning-based video description models. Most existing models use a temporal-spatial attention mechanism to enhance accuracy: temporal attention captures the global features of a video, whereas spatial attention captures local features. Nevertheless, because each channel of the convolutional neural network (CNN) feature maps carries its own spatial semantic information, merely dividing the CNN features into regions and applying a spatial attention mechanism is insufficient. In this paper, we propose a temporal-spatial and channel attention mechanism that enables the model to exploit diverse video features and ensures consistency between visual features and sentence descriptions, improving model performance. To demonstrate the effectiveness of the attention mechanism, we also propose a video visualization model built on the video description model. Experimental results show that our model achieves good performance on the Microsoft Video Description (MSVD) dataset and a measurable improvement on the Microsoft Research-Video to Text (MSR-VTT) dataset.

Highlights

  • Video description is widely used in advanced intelligent technology, including smart city, smart transportation and smart home [1,2,3,4,5]

  • The multi-attention video description model we proposed in this paper is shown in Figure 3 and contains temporal, spatial, and channel attention mechanisms represented by T, S, and C, respectively

  • We proposed a video description model based on temporal-spatial and channel attention


Summary

Introduction

Video description is widely used in advanced intelligent technology, including smart city, smart transportation and smart home [1,2,3,4,5]. Attention-based image captioning models weight every region of each image before each word-prediction step, so that the feature used for each prediction differs. Building on this idea, Yao [13] proposed a video description model based on a temporal attention mechanism: the model weights the features of all video frames and sums them whenever predicting a word. Our multi-attention video description model introduces a channel attention mechanism on top of the traditional temporal and spatial attention mechanisms. This model couples visual features and sentence descriptions more tightly, increasing the accuracy of the model. In our video visualization model, we perform a visual analysis of the attention mechanism and intuitively validate the model's accuracy.
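The temporal-attention step described above can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's implementation: the function name `temporal_attention` is hypothetical, and the relevance score is simplified to a dot product between frame features and the decoder state, whereas models such as [13] typically use a small learned MLP.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a 1-D score vector
    e = np.exp(x - x.max())
    return e / e.sum()

def temporal_attention(frame_feats, hidden):
    """Weight per-frame features by their relevance to the current
    decoder state and return their weighted sum, producing one
    context vector per predicted word.

    frame_feats: (T, D) array of CNN features, one row per sampled frame.
    hidden:      (D,)   current decoder hidden state.
    """
    scores = frame_feats @ hidden      # (T,) relevance of each frame
    weights = softmax(scores)          # attention distribution over time
    context = weights @ frame_feats    # (D,) weighted sum of frame features
    return context, weights

# Toy usage: 5 frames with 8-dimensional features.
rng = np.random.default_rng(0)
feats = rng.standard_normal((5, 8))
h = rng.standard_normal(8)
ctx, w = temporal_attention(feats, h)
print(w.sum())  # the attention weights form a distribution (sum to 1)
```

Because the weights are recomputed from the decoder state at every step, each predicted word attends to a different mixture of frames, which is what makes the temporal features "global" in the sense used above.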

Attention Mechanism
Temporal Attention
Spatial Attention
Network Architecture
Attention Calculation
Channel Attention
Attention Visualization
Datasets and Evaluation Metrics
Experiment Setting
Analysis of Different Attention Combinations
Comparison with Methods in Other Papers
Visual Analysis and Validation
Conclusions