Abstract

Most recent work on generating a text description from a video is based on an encoder-decoder framework. In the encoder stage, different convolutional neural networks are used to extract features from the audio and visual modalities respectively; the extracted features are then fed into the decoder stage, where an LSTM generates the video caption. Current work follows two main lines of research. One asks whether video captions can be generated more accurately by adopting different multimodal fusion strategies; the other asks whether they can be generated more accurately by adding an attention mechanism. In this paper, we propose a fusion framework that combines these two lines of work into a new model. In the encoder stage, two multimodal fusion strategies, sharing weights and sharing memory, are applied so that the two kinds of features resonate with each other to generate the final feature outputs. In the decoder stage, an LSTM with an attention mechanism generates the video description. Our fusion model combining the two methods is validated on the Microsoft Research Video to Text (MSR-VTT) dataset.
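To make the described pipeline concrete, the following is a minimal sketch in PyTorch of a shared-weight multimodal fusion encoder followed by an LSTM decoder with attention. It assumes precomputed visual and audio feature sequences of equal length and dimension; the layer sizes, the element-wise fusion, and the additive attention formulation are illustrative assumptions, not details specified in the abstract (the sharing-memory variant is omitted here).

```python
# Hedged sketch only: layer sizes, fusion by element-wise addition, and the
# additive attention are assumptions, not the paper's exact formulation.
import torch
import torch.nn as nn

class SharedWeightFusion(nn.Module):
    """Project both modalities with one shared linear layer ("sharing weights")
    and fuse the projections by element-wise addition."""
    def __init__(self, feat_dim, hidden_dim):
        super().__init__()
        self.shared_proj = nn.Linear(feat_dim, hidden_dim)  # weights shared by both modalities

    def forward(self, visual_feats, audio_feats):
        # visual_feats, audio_feats: (batch, time, feat_dim)
        return torch.tanh(self.shared_proj(visual_feats) + self.shared_proj(audio_feats))

class AttentionDecoder(nn.Module):
    """LSTM decoder with additive attention over the fused feature sequence."""
    def __init__(self, hidden_dim, vocab_size, embed_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.attn = nn.Linear(hidden_dim * 2, 1)
        self.lstm = nn.LSTMCell(embed_dim + hidden_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, fused, captions):
        # fused: (batch, time, hidden_dim); captions: (batch, seq_len) token ids
        batch, t, h = fused.size()
        hx = fused.new_zeros(batch, h)
        cx = fused.new_zeros(batch, h)
        logits = []
        for step in range(captions.size(1)):
            # attention weights over the temporal axis of the fused features
            scores = self.attn(torch.cat([fused, hx.unsqueeze(1).expand(-1, t, -1)], dim=-1))
            context = (torch.softmax(scores, dim=1) * fused).sum(dim=1)
            hx, cx = self.lstm(torch.cat([self.embed(captions[:, step]), context], dim=-1), (hx, cx))
            logits.append(self.out(hx))
        return torch.stack(logits, dim=1)  # (batch, seq_len, vocab_size)

# Example usage with random stand-in features (dimensions are hypothetical):
fusion = SharedWeightFusion(feat_dim=2048, hidden_dim=512)
decoder = AttentionDecoder(hidden_dim=512, vocab_size=10000)
visual = torch.randn(4, 20, 2048)
audio = torch.randn(4, 20, 2048)
captions = torch.randint(0, 10000, (4, 12))
logits = decoder(fusion(visual, audio), captions)  # (4, 12, 10000)
```

The single shared projection is what "sharing weights" would look like in the simplest form; the decoder recomputes attention over the fused sequence at every generation step, which is the standard way an attention mechanism is attached to an LSTM captioner.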
