Abstract

Pairing video with natural language descriptions remains a challenging problem at the intersection of computer vision and machine translation. Inspired by image description, which uses an encoder–decoder model to reduce a visual scene to a single sentence, we propose a deep hierarchical attention network for video description. The proposed model uses a convolutional neural network (CNN) and a bidirectional LSTM network as encoders, while a hierarchical attention network serves as the decoder. Compared with previous encoder–decoder models for video description, the bidirectional LSTM network captures the temporal structure among video frames. Moreover, the hierarchical attention network has an advantage over a single-layer attention network in modeling global context. For a fair comparison with other methods, we evaluate the proposed architecture with different CNN structures and decoders. Experimental results on standard datasets show that our model outperforms state-of-the-art techniques.
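
The sketch below illustrates the kind of pipeline the abstract describes, assuming PyTorch. All module names (FrameEncoder, HierAttnDecoder), dimensions, and the specific two-level attention (frame-level attention followed by a second attention over the local and global contexts) are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch, assuming PyTorch; names and dimensions are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FrameEncoder(nn.Module):
    """Encodes per-frame CNN features with a bidirectional LSTM to capture
    temporal structure among video frames."""

    def __init__(self, feat_dim=2048, hidden_dim=512):
        super().__init__()
        self.bilstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True,
                              bidirectional=True)

    def forward(self, frame_feats):            # (batch, n_frames, feat_dim)
        outputs, _ = self.bilstm(frame_feats)  # (batch, n_frames, 2*hidden_dim)
        return outputs


class HierAttnDecoder(nn.Module):
    """Two-level attention: a frame-level attention over the encoder outputs,
    followed by a higher-level attention that weighs the attended (local)
    context against a global, mean-pooled video context."""

    def __init__(self, vocab_size, enc_dim=1024, hidden_dim=512, emb_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTMCell(emb_dim + enc_dim, hidden_dim)
        self.low_attn = nn.Linear(enc_dim + hidden_dim, 1)   # frame-level scores
        self.high_attn = nn.Linear(enc_dim + hidden_dim, 1)  # context-level scores
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, enc_outputs, captions):
        batch, n_frames, _ = enc_outputs.shape
        h = enc_outputs.new_zeros(batch, self.lstm.hidden_size)
        c = torch.zeros_like(h)
        global_ctx = enc_outputs.mean(dim=1)                 # global video context
        logits = []
        for t in range(captions.size(1)):
            # Frame-level attention over the encoder outputs.
            h_exp = h.unsqueeze(1).expand(-1, n_frames, -1)
            scores = self.low_attn(torch.cat([enc_outputs, h_exp], dim=-1))
            alpha = F.softmax(scores, dim=1)                 # (batch, n_frames, 1)
            local_ctx = (alpha * enc_outputs).sum(dim=1)
            # Higher-level attention over {local context, global context}.
            cands = torch.stack([local_ctx, global_ctx], dim=1)
            h_exp2 = h.unsqueeze(1).expand(-1, 2, -1)
            beta = F.softmax(self.high_attn(torch.cat([cands, h_exp2], dim=-1)), dim=1)
            context = (beta * cands).sum(dim=1)
            # One decoding step conditioned on the previous word and the context.
            word_emb = self.embed(captions[:, t])
            h, c = self.lstm(torch.cat([word_emb, context], dim=-1), (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)                    # (batch, T, vocab)


# Usage with random stand-in CNN features and teacher-forced caption tokens.
enc = FrameEncoder()
dec = HierAttnDecoder(vocab_size=10000)
feats = torch.randn(2, 30, 2048)           # CNN features for 30 frames
caps = torch.randint(0, 10000, (2, 12))    # caption token ids
logits = dec(enc(feats), caps)             # (2, 12, 10000)
```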
