Abstract

There has been emerging interest recently in three-dimensional (3D) convolutional neural networks (CNNs) as a powerful tool for encoding spatio-temporal representations in videos, by adding a third, temporal dimension to pre-existing 2D CNNs. In this chapter, we discuss the effectiveness of using 3D convolutions to capture important motion features in the context of video saliency prediction. The method filters spatio-temporal features across multiple adjacent frames. This cubic convolution can be applied efficiently to a dense sequence of frames, propagating information from previous frames into the current one and reflecting processing mechanisms of the human visual system for better saliency prediction. We extensively evaluate the model's performance against state-of-the-art video saliency models on both 2D and 360\(^\circ \) videos. The architecture can efficiently learn expressive spatio-temporal representations and produce high-quality video saliency maps on three large-scale 2D datasets: DHF1K, UCF-SPORTS and DAVIS. Investigations on 360\(^\circ \) datasets, including Salient360!, show how the approach can generalise.

Keywords: Visual attention · Video saliency · Deep learning · 3D CNN
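To make the idea of a cubic convolution concrete, the sketch below applies a single 3D kernel over a stack of frames, so each output value mixes information across space *and* across adjacent time steps. This is an illustrative toy, not the chapter's architecture: the kernel shown is a hypothetical temporal-difference filter chosen so that its response directly reflects frame-to-frame motion.

```python
import numpy as np

def conv3d_valid(volume, kernel):
    """Naive 'valid' 3D cross-correlation over a (T, H, W) volume.

    Each output element is a weighted sum over a small spatio-temporal
    cube, which is how a cubic kernel propagates information from
    neighbouring frames into the current one.
    """
    kt, kh, kw = kernel.shape
    T, H, W = volume.shape
    out = np.zeros((T - kt + 1, H - kh + 1, W - kw + 1))
    for t in range(out.shape[0]):
        for y in range(out.shape[1]):
            for x in range(out.shape[2]):
                out[t, y, x] = np.sum(
                    volume[t:t + kt, y:y + kh, x:x + kw] * kernel
                )
    return out

# Hypothetical temporal-difference kernel: responds only to change
# between the previous and next frame at the centre pixel.
kernel = np.zeros((3, 3, 3))
kernel[0, 1, 1] = -1.0
kernel[2, 1, 1] = 1.0

# Toy clip: 5 frames of 5x5 pixels whose intensity grows by 1 per frame.
frames = np.stack([np.full((5, 5), i, dtype=float) for i in range(5)])

response = conv3d_valid(frames, kernel)
print(response.shape)      # (3, 3, 3): 'valid' output of a 3x3x3 kernel
print(response[0, 0, 0])   # 2.0: intensity rises by 1 per frame, over a 2-frame gap
```

In a real saliency network this operation would run as a learned, multi-channel layer (e.g. `Conv3d` in a deep-learning framework), but the arithmetic per output element is exactly the weighted cube sum shown here.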
