Abstract

Prediction of visual saliency in images and video is needed for video understanding, search and retrieval, coding, watermarking, and other applications. Most prediction models rely only on "bottom-up" features. Nevertheless, the "top-down" component of human visual attention becomes dominant as observers explore a visual scene. Visual saliency, which is always a mix of bottom-up and top-down cues, can therefore be learned from previously seen data. In this paper, a model for predicting visual saliency in video based on deep convolutional neural networks (CNNs) is proposed. A deep CNN architecture is designed, and various input channels for it are studied: exploiting the known sensitivity of the human visual system to residual motion, pixel colour values are complemented with a residual motion map. The latter is the normalized energy of residual motion in video frames with respect to an estimated global affine motion model. Experiments show that the choice of input features for the deep CNN depends on the visual task: for highly dynamic content, the proposed model with residual motion is more efficient and yields good results even with a relatively shallow architecture.
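
To make the residual-motion channel concrete, the sketch below shows one plausible way such a map could be computed with OpenCV. The abstract does not specify the motion estimator; the Farneback optical flow, the RANSAC-based affine fit, and the helper name `residual_motion_map` are illustrative assumptions, not the authors' implementation.

```python
import cv2
import numpy as np

def residual_motion_map(prev_gray, curr_gray):
    """Normalized energy of residual motion w.r.t. a global affine model.

    Illustrative sketch only; the paper's exact estimator is not given.
    Inputs are two consecutive 8-bit grayscale frames of equal size.
    """
    # Dense optical flow as the total per-pixel motion field.
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)

    h, w = prev_gray.shape
    ys, xs = np.mgrid[0:h, 0:w]
    pts = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(np.float32)
    dst = pts + flow.reshape(-1, 2)

    # Fit a global 2D affine model (camera motion) to a subsample of the flow.
    step = 16
    A, _ = cv2.estimateAffine2D(pts[::step], dst[::step], method=cv2.RANSAC)

    # Per-pixel motion predicted by the global affine model.
    global_flow = (pts @ A[:, :2].T + A[:, 2]) - pts

    # Residual motion = observed flow minus globally predicted motion;
    # its squared magnitude is the residual motion energy.
    residual = flow.reshape(-1, 2) - global_flow
    energy = (residual ** 2).sum(axis=1).reshape(h, w)

    # Normalize to [0, 1] so the map can be stacked with colour channels.
    return energy / (energy.max() + 1e-8)
```

The resulting map could then be stacked with the RGB values as a fourth input channel of the CNN, matching the channel study described above.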
