Abstract

Estimating the focus of attention of a person looking at an image or a video is a crucial step that can enhance many vision-based inference mechanisms: image segmentation and annotation, video captioning, and autonomous driving are some examples. The early stages of attentive behavior are typically bottom-up; reproducing this mechanism means finding the saliency embodied in the images, i.e. which parts of an image pop out of the visual scene. This process has been studied for decades in neuroscience and through computational models that reproduce the human cortical process. In the last few years, early models have been replaced by deep learning architectures, which outperform any earlier approach on public datasets. In this paper, we discuss why convolutional neural networks (CNNs) are so accurate in saliency prediction. We present our deep learning architectures, which combine bottom-up cues and higher-level semantics, and incorporate the concept of time in the attentional process through LSTM recurrent architectures. Finally, we present a video-specific architecture based on the C3D network, which extracts spatio-temporal features by means of 3D convolutions to model task-driven attentive behaviors. The merit of this work is to show that these deep networks are not mere brute-force methods tuned on massive amounts of data, but well-defined architectures that closely recall the early saliency models, improved with the semantics learned from human ground truth.
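
To make these architectural ideas concrete, the following is a minimal PyTorch sketch, not the code of the models discussed here: a module that merges feature maps taken at several depths of a backbone CNN into a single saliency map, in the spirit of the multi-level combination mentioned above, and a C3D-like stem whose 3D convolutions operate jointly over space and time. The module names (MultiLevelFusion, SpatioTemporalStem), channel sizes, and pooling choices are illustrative assumptions, not the published configurations.

```python
# Minimal sketch only; module names, channel sizes and pooling are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiLevelFusion(nn.Module):
    """Combine low-, mid- and high-level CNN feature maps into one saliency map."""

    def __init__(self, channels=(128, 256, 512), fused=64):
        super().__init__()
        # 1x1 convolutions project each feature level to a common channel size.
        self.project = nn.ModuleList([nn.Conv2d(c, fused, kernel_size=1) for c in channels])
        # A final convolution turns the concatenated levels into a single-channel map.
        self.readout = nn.Conv2d(fused * len(channels), 1, kernel_size=3, padding=1)

    def forward(self, features):
        # `features` is a list of maps taken at increasing depth of a backbone CNN;
        # deeper maps are smaller, so everything is upsampled to the first map's size.
        target = features[0].shape[-2:]
        projected = [
            F.interpolate(proj(f), size=target, mode="bilinear", align_corners=False)
            for proj, f in zip(self.project, features)
        ]
        return torch.sigmoid(self.readout(torch.cat(projected, dim=1)))


class SpatioTemporalStem(nn.Module):
    """A C3D-like stem: 3D convolutions over a short clip yield spatio-temporal features."""

    def __init__(self, in_channels=3, out_channels=64):
        super().__init__()
        self.conv = nn.Conv3d(in_channels, out_channels, kernel_size=3, padding=1)
        self.pool = nn.MaxPool3d(kernel_size=(1, 2, 2))  # pool only spatially at first

    def forward(self, clip):  # clip: (batch, channels, frames, height, width)
        return self.pool(F.relu(self.conv(clip)))


if __name__ == "__main__":
    # Fake multi-level features from a hypothetical VGG-like backbone.
    feats = [torch.randn(1, 128, 60, 80),
             torch.randn(1, 256, 30, 40),
             torch.randn(1, 512, 15, 20)]
    print(MultiLevelFusion()(feats).shape)                              # (1, 1, 60, 80)
    print(SpatioTemporalStem()(torch.randn(1, 3, 16, 112, 112)).shape)  # (1, 64, 16, 56, 56)
```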

Highlights

  • When humans look around the world, observing an image or watching a video sequence, attentive mechanisms drive their gazes towards salient regions

  • We present an overview of different solutions that we have developed for saliency prediction on images and videos with Deep Learning (DL), which represent the state of the art on publicly available benchmarks

  • We report the results of the Multi-Level Network (ML-Net) model, which was originally proposed for image saliency and has been trained from scratch on the DR(eye)VE dataset

Summary

Introduction

When humans look around the world, observing an image or watching a video sequence, attentive mechanisms drive their gazes towards salient regions. The control of attention combines stimuli processed in different cortical areas to mix spatial localization and recognition tasks, integrating data-driven pop-outs and learned semantics. It has a temporal evolution, since mechanisms such as the inhibition of return and the control of eye movements allow humans to refine attention over time. When watching a video sequence, instead, static visual features have lower importance while motion gains a crucial role, motivating the need for different solutions for static images and videos. In both scenarios, computational models capable of identifying salient regions can enhance many vision-based inference mechanisms, ranging from image captioning [11] to video compression [13]. A solution for video saliency prediction will be discussed and analyzed in the case of driver attention estimation.

Related Work
Saliency prediction in video
Saliency Prediction with Deep Learning Architectures
Incorporating low-level and high-level cues in a Multi-Level Network
Saliency map refinement via a convolutional recurrent architecture
Estimating task-driven saliency in videos
Conclusions