Abstract

Recent research progress on the topic of human visual attention allocation in scene perception and its simulation is based mainly on studies with static images. However, natural vision requires us to extract visual information that constantly changes due to egocentric movements or dynamics of the world. It is unclear to what extent spatio-temporal regularity, an inherent regularity in dynamic vision, affects human gaze distribution and saliency computation in visual attention models. In this free-viewing eye-tracking study we manipulated the spatio-temporal regularity of traffic videos by presenting them in normal video sequence, reversed video sequence, normal frame sequence, and randomised frame sequence. The recorded human gaze allocation was then used as the ‘ground truth’ to examine the predictive ability of a number of state-of-the-art visual attention models. The analysis revealed high inter-observer agreement across individual human observers, but all the tested attention models performed significantly worse than humans. The inferior predictability of the models was evident from indistinguishable gaze prediction irrespective of stimuli presentation sequence, and weak central fixation bias. Our findings suggest that a realistic visual attention model for the processing of dynamic scenes should incorporate human visual sensitivity with spatio-temporal regularity and central fixation bias.

Highlights

  • The amount of available visual information in our surroundings is often beyond our brain’s processing capability

  • Written informed consent was obtained from each participant prior to the study, and all procedures complied with the British Psychological Society Code of Ethics and Conduct

  • We selected one participant’s fixation allocations as actual fixations that we wanted to predict, and compared them with the averaged fixation allocations from the remaining participants. By repeating this procedure for all participants and averaging the resulting similarity scores, we obtained a measure for the variability within all human gaze patterns that can serve as an upper bound to the performance of a given computational model
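The leave-one-out procedure described in the last highlight can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `pearson` and `inter_observer_agreement` are hypothetical, each participant's fixations are assumed to be already converted into a flattened fixation-density map of equal length, and Pearson correlation is assumed as the similarity score because the text above does not name the metric used.

```python
from math import sqrt
from statistics import mean


def pearson(a, b):
    """Pearson correlation between two equal-length sequences (assumed metric)."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a flat map carries no spatial information
    return cov / sqrt(var_a * var_b)


def inter_observer_agreement(fixation_maps, similarity=pearson):
    """Leave-one-out agreement across participants.

    fixation_maps: list of flattened fixation-density maps, one per
    participant (hypothetical input format).
    """
    if len(fixation_maps) < 2:
        raise ValueError("need at least two participants")
    scores = []
    for i, held_out in enumerate(fixation_maps):
        # Maps from every participant except the held-out one.
        others = [m for j, m in enumerate(fixation_maps) if j != i]
        # Average the remaining participants' maps cell by cell.
        averaged = [mean(cells) for cells in zip(*others)]
        scores.append(similarity(held_out, averaged))
    # The mean score serves as an upper bound for model performance.
    return mean(scores)
```

A model's saliency map can then be scored against the human maps with the same `similarity` function, and the result compared with the value returned here.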


Summary

Introduction

The amount of available visual information in our surroundings is often beyond our brain’s processing capability. To interact effectively with our natural and social world, we selectively gaze at and process a limited number of local scene regions or visual items that are informative or interesting to us. As our gaze allocation is a sensitive index of attention, motivation and preference (Henderson, 2007), the fixated regions tend to have a distinct subjective perceptual quality that makes them stand out from their neighbours, and the choice of these salient targets reflects our internal representation of the external world. The central research question in this active visual exploration process is how we choose the fixated regions in the scene. Recent studies have further proposed that fixation selection is uniquely driven by learned associations between stimuli and rewards (Anderson, 2013), and is influenced by aspects of innate human bias, such as the tendency to fixate human or animal faces and bodies in the scene and to look more often at the central part of the scene (Tatler et al., 2011).

