Abstract

Bayesian observer models have seen widespread success in predicting fixation locations in visual search tasks with artificial stimuli, such as Gabors in 1/f noise (Najemnik & Geisler, 2005), but comparatively little success in predicting fixation locations during search for natural categorical targets in real-world scenes. Critically, previous approaches have neither accounted for the effects of foveation nor implemented decision rules to select a sequence of fixations. Here we present a Bayesian model of fixation selection in visual search tasks using natural images. The model uses two known sources of information to select fixations: scene context and the spatial distribution of target-like features (the target-relevant feature distribution). Scene context functions as a prior over potential target locations, while the target-relevant feature distribution is used to compute the likelihood function. In line with previous approaches (Torralba, Oliva, Castelhano, & Henderson, 2006; Ehinger, Hidalgo-Sotelo, Torralba, & Oliva, 2009), we use GIST features to characterize scene context and Histograms of Oriented Gradients (Dalal & Triggs, 2005) to characterize target-relevant features. In addition, we account for the effects of foveation and combine these information sources within a sequential Bayesian updating framework (similar to Najemnik & Geisler, 2005). Using scene context or the target-relevant feature distribution alone, the model performs quite well (greater than 90% localization performance and greater than 95% classification performance, respectively). To compare the model's fixation selections with human behavior, we tested human observers on a pedestrian search task in natural images. Before the search task, visibility maps were measured for each human observer; these maps were used to degrade the target-relevant feature information in the model simulations, replicating the effects of foveation. We compare the model's selected fixations with those of human observers and discuss the implications for what information humans should use, and how they should combine it, when selecting fixations.

Meeting abstract presented at VSS 2018
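To make the pipeline concrete, the following is a minimal sketch (in Python/NumPy, not the authors' code) of the sequential Bayesian updating step described above: a scene-context prior over candidate target locations is combined with foveated, Gaussian-noise-corrupted feature evidence, and the next fixation is chosen at the posterior maximum. The grid size, the exponential visibility falloff, the d' values, and the maximum-a-posteriori (MAP) fixation rule are all illustrative assumptions; in the actual model the prior comes from GIST features, the evidence from HOG template responses, and the reliability falloff from each observer's measured visibility maps.

import numpy as np

rng = np.random.default_rng(0)
H, W = 16, 16                                  # coarse grid of candidate target locations

# Scene-context prior (stand-in for the GIST-based context model): a smooth
# bias toward the lower half of the image, normalized to sum to 1.
ys = np.arange(H)[:, None]
prior = np.exp(-0.5 * ((ys - 11.0) / 4.0) ** 2) * np.ones((H, W))
prior /= prior.sum()

true_loc = (12, 5)                             # ground-truth target cell (simulation only)

def visibility(fix, sigma=5.0):
    # Stand-in visibility map: evidence reliability (d') falls off with
    # eccentricity from the current fixation, as measured visibility maps do.
    yy, xx = np.mgrid[0:H, 0:W]
    ecc = np.hypot(yy - fix[0], xx - fix[1])
    return 3.0 * np.exp(-ecc / sigma)

log_post = np.log(prior)                       # start from the context prior
fix = (H // 2, W // 2)                         # first fixation at the image center
for t in range(6):
    dprime = visibility(fix)
    # Noisy target-relevant feature responses (stand-in for HOG template
    # responses): mean d' at the true target location, 0 elsewhere, unit variance.
    signal = np.zeros((H, W))
    signal[true_loc] = 1.0
    obs = dprime * signal + rng.standard_normal((H, W))
    # Gaussian log-likelihood ratio ("target here" vs. "no target here"),
    # accumulated across fixations: sequential Bayesian updating.
    log_post += dprime * obs - 0.5 * dprime ** 2
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    fix = tuple(int(v) for v in np.unravel_index(post.argmax(), post.shape))
    print(f"fixation {t + 1}: {fix}, posterior at target = {post[true_loc]:.3f}")

Note that the MAP rule here simply fixates the currently most probable location; Najemnik and Geisler's ideal searcher instead selects the fixation expected to maximize the probability of correctly localizing the target, which generally yields a different fixation sequence.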
