Abstract

We develop a visuomotor model that implements visual search as a focal accuracy-seeking policy, with the target's position and category drawn independently from a common generative process. Consistent with the anatomical separation between the ventral and dorsal pathways, the model is composed of two pathways that respectively infer what to see and where to look. The “What” network is a classical deep learning classifier that only processes a small region around the center of fixation, providing a “foveal” accuracy. In contrast, the “Where” network processes the full visual field in a biomimetic fashion, using a log-polar retinotopic encoding that is preserved up to the action-selection level. In our model, the foveal accuracy is used as a monitoring signal to train the “Where” network, much as in the actor/critic framework. After training, the “Where” network provides an “accuracy map” that serves to guide the eye toward peripheral objects. Finally, comparing the two networks’ accuracies amounts to either selecting a saccade or keeping the eye focused at the center to identify the target. We test this setup on a simple task: finding a digit in a large, cluttered image. Our simulation results demonstrate the effectiveness of this approach, increasing by an order of magnitude the radius of the visual field within which the agent can detect and recognize a target, either through a single saccade or through several. Importantly, our log-polar treatment of the visual information exploits the strong compression performed at the sensory level, providing a way to implement visual search in a sublinear fashion, in contrast with mainstream computer vision.
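The saccade-or-classify decision described above lends itself to a compact control loop. The following is a minimal sketch, not the authors' implementation: `what_net` and `where_net` are hypothetical callables standing in for the two trained pathways, and the interface by which `where_net` returns candidate fixation offsets alongside its accuracy map is an assumption made here for illustration.

```python
import numpy as np

def crop_fovea(image, fixation, radius=14):
    """Extract the small foveal patch around the current fixation point."""
    y, x = fixation
    return image[y - radius:y + radius, x - radius:x + radius]

def saccade_policy(image, fixation, what_net, where_net, max_saccades=5):
    """What/Where decision loop (hypothetical interfaces).

    what_net(patch)            -> posterior over digit classes for the fovea
    where_net(image, fixation) -> (predicted accuracy map over candidate
                                   fixations, matching array of offsets)
    """
    for _ in range(max_saccades):
        posterior = what_net(crop_fovea(image, fixation))  # "What" pathway
        acc_map, offsets = where_net(image, fixation)      # "Where" pathway
        if posterior.max() >= acc_map.max():
            # Foveal accuracy beats every peripheral prediction: classify here.
            return int(posterior.argmax()), fixation
        # Otherwise saccade toward the location with highest predicted accuracy.
        dy, dx = offsets[np.unravel_index(acc_map.argmax(), acc_map.shape)]
        fixation = (fixation[0] + dy, fixation[1] + dx)
    return int(posterior.argmax()), fixation
```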

Highlights

  • Problem statement: the field of computer vision was recently recast by the outstanding capability of convolution-based deep neural networks to capture the semantic content of images and photographs

  • We use the MNIST data set of handwritten digits introduced by LeCun et al. (1998): samples are drawn from a set of 60,000 grayscale, 28 × 28 pixel images (see the sketch after this list)

  • The background texture is designed to match the statistics of natural images
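For concreteness, constructing a search display from the highlighted data set might look like the sketch below. It assumes torchvision's MNIST loader; the 128 × 128 display size and the `make_display` helper are illustrative choices, not taken from the text.

```python
import numpy as np
from torchvision.datasets import MNIST

def make_display(digit, size=128, rng=np.random.default_rng()):
    """Embed a 28x28 MNIST digit at a random location of a larger blank
    display (the background texture is added separately)."""
    display = np.zeros((size, size), dtype=np.float32)
    y, x = rng.integers(0, size - 28, size=2)
    display[y:y + 28, x:x + 28] = digit / 255.0
    return display, (y + 14, x + 14)  # display and target-center coordinates

mnist = MNIST(root="./data", train=True, download=True)
img, label = mnist[0]                 # PIL image and its class label
display, target_pos = make_display(np.array(img))
```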



Introduction

Problem statement. The field of computer vision was recently recast by the outstanding capability of convolution-based deep neural networks to capture the semantic content of images and photographs. To pose the visual search task, target digits are embedded in a synthetic background texture. We chose an isotropic setting where the texture is characterized by only two parameters: one controls the median spatial frequency of the noise, the other the bandwidth around that central frequency. This amounts to band-pass filtering a random white-noise image. The central spatial frequency is set at 0.1 pixel⁻¹ to match that of the original digits; this particular choice occasionally generates “phantom” digit shapes in the background. The resulting images are rectified to have a normalized contrast.
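The texture synthesis just described can be sketched directly: band-pass filter white noise around a median spatial frequency of 0.1 cycles/pixel. The log-Gaussian envelope below is an assumption about the filter's exact shape; the two parameters mirror the ones named in the text.

```python
import numpy as np

def bandpass_texture(size=128, sf_0=0.1, b_sf=0.5, rng=np.random.default_rng()):
    """Isotropic band-pass noise: white noise filtered around a median
    spatial frequency sf_0 (cycles/pixel) with log-Gaussian bandwidth b_sf."""
    fx, fy = np.meshgrid(np.fft.fftfreq(size), np.fft.fftfreq(size),
                         indexing="ij")
    f = np.sqrt(fx**2 + fy**2)               # radial spatial frequency
    f[0, 0] = 1e-12                          # avoid log(0) at the DC bin
    envelope = np.exp(-(np.log(f / sf_0) ** 2) / (2 * b_sf**2))
    envelope[0, 0] = 0.0                     # remove the mean (DC) component
    phases = np.exp(2j * np.pi * rng.random((size, size)))
    texture = np.real(np.fft.ifft2(envelope * phases))
    return texture / np.abs(texture).max()   # rectify to normalized contrast
```

With sf_0 = 0.1 the texture's spectral content overlaps that of the digits, which is why “phantom” digit shapes occasionally appear in the background.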

