Abstract

Recently, Zhang et al. (Nature Communications, 9(1), 3730, 2018) proposed an interesting model of attention guidance that uses visual features learnt by convolutional neural networks (CNNs) trained for object classification. I adapted this model for search experiments, with accuracy as the measure of performance. Simulations of our previously published feature and conjunction search experiments revealed that the CNN-based search model proposed by Zhang et al. considerably underestimates human attention guidance by simple visual features. Using target-distractor differences instead of target features for attention guidance, or computing the attention map at lower layers of the network, could improve performance. Still, the model fails to reproduce the qualitative regularities of human visual search. The most likely explanation is that standard CNNs trained on image classification have not learnt the medium- or high-level features required for human-like attention guidance.
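The two model variants mentioned above can be sketched as follows. This is a minimal illustration, not the published implementation: the feature arrays are random stand-ins for CNN layer activations, and the function name `attention_map` is hypothetical. The idea is that an attention map is obtained by projecting the search image's feature map onto a spatially pooled template vector, built either from target features alone or from target-distractor feature differences.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for CNN activations (channels x height x width);
# in the real model these would come from a layer of a classification CNN.
C, H, W = 64, 16, 16
search_feats = rng.random((C, H, W))      # features of the search display
target_feats = rng.random((C, 4, 4))      # features of the target template
distractor_feats = rng.random((C, 4, 4))  # features of a distractor template

def attention_map(img_feats, template_feats):
    """Project image features onto a spatially pooled template vector."""
    w = template_feats.mean(axis=(1, 2))            # C-dim template vector
    w = w / (np.linalg.norm(w) + 1e-8)              # normalise the template
    return np.tensordot(w, img_feats, axes=(0, 0))  # H x W attention map

# Guidance by target features alone (as in the original model) ...
amap_target = attention_map(search_feats, target_feats)
# ... versus guidance by target-distractor feature differences.
amap_diff = attention_map(search_feats, target_feats - distractor_feats)

# The peak of the map is the predicted first locus of attention.
peak = np.unravel_index(np.argmax(amap_target), amap_target.shape)
print(amap_target.shape, peak)
```

Computing the map at a lower layer would simply mean taking `search_feats` and the templates from an earlier stage of the network, where features are simpler and more local.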
