Training for object recognition with increasing spatial frequency: A comparison of deep learning with human vision.

Lev Kiar Avberšek, Astrid Zeman, Hans Op de Beeck

doi:10.1167/jov.21.10.14

Open Access

https://doi.org/10.1167/jov.21.10.14

Journal: Journal of Vision
Publication Date: Sep 17, 2021
Citations: 12
License type: CC BY 4.0

Affiliation: KU Leuven, University of Ljubljana

Abstract

The ontogenetic development of human vision and the real-time neural processing of visual input exhibit a striking similarity—a sensitivity toward spatial frequencies that progresses in a coarse-to-fine manner. During early human development, sensitivity for higher spatial frequencies increases with age. In adulthood, when humans receive new visual input, low spatial frequencies are typically processed first before subsequent processing of higher spatial frequencies. We investigated to what extent this coarse-to-fine progression might impact visual representations in artificial vision and compared this to adult human representations. We simulated the coarse-to-fine progression of image processing in deep convolutional neural networks (CNNs) by gradually increasing spatial frequency information during training. We compared CNN performance after standard and coarse-to-fine training with a wide range of datasets from behavioral and neuroimaging experiments. In contrast to humans, CNNs that are trained using the standard protocol are very insensitive to low spatial frequency information, showing very poor performance in being able to classify such object images. By training CNNs using our coarse-to-fine method, we improved the classification accuracy of CNNs from 0% to 32% on low-pass-filtered images taken from the ImageNet dataset. The coarse-to-fine training also made the CNNs more sensitive to low spatial frequencies in hybrid images with conflicting information in different frequency bands. When comparing differently trained networks on images containing full spatial frequency information, we saw no representational differences. Overall, this integration of computational, neural, and behavioral findings shows the relevance of the exposure to and processing of inputs with variation in spatial frequency content for some aspects of high-level object representations.
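
The coarse-to-fine idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the ideal (hard-edged) Fourier filter, the linear schedule, and the names `low_pass`, `cutoff_for_epoch`, and `hybrid`, together with the default cutoff values, are all assumptions chosen for illustration. The abstract does not specify the filter type or the schedule actually used.

```python
import numpy as np


def low_pass(image, cutoff):
    """Keep only spatial frequencies at or below `cutoff` (cycles per image).

    Assumption: an ideal circular mask in the Fourier domain; a Gaussian
    filter would be an equally plausible choice.
    """
    h, w = image.shape
    fy = np.fft.fftfreq(h) * h  # vertical frequency, cycles per image
    fx = np.fft.fftfreq(w) * w  # horizontal frequency, cycles per image
    radius = np.sqrt(fy[:, None] ** 2 + fx[None, :] ** 2)
    mask = radius <= cutoff
    return np.real(np.fft.ifft2(np.fft.fft2(image) * mask))


def cutoff_for_epoch(epoch, total_epochs, min_cutoff=4.0, max_cutoff=112.0):
    """Hypothetical linear schedule: the cutoff grows from coarse to fine."""
    frac = min(max(epoch / max(total_epochs - 1, 1), 0.0), 1.0)
    return min_cutoff + frac * (max_cutoff - min_cutoff)


def hybrid(low_source, high_source, cutoff):
    """Combine the low frequencies of one image with the high frequencies
    of another, giving conflicting information in different bands."""
    high_part = high_source - low_pass(high_source, cutoff)
    return low_pass(low_source, cutoff) + high_part


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.random((64, 64))  # stand-in for a grayscale training image
    for epoch in (0, 5, 9):
        c = cutoff_for_epoch(epoch, total_epochs=10)
        blurred = low_pass(img, c)
        print(epoch, round(c, 1), float(np.abs(img - blurred).mean()))
```

In a training loop, each batch would be passed through `low_pass` with the cutoff for the current epoch before being fed to the network, so that early epochs see only coarse structure and later epochs see progressively finer detail.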

Highlights

The role of spatial frequency has been extensively researched in the development of human vision, with much effort being directed toward visual acuity and contrast sensitivity (Banks & Salapatek, 1978; Benedek, Benedek, Kéri, & Janáky, 2003; Ellemberg, Lewis, Liu, & Maurer, 1999; Leat, Yadav, & Irving, 2009; Mayer & Dobson, 1982; Norcia & Tyler, 1985; Norcia, Tyler, & Hamer, 1990; Peterzell, Werner, & Kaplan, 1995; Stiers, Vanderkelen, & Vandenbussche, 2003)
Accuracy increased with images containing higher spatial frequency information: 21% for spatial frequency level (SFL) 2, 36% for SFL 3, and 48% for SFL 4
We again observed a very rapid increase in performance of MobileNet after the 100th epoch for images containing higher SF information; yet, there was no drop in performance for images containing the lowest spatial frequency information, which instead converged further

Summary

Introduction

The role of spatial frequency has been extensively researched in the development of human vision, with much effort being directed toward visual acuity and contrast sensitivity (Banks & Salapatek, 1978; Benedek, Benedek, Kéri, & Janáky, 2003; Ellemberg, Lewis, Liu, & Maurer, 1999; Leat, Yadav, & Irving, 2009; Mayer & Dobson, 1982; Norcia & Tyler, 1985; Norcia, Tyler, & Hamer, 1990; Peterzell, Werner, & Kaplan, 1995; Stiers, Vanderkelen, & Vandenbussche, 2003). Visual acuity can be classified into recognition acuity (the detail a person can perceive) and resolution acuity (the separation between dots or gratings, or the spatial frequency, that a person can successfully resolve).

Methods

Results

Conclusion