Abstract
The investigation of visual categorization has recently been aided by the introduction of deep convolutional neural networks (CNNs), which achieve unprecedented accuracy in picture classification after extensive training. Even if the architecture of CNNs is inspired by the organization of the visual brain, the similarity between CNN and human visual processing remains unclear. Here, we investigated this issue by engaging humans and CNNs in a two‐class visual categorization task. To this end, pictures containing animals or vehicles were modified to contain only low/high spatial frequency (HSF) information, or were scrambled in the phase of the spatial frequency spectrum. For all types of degradation, accuracy increased as degradation was reduced for both humans and CNNs; however, the thresholds for accurate categorization varied between humans and CNNs. More remarkable differences were observed for HSF information compared to the other two types of degradation, both in terms of overall accuracy and image‐level agreement between humans and CNNs. The difficulty with which the CNNs were shown to categorize high‐passed natural scenes was reduced by picture whitening, a procedure which is inspired by how visual systems process natural images. The results are discussed concerning the adaptation to regularities in the visual environment (scene statistics); if the visual characteristics of the environment are not learned by CNNs, their visual categorization may depend only on a subset of the visual information on which humans rely, for example, on low spatial frequency information.
Highlights
Making sense of the world and taking appropriate decisions is essential for adaptive behavior and survival
For low-passed pictures, all convolutional neural networks (CNNs) showed a psychometric function, which was shifted toward higher low-pass cutoffs compared with human participants, indicating that less degraded pictures were needed in order to achieve a good categorization performance
In the experiments with nonwhitened pictures examined here, we focus on the performance of CNNs, and observe that in most high spatial frequency (HSF) conditions, accuracy is considerably lower than that of humans, and that the difference compared to humans is larger for HSF than for low-pass filtering (LSF)
Summary
Making sense of the world and taking appropriate decisions is essential for adaptive behavior and survival. Concerning vision, this means making sense of the light and shade which is projected onto the retina. How visual understanding is achieved is the object of study in diverse disciplines, such as general psychology, visual neuroscience, and computer science. These disciplines have shown a converging interest in artificial simulations of vision, namely deep convolutional neural networks (CNNs), which compete with humans in terms of capability to classify visual scenes (e.g., He, Zhang, Ren, & Sun, 2016).
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.