Abstract

Deep Convolutional Neural Networks (CNNs) are gaining traction as the benchmark model of visual object recognition, with performance on some recognition benchmarks now surpassing that of humans. While CNNs can accurately assign one image to potentially thousands of categories, network performance could be the result of layers that are tuned to represent the visual shape of objects, rather than object category, since both are often confounded in natural images. Using two stimulus sets that explicitly dissociate shape from category, we correlate these two types of information with each layer of multiple CNNs. We also compare CNN output with fMRI activation along the human visual ventral stream by correlating artificial with neural representations. We find that CNNs encode category information independently from shape, peaking at the final fully connected layer in all tested CNN architectures. When CNNs are compared with fMRI brain data, early visual cortex (V1) and the early layers of CNNs both encode shape information, whereas anterior ventral temporal cortex encodes category information, which correlates best with the final layer of CNNs. The interaction between shape and category found along the human visual ventral pathway is echoed in multiple deep networks. Our results suggest that CNNs represent category information independently from shape, much like the human visual system.

Highlights

  • While these feats are impressive, it is unclear to what extent these results are interpretable in terms of categorical representations

  • For Set A, we found a significant correlation between the behavioural models for shape and category (Spearman’s ρ = 0.4753, p < 0.001, permutation test with 1000 randomisations of stimulus labels), so partial correlations were performed when carrying out Representational Similarity Analysis (RSA) with the Set A models

  • Using GIST descriptors [31] of each image combined with Linear Discriminant Analysis (LDA), we confirmed that category could not be predicted from these low-level descriptors whereas shape could, demonstrating that our stimulus sets were properly orthogonalised
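The label-permutation test and partial-correlation step described in the highlights can be sketched as follows. This is a minimal illustration with synthetic representational dissimilarity matrices (RDMs); the toy data, function names, and matrix sizes are assumptions for demonstration, not the authors' code or stimuli:

```python
import numpy as np
from scipy.stats import spearmanr, rankdata, pearsonr

rng = np.random.default_rng(0)
n_stim = 20  # hypothetical number of stimuli

def symmetric_rdm(mat):
    """Force a square matrix into a symmetric dissimilarity matrix."""
    mat = (mat + mat.T) / 2
    np.fill_diagonal(mat, 0.0)
    return mat

# Toy behavioural models: the category model partly overlaps with the
# shape model, mimicking the confound reported for Set A.
shape_rdm = symmetric_rdm(rng.random((n_stim, n_stim)))
category_rdm = symmetric_rdm(0.5 * shape_rdm + 0.5 * rng.random((n_stim, n_stim)))

iu = np.triu_indices(n_stim, k=1)  # vectorise the upper triangle of each RDM

def permutation_spearman(rdm_a, rdm_b, n_perm=1000, rng=rng):
    """Spearman's rho between two RDMs, with a p-value obtained by
    randomly permuting the stimulus labels (rows/columns) of one RDM."""
    observed, _ = spearmanr(rdm_a[iu], rdm_b[iu])
    exceed = 0
    for _ in range(n_perm):
        perm = rng.permutation(n_stim)
        shuffled = rdm_a[np.ix_(perm, perm)]
        null_rho, _ = spearmanr(shuffled[iu], rdm_b[iu])
        if abs(null_rho) >= abs(observed):
            exceed += 1
    return observed, (exceed + 1) / (n_perm + 1)

def partial_spearman(x, y, z):
    """Spearman correlation of x and y with z partialled out:
    correlate the residuals of the rank-transformed vectors."""
    rx, ry, rz = rankdata(x), rankdata(y), rankdata(z)
    design = np.column_stack([rz, np.ones_like(rz)])
    def residuals(a):
        coef, *_ = np.linalg.lstsq(design, a, rcond=None)
        return a - design @ coef
    r, _ = pearsonr(residuals(rx), residuals(ry))
    return r

rho, p = permutation_spearman(shape_rdm, category_rdm)

# RSA with partial correlation: relate a (hypothetical) neural RDM to the
# shape model while controlling for the confounded category model.
neural_rdm = symmetric_rdm(rng.random((n_stim, n_stim)))
rho_partial = partial_spearman(neural_rdm[iu], shape_rdm[iu], category_rdm[iu])
```

Permuting stimulus labels (rather than shuffling the vectorised RDM entries) preserves the dependence structure among cells of the matrix, which is why it is the standard null model for RSA-style comparisons.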


Introduction

While these feats are impressive, it is unclear to what extent these results are interpretable in terms of categorical representations. CNNs rely more heavily upon local texture information for classification, a tendency known as texture bias, which may cause a greater discrepancy in performance than shape bias[21]. Given that these networks are adept at representing object shape, to a degree that may even exceed humans, it is possible they are exploiting shape-based features, rather than category information, to classify object images. We therefore compare artificial representations with human fMRI responses for the same two stimulus sets, to evaluate how closely CNNs reflect biological representations.

