Abstract

View-invariant object recognition is a challenging problem that has attracted much attention in the psychology, neuroscience, and computer vision communities. Humans are remarkably good at it, even if some variations are presumably more difficult to handle than others (e.g., 3D rotations). Humans are thought to solve the problem through hierarchical processing along the ventral stream, which progressively extracts more and more invariant visual features. This feed-forward architecture has inspired a new generation of bio-inspired computer vision systems called deep convolutional neural networks (DCNNs), which are currently the best models for object recognition in natural images. Here, for the first time, we systematically compared human feed-forward vision and DCNNs on a view-invariant object recognition task using the same set of images and controlling the kinds of transformation (position, scale, rotation in plane, and rotation in depth) as well as their magnitude, which we call "variation level." We used four object categories: car, ship, motorcycle, and animal. In total, 89 human subjects participated in 10 experiments in which they had to discriminate between two or four categories after rapid presentation with backward masking. We also tested two recent DCNNs (proposed respectively by Hinton's group and Zisserman's group) on the same tasks. We found that humans and DCNNs largely agreed on the relative difficulties of each kind of variation: rotation in depth is by far the hardest transformation to handle, followed by scale, then rotation in plane, and finally position (much easier). This suggests that DCNNs are reasonable models of human feed-forward vision. In addition, our results show that the variation levels in rotation in depth and scale strongly modulate both humans' and DCNNs' recognition performances. We thus argue that these variations should be controlled in the image datasets used in vision research.
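The comparison described above hinges on computing accuracy separately for each kind and magnitude of variation ("variation level"). A minimal sketch of that per-condition tabulation, using hypothetical trial records rather than the authors' actual analysis code:

```python
from collections import defaultdict

def accuracy_by_variation(trials):
    """Return mean accuracy per (transformation, level) condition.

    `trials` is a list of (transformation, level, correct) tuples, where
    `correct` is True/False. These names and the record layout are
    illustrative assumptions, not taken from the study's materials.
    """
    counts = defaultdict(lambda: [0, 0])  # condition -> [hits, total]
    for transformation, level, correct in trials:
        condition = counts[(transformation, level)]
        condition[0] += int(correct)
        condition[1] += 1
    return {cond: hits / total for cond, (hits, total) in counts.items()}

# Example with simulated outcomes at one variation level:
trials = [
    ("position", 1, True), ("position", 1, True), ("position", 1, False),
    ("depth_rotation", 1, True), ("depth_rotation", 1, False),
    ("depth_rotation", 1, False),
]
print(accuracy_by_variation(trials))
```

Averaging within each (transformation, level) cell is what allows the relative difficulty of the four transformations to be ranked separately for humans and for each DCNN.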

Highlights

  • As our viewpoint relative to an object changes, the retinal representation of the object varies tremendously across different dimensions

  • We ran several experiments in which human subjects and deep convolutional neural networks (DCNNs) categorized object images that varied across several dimensions

  • Human accuracy was compared with the accuracy of two well-known deep networks (Krizhevsky et al., 2012; Simonyan and Zisserman, 2014) performing the same tasks as humans

Introduction

As our viewpoint relative to an object changes, the retinal representation of the object varies tremendously across different dimensions. Position and scale invariance exist in the responses of neurons in area V4 (Rust and DiCarlo, 2010), but these invariances considerably increase as visual information propagates to neurons in inferior temporal (IT) cortex (Brincat and Connor, 2004; Hung et al., 2005; Zoccolan et al., 2005, 2007; Rust and DiCarlo, 2010), where responses are highly consistent when an identical object varies across different dimensions (Cadieu et al., 2013, 2014; Yamins et al., 2013; Murty and Arun, 2015). IT cortex is the only area in the ventral stream that encodes three-dimensional transformations, through view-specific (Logothetis et al., 1994, 1995) and view-invariant (Perrett et al., 1991; Booth and Rolls, 1998) responses.
