Abstract

Deep convolutional neural networks (DCNNs) have attracted much attention recently, and have been shown to recognize thousands of object categories in natural image databases. Their architecture is somewhat similar to that of the human visual system: both use restricted receptive fields and a hierarchy of layers that progressively extract increasingly abstract features. Yet it is unknown whether DCNNs match human performance at the task of view-invariant object recognition, whether they make similar errors and use similar representations for this task, and whether the answers depend on the magnitude of the viewpoint variations. To investigate these issues, we benchmarked eight state-of-the-art DCNNs, the HMAX model, and a baseline shallow model and compared their results to those of humans with backward masking. Unlike in all previous DCNN studies, we carefully controlled the magnitude of the viewpoint variations to demonstrate that shallow nets can outperform deep nets and humans when variations are weak. When facing larger variations, however, more layers were needed to match human performance and error distributions, and to have representations that are consistent with human behavior. A very deep net with 18 layers even outperformed humans at the highest variation level, using the most human-like representations.
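
To make the benchmarking protocol concrete, here is a minimal sketch (not the authors' exact pipeline) of how invariant categorization accuracy can be estimated from a network's activations at each viewpoint-variation level, so that model accuracy can be compared with human accuracy at matched variation magnitudes. The `features_by_level` structure and the array shapes are hypothetical; the sketch assumes one layer's activations have already been extracted for every image.

```python
# Minimal sketch: linear readout on pre-extracted DCNN features, one score per variation level.
# Assumes features_by_level[level] -> (X, y): layer activations and category labels (hypothetical).
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

def accuracy_per_level(features_by_level, n_folds=5):
    """Cross-validated categorization accuracy at each viewpoint-variation level."""
    scores = {}
    for level, (X, y) in features_by_level.items():
        clf = make_pipeline(StandardScaler(), LinearSVC(C=1.0, max_iter=10000))
        scores[level] = cross_val_score(clf, X, y, cv=n_folds).mean()
    return scores

# Example with random stand-in data (5 categories, 7 variation levels, 512-d features).
rng = np.random.default_rng(0)
fake = {lvl: (rng.normal(size=(200, 512)), rng.integers(0, 5, size=200)) for lvl in range(7)}
print(accuracy_per_level(fake))
```

A human accuracy curve measured over the same variation levels would then allow a direct level-by-level comparison of the kind described above.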

Highlights

  • Primates excel at view-invariant object recognition[1]

  • This approach led to new findings: (1) deeper was usually better and more human-like, but only in the presence of large variations; (2) some DCNNs reached human performance even with large variations; (3) some DCNNs had error distributions indistinguishable from those of humans; (4) some DCNNs used representations that were more consistent with human responses, and these were not necessarily the top performers.

  • We tested the DCNNs on our invariant object categorization task, which included five object categories, seven variation levels, and two background conditions.


Summary

Deep Networks Can Resemble Human Feed-forward Vision in Invariant Object Recognition (received: 19 August 2015; accepted: 11 August 2016; published: 07 September 2016).

The advantages of our work with respect to previous studies are: (1) we used a larger object database, divided into five categories; (2) most importantly, we controlled and varied the magnitude of the variations in size, position, and in-depth and in-plane rotations; (3) we benchmarked eight state-of-the-art DCNNs, the HMAX model[10] (an early biologically inspired shallow model), and a very simple shallow model that classifies directly from the pixel values (“Pixel”); (4) in our psychophysical experiments, the images were presented briefly and with backward masking, presumably blocking feedback; (5) we performed extensive comparisons between different layers of DCNNs and studied how invariance evolves through the layers; (6) we compared models and humans in terms of performance, error distributions, and representational geometry; and (7) to measure the influence of the background on the invariant object recognition problem, our dataset included both segmented and unsegmented images.

This approach led to new findings: (1) deeper was usually better and more human-like, but only in the presence of large variations; (2) some DCNNs reached human performance even with large variations; (3) some DCNNs had error distributions indistinguishable from those of humans; (4) some DCNNs used representations that were more consistent with human responses, and these were not necessarily the top performers.
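
Point (6) mentions comparing representational geometry; a common way to do this is representational similarity analysis (RSA), sketched below under the assumption that a model layer's pairwise-dissimilarity structure is correlated with a reference dissimilarity matrix derived from human responses. The helpers `rdm` and `rdm_similarity` and the stand-in data are illustrative, not the paper's exact procedure.

```python
# Minimal RSA-style sketch: compare a model layer's representational geometry
# with a reference (e.g., human-derived) dissimilarity matrix.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.stats import spearmanr

def rdm(features):
    """Representational dissimilarity matrix: 1 - Pearson correlation between image pairs."""
    return squareform(pdist(features, metric="correlation"))

def rdm_similarity(model_features, reference_rdm):
    """Spearman correlation between the upper triangles of the two RDMs."""
    m = rdm(model_features)
    iu = np.triu_indices_from(m, k=1)
    rho, _ = spearmanr(m[iu], reference_rdm[iu])
    return rho

# Example with random stand-in data: 50 images, 512-d model features,
# and a reference RDM built from an arbitrary 8-d "behavioral" embedding.
rng = np.random.default_rng(1)
model_features = rng.normal(size=(50, 512))
human_rdm = squareform(pdist(rng.normal(size=(50, 8)), metric="correlation"))
print(rdm_similarity(model_features, human_rdm))
```

Higher correlations would indicate that a layer's representational geometry is more consistent with the human reference, which is the sense in which representations are compared in point (6).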


