How well do Deep Neural Networks model Human Vision?

John Clevenger, Diane Beck

doi:10.1167/16.12.176

Abstract

Recently there has been dramatic improvement in computer-vision object recognition. In the 2015 ImageNet challenge, the best-performing model (GoogLeNet) had a top-5 classification accuracy of 93%, a 20% improvement over 2010. This increase is due to the continued development of convolutional neural networks (CNNs). Despite these advances, it is unclear whether these biologically inspired models recognize objects the way humans do. To begin investigating this question, we compared GoogLeNet and human performance on the same images. If humans and CNNs share recognition processes, the same images should be difficult or easy for both. We used images taken from the 2015 ImageNet challenge, spanning a variety of categories. Importantly, half were images that GoogLeNet classified correctly in the 2015 ImageNet challenge and half were images that it classified incorrectly. We then tested human performance on these images using a cued detection task. To avoid ceiling effects, the images were presented briefly (< 100 ms, determined per subject) and masked. A category name was shown either before or after the image, and people were asked whether or not the image matched the category (which it did half the time). We found that people required 2.5 times more exposure time to recognize images when the category was cued before the image rather than after, consistent with a role for top-down knowledge/expectation in human recognition. However, at the image level, accuracy was highly correlated across pre- and post-cue conditions (r = .82), indicating that some images are harder than others regardless of how they are cued. Importantly, people were substantially better at recognizing images that GoogLeNet categorized correctly (85%) than images it categorized incorrectly (58%). This may suggest shared processes. However, within the set of images that GoogLeNet got incorrect, human performance ranged from 9% to 100%, indicating a substantial divergence between human and machine performance.

Meeting abstract presented at VSS 2016
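
The two image-level comparisons described in the abstract (the correlation of accuracy across pre- and post-cue conditions, and human accuracy split by whether the model classified the image correctly) could be computed along the following lines. This is only an illustrative sketch, not the authors' analysis code: the field names (`model_correct`, `pre_cue_acc`, `post_cue_acc`), the choice to average the two cue conditions, and all data values are assumptions made up for the example.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences.

    Assumes both sequences have non-zero variance.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def mean_accuracy_by_model_outcome(images):
    """Mean human accuracy, grouped by whether the model was correct.

    Human accuracy per image is taken as the average of the two cue
    conditions (an assumption made for this sketch).
    """
    groups = {True: [], False: []}
    for img in images:
        human_acc = (img["pre_cue_acc"] + img["post_cue_acc"]) / 2
        groups[img["model_correct"]].append(human_acc)
    return {outcome: sum(accs) / len(accs) for outcome, accs in groups.items()}


# Hypothetical per-image accuracies (made-up values, not the study's data).
images = [
    {"model_correct": True, "pre_cue_acc": 0.9, "post_cue_acc": 0.8},
    {"model_correct": True, "pre_cue_acc": 0.8, "post_cue_acc": 0.9},
    {"model_correct": False, "pre_cue_acc": 0.2, "post_cue_acc": 0.1},
    {"model_correct": False, "pre_cue_acc": 1.0, "post_cue_acc": 1.0},
]

# Image-level correlation between the two cue conditions.
r = pearson_r(
    [img["pre_cue_acc"] for img in images],
    [img["post_cue_acc"] for img in images],
)

# Human accuracy on model-correct versus model-incorrect images.
by_outcome = mean_accuracy_by_model_outcome(images)
```

In this toy data the model-incorrect group contains one image people almost never recognize and one they always recognize, which mirrors the kind of wide within-group spread the abstract reports: a group mean alone can hide large image-to-image differences.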
