Abstract
The development of deep convolutional neural networks (CNNs) has recently led to great successes in computer vision, and CNNs have become de facto computational models of vision. However, a growing body of work suggests that they exhibit critical limitations on tasks beyond image categorization. Here, we study one such fundamental limitation, concerning the judgment of whether two simultaneously presented items are the same or different (SD), compared with a baseline assessment of their spatial relationship (SR). In both human subjects and artificial neural networks, we test the prediction that SD tasks recruit additional cortical mechanisms that underlie critical aspects of visual cognition not explained by current computational models. We thus recorded electroencephalography (EEG) signals from human participants engaged in the same tasks as the computational models. Importantly, in humans the two tasks were matched in terms of difficulty by an adaptive psychometric procedure; yet, beyond a modulation of evoked potentials (EPs), our results revealed higher activity in the low β (16–24 Hz) band in the SD compared with the SR condition. We surmise that these oscillations reflect the crucial involvement of additional mechanisms, such as working memory and attention, which are missing in current feed-forward CNNs.
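To make the β-band measure concrete, below is a minimal sketch of how low-β (16–24 Hz) power could be estimated from epoched EEG using NumPy and SciPy. This is not the authors' actual analysis pipeline; the sampling rate, array shapes, and variable names are illustrative assumptions, and only the 16–24 Hz band is taken from the abstract.

```python
import numpy as np
from scipy.signal import welch

# Hypothetical epoched EEG: (n_trials, n_channels, n_samples).
# Shapes and sampling rate are illustrative, not from the study.
fs = 500
rng = np.random.default_rng(0)
epochs = rng.standard_normal((100, 64, fs))  # 100 trials, 64 channels, 1 s each

# Welch power spectral density per trial and channel.
freqs, psd = welch(epochs, fs=fs, nperseg=fs // 2, axis=-1)

# Average PSD within the low-beta band (16-24 Hz, as in the abstract).
band = (freqs >= 16) & (freqs <= 24)
low_beta_power = psd[..., band].mean(axis=-1)  # -> (n_trials, n_channels)

# Contrasting mean low-beta power between SD and SR trials would then be a
# standard statistical comparison (e.g., a paired test across participants).
print(low_beta_power.shape)
```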
Highlights
Our results reveal a significant difference, in both evoked potentials (EPs) and oscillatory dynamics, between the EEG signals measured from human participants performing the two tasks.
The field of artificial vision has witnessed an impressive boost in the last few years, driven by the striking results of deep convolutional neural networks (CNNs).
The effortless ability of humans and other animals (Wasserman et al., 2012; Daniel et al., 2015) to learn SD tasks suggests the possible involvement of additional computations that are lacking in CNNs, possibly supporting item identification or segmentation.
Summary
The field of artificial vision has witnessed an impressive boost in the last few years, driven by the striking results of deep convolutional neural networks (CNNs). Such hierarchical neural networks process information sequentially, through a feedforward cascade of filtering, rectification, and normalization operations. The accuracy of these architectures is approaching, and sometimes exceeding, that of human observers on key visual recognition tasks, including object (He et al., 2016) and face recognition (Phillips et al., 2018). Kim and colleagues demonstrated that CNNs learn spatial-relation (SR) problems far more readily than same-different (SD) problems.
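As a concrete illustration of the feedforward cascade described above (filtering, rectification, normalization), here is a minimal PyTorch sketch. The layer sizes and the two-way output are arbitrary assumptions for illustration; this is not one of the specific networks evaluated in the cited studies.

```python
import torch
import torch.nn as nn

# Minimal feedforward CNN illustrating the cascade described above:
# filtering (Conv2d), rectification (ReLU), normalization (BatchNorm2d).
# Layer sizes are arbitrary; this is not a network from the cited work.
class TinyCNN(nn.Module):
    def __init__(self, n_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5),  # filtering
            nn.ReLU(),                        # rectification
            nn.BatchNorm2d(16),               # normalization
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3),
            nn.ReLU(),
            nn.BatchNorm2d(32),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(32, n_classes)  # e.g., "same" vs "different"

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Information flows strictly forward: no recurrence, no feedback.
        h = self.features(x).flatten(1)
        return self.classifier(h)

model = TinyCNN()
out = model(torch.randn(4, 1, 64, 64))  # 4 grayscale images -> 4 x 2 logits
print(out.shape)
```

The key design point, relevant to the paper's argument, is the purely feedforward flow: each stage sees only the output of the stage below it, with no recurrent or top-down connections of the kind that working memory or attention might require.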