Abstract

Recognizing the actions of others from visual stimuli is a crucial aspect of human perception that allows individuals to respond to social cues. Humans are able to discriminate between similar actions despite transformations, such as changes in viewpoint or actor, that substantially alter the visual appearance of a scene. This ability to generalize across complex transformations is a hallmark of human visual intelligence. Advances in understanding action recognition at the neural level have not always translated into precise accounts of the computational principles underlying the representations of action sequences constructed by human visual cortex. Here we test the hypothesis that invariant action discrimination might fill this gap. Recently, the study of artificial systems for static object perception has produced models, Convolutional Neural Networks (CNNs), that achieve human-level performance in complex discriminative tasks. Within this class, architectures that better support invariant object recognition also produce image representations that better match those implied by human and primate neural data. However, whether these models produce representations of action sequences that support recognition across complex transformations and closely follow neural representations of actions remains unknown. Here we show that spatiotemporal CNNs accurately categorize video stimuli into action classes, and that deliberate model modifications that improve performance on an invariant action recognition task lead to data representations that better match human neural recordings. Our results support the hypothesis that performance on invariant discrimination dictates the neural representations of actions computed in the brain. These results broaden the scope of the invariant recognition framework for understanding visual intelligence, from the perception of inanimate objects and faces in static images to the study of human perception of action sequences.

Highlights

  • Humans’ ability to recognize the actions of others is a crucial aspect of visual perception

  • Within the class of spatiotemporal Convolutional Neural Networks (ST-CNNs), deliberate model modifications that lead to video representations better supporting robust action discrimination also produce representations that better match human neural data

  • Increasing model performance on invariant action recognition leads to a better match with human neural data, despite the model never being exposed to such data (one way such a match can be quantified is sketched below). These results suggest that, analogously to what is known for object recognition, supporting invariant discrimination within the constraints of hierarchical ST-CNN architectures drives the neural mechanisms underlying our ability to perceive the actions of others
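
One common way to quantify how well a model's representations of videos match neural data is representational similarity analysis (RSA): the pairwise dissimilarity structure of the model's features is correlated with that of neural responses to the same stimuli. The following is a minimal, illustrative sketch using random placeholder data; the actual stimuli, recordings, and comparison procedure used in the study may differ.

    # Illustrative sketch of a model-vs-neural comparison via representational
    # similarity analysis (RSA). All data below are random placeholders.
    import numpy as np
    from scipy.spatial.distance import pdist
    from scipy.stats import spearmanr

    n_videos = 20
    model_features = np.random.rand(n_videos, 128)    # one feature vector per video
    neural_responses = np.random.rand(n_videos, 306)  # e.g. sensor responses per video

    # Representational dissimilarity matrices (in condensed form): pairwise
    # distances between the responses evoked by different videos.
    model_rdm = pdist(model_features, metric="correlation")
    neural_rdm = pdist(neural_responses, metric="correlation")

    # The Spearman correlation between the two RDMs quantifies how well the
    # model's representational geometry matches the neural one.
    rho, _ = spearmanr(model_rdm, neural_rdm)
    print(f"model-neural RDM correlation: {rho:.3f}")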

Introduction

Humans’ ability to recognize the actions of others is a crucial aspect of visual perception. Over the past few decades, artificial systems for action processing have received considerable attention. These methods can be divided into global and local approaches. Local approaches, on the other hand, extract information from video sequences in a bottom-up fashion, by detecting, in their input video, the presence of features that are local in space and time. These local descriptors are then combined, following a hierarchical architecture, to construct more complex representations [8,9,10]. A specific class of bottom-up, local architectures, spatiotemporal Convolutional Neural Networks (ST-CNNs), as well as their recursive extensions [11], are currently the best-performing models on action recognition tasks.
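
To make the notion of an ST-CNN concrete, the sketch below stacks 3D convolutions, which detect features that are local in both space and time, into a small hierarchy topped by an action classifier. This is an illustrative toy model written with PyTorch, not the architecture evaluated in this work; the layer sizes, kernel shapes, and number of action classes are assumptions.

    # Minimal spatiotemporal CNN (ST-CNN) sketch for action classification.
    # Illustrative only: sizes and hyperparameters are arbitrary assumptions.
    import torch
    import torch.nn as nn

    class SimpleSTCNN(nn.Module):
        def __init__(self, num_classes=5):
            super().__init__()
            self.features = nn.Sequential(
                # 3D convolutions detect features that are local in space and time.
                nn.Conv3d(3, 16, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
                nn.ReLU(),
                nn.MaxPool3d(kernel_size=(1, 2, 2)),
                # A second stage combines local descriptors into more complex,
                # hierarchical representations.
                nn.Conv3d(16, 32, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
                nn.ReLU(),
                nn.AdaptiveAvgPool3d(1),  # pool over remaining time and space
            )
            self.classifier = nn.Linear(32, num_classes)

        def forward(self, clips):
            # clips: (batch, channels, frames, height, width)
            return self.classifier(self.features(clips).flatten(1))

    # Example: classify a batch of two 16-frame RGB clips at 64x64 resolution.
    model = SimpleSTCNN(num_classes=5)
    logits = model(torch.randn(2, 3, 16, 64, 64))
    print(logits.shape)  # torch.Size([2, 5])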
