Abstract

The human visual system contains features that respond preferentially to animate versus inanimate objects. A range of possibilities exists for how these features represent entities within each domain. At one extreme, animate and inanimate features have entirely distinct representational geometries, each specialized for entities within its domain. At the other extreme, the two sets of features have identical representational geometries. Here, we investigated these possibilities in object-classifying neural networks. We found that a large proportion of the features in each layer of AlexNet respond selectively to either animate (17-45%) or inanimate objects (5-21%). Next, we compared the within-domain representational geometries of the animate versus inanimate feature subspaces. Across the layers of AlexNet, the two feature subspaces differed widely in the correlation of their representational geometries (r = -0.11 to 0.68). However, in every layer, the two subspaces were less correlated than randomly selected subspaces of the same size. This was true at the category level (e.g., cow vs. donkey) and the exemplar level (e.g., one cow vs. another). To compare these subspaces to human perception, we measured the pairwise visual similarity of 72 animal categories in a visual search experiment (n = 248). Rather than the animate subspace corresponding uniquely to animal perception, the animate- and inanimate-selective subspaces both exhibited moderate correlations with the visual similarity of animals (Spearman rho ranges: 0.43-0.59 and 0.27-0.65; noise ceiling: rho = 0.91). Thus, AlexNet has animate- and inanimate-selective subspaces with partially distinct representational geometries. However, AlexNet's animate subspace at best showed a moderate, non-specific correspondence to animal perception in humans. This pattern of results may arise because AlexNet learns different visual features than humans do and/or because the human visual system does not contain unique representational geometries for animate and inanimate objects. Future neuroimaging work will attempt to tease apart these possibilities.
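
To make the analysis pipeline concrete, below is a minimal sketch, assuming a standard representational similarity analysis (RSA) workflow, of how one might identify animate- and inanimate-selective units in a single AlexNet layer, correlate the representational geometries of the two selective subspaces, and compare them against size-matched random subspaces. This is not the authors' code: the activation arrays are placeholders, and the d'-style selectivity criterion, the 0.1 threshold, and the correlation-distance RDMs are illustrative assumptions.

```python
# Sketch (not the authors' pipeline): find animate- vs inanimate-selective
# units in one AlexNet layer, then correlate the representational
# geometries (RDMs) of the two selective subspaces over animate images.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n_units = 4096  # e.g., an fc6-sized layer (assumption)

# Placeholder activations: (images x units) for each domain's image set.
animate_acts = rng.random((100, n_units))
inanimate_acts = rng.random((100, n_units))

# Selectivity: a d'-style contrast of mean responses across domains.
diff = animate_acts.mean(0) - inanimate_acts.mean(0)
pooled_sd = np.sqrt((animate_acts.var(0) + inanimate_acts.var(0)) / 2) + 1e-9
d_prime = diff / pooled_sd
animate_units = d_prime > 0.1    # assumed criterion; a real analysis
inanimate_units = d_prime < -0.1  # would use a stricter, held-out test

# Within-domain RDMs over the animate images, restricted to each subspace
# (pdist returns the condensed upper triangle of the dissimilarity matrix).
rdm_animate_sub = pdist(animate_acts[:, animate_units], metric="correlation")
rdm_inanimate_sub = pdist(animate_acts[:, inanimate_units], metric="correlation")
rho, _ = spearmanr(rdm_animate_sub, rdm_inanimate_sub)

# Baseline: randomly selected subspaces of the same sizes (may overlap).
rand_a = rng.choice(n_units, size=int(animate_units.sum()), replace=False)
rand_b = rng.choice(n_units, size=int(inanimate_units.sum()), replace=False)
rho_rand, _ = spearmanr(
    pdist(animate_acts[:, rand_a], metric="correlation"),
    pdist(animate_acts[:, rand_b], metric="correlation"),
)
print(f"selective subspaces rho = {rho:.2f}, random baseline rho = {rho_rand:.2f}")
```

In this sketch, the abstract's key comparison corresponds to rho falling below rho_rand in a given layer; the same RDM machinery, applied at the category or exemplar level, would support the category- and exemplar-level comparisons described above.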
