Encoding of Numerosity With Robustness to Object and Scene Identity in Biologically Inspired Object Recognition Networks
Number sense, the ability to rapidly estimate object quantities in a visual scene without precise counting, is a crucial cognitive capacity found in humans and many other animals. Recent studies have identified artificial neurons tuned to numbers of items in biologically inspired vision models, even before training, and proposed these artificial neural networks as candidate models for the emergence of number sense in the brain. But real-world numerosity perception requires abstraction from the properties of individual objects and their contexts, unlike the simplified dot patterns used in previous studies. Using novel, synthetically generated photorealistic stimuli, we show that deep convolutional neural networks optimized for object recognition encode information about approximate numerosity across diverse objects and scene types, information that could be linearly read out from the distributed activity patterns of later convolutional layers in each of the network architectures tested. In contrast, untrained networks with random weights failed to represent numerosity with abstractness to other visual properties and instead captured mainly low-level visual features. Our findings emphasize the importance of using complex, naturalistic stimuli to investigate mechanisms of number sense in both biological and artificial systems, and they suggest that the capacity of untrained networks to account for early-life numerical abilities should be reassessed. They further point to a possible, so far underappreciated, contribution of the brain's ventral visual pathway to representing numerosity with abstractness to other high-level visual properties.
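To make the linear-readout analysis concrete, here is a minimal sketch in Python, assuming an ImageNet-trained VGG16 from torchvision; the layer index, ridge classifier, and random stand-in stimuli are illustrative assumptions, not the authors' pipeline.

```python
# Sketch: linear readout of numerosity from a later conv layer of an
# ImageNet-trained CNN (assumed setup, not the authors' exact code).
import torch
import numpy as np
import torchvision.models as models
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import cross_val_score

vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).eval()

def layer_activations(images, layer_idx=28):
    """Flattened distributed activity pattern of one later conv layer."""
    x = images
    with torch.no_grad():
        for i, module in enumerate(vgg.features):
            x = module(x)
            if i == layer_idx:
                return x.flatten(start_dim=1).numpy()

# Stand-in stimuli: replace with the photorealistic image set and its
# per-image object counts (here, 8 images each of numerosities 1..8).
images = torch.randn(64, 3, 224, 224)
numerosities = np.repeat(np.arange(1, 9), 8)

X = layer_activations(images)
acc = cross_val_score(RidgeClassifier(alpha=1.0), X, numerosities, cv=4).mean()
print(f"cross-validated linear numerosity readout accuracy: {acc:.2f}")
```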
- Peer Review Report
5
- 10.7554/elife.71736.sa2
- Feb 9, 2022
Successful engagement with the world requires the ability to predict what will happen next. Here, we investigate how the brain makes a fundamental prediction about the physical world: whether the situation in front of us is stable, and hence likely to stay the same, or unstable, and hence likely to change in the immediate future. Specifically, we ask if judgments of stability can be supported by the kinds of representations that have proven to be highly effective at visual object recognition in both machines and brains, or instead if the ability to determine the physical stability of natural scenes may require generative algorithms that simulate the physics of the world. To find out, we measured responses in both convolutional neural networks (CNNs) and the brain (using fMRI) to natural images of physically stable versus unstable scenarios. We find no evidence for generalizable representations of physical stability either in standard CNNs trained on visual object and scene classification (ImageNet) or in the human ventral visual pathway, which has long been implicated in the same process. However, in frontoparietal regions previously implicated in intuitive physical reasoning we find both scenario-invariant representations of physical stability, and higher univariate responses to unstable than stable scenes. These results demonstrate abstract representations of physical stability in the dorsal but not ventral pathway, consistent with the hypothesis that the computations underlying stability entail not just pattern classification but forward physical simulation.
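A sketch of the cross-scenario generalization logic described here: a classifier trained to separate stable from unstable response patterns in one scenario type is tested on another, so that only scenario-invariant stability information supports above-chance accuracy. All arrays and the scenario names are stand-in assumptions.

```python
# Sketch: scenario-invariant decoding of physical stability
# (assumed data layout, not the study's actual pipeline).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Stand-in data: replace with CNN-layer or fMRI response patterns.
patterns = rng.standard_normal((120, 500))     # (trials, features)
stable = rng.integers(0, 2, 120).astype(bool)  # stable vs. unstable label
scenario = np.where(np.arange(120) < 60, "towers", "people")

def cross_scenario_accuracy(train_on, test_on):
    """Train on one scenario type, test on another: above-chance
    accuracy requires scenario-invariant stability information."""
    clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    tr, te = scenario == train_on, scenario == test_on
    clf.fit(patterns[tr], stable[tr])
    return clf.score(patterns[te], stable[te])  # chance = 0.5

print(cross_scenario_accuracy("towers", "people"))
```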
- Research Article
182
- 10.1016/s0896-6273(00)80824-7
- Sep 1, 1999
- Neuron
Are cortical models really bound by the "binding problem"?
- Peer Review Report
- 10.7554/elife.69736.sa1
- Jun 3, 2021
Decision letter: Causal neural mechanisms of context-based object recognition
- Research Article
14
- 10.1023/a:1008088813977
- Sep 1, 1998
- International Journal of Computer Vision
A major problem in object recognition is that a novel image of a given object can be different from all previously seen images. Images can vary considerably due to changes in viewing conditions such as viewing position and illumination. In this paper we distinguish between three types of recognition schemes by the level at which generalization to novel images takes place: universal, class, and model-based. The first is applicable equally to all objects, the second to a class of objects, and the third uses known properties of individual objects. We derive theoretical limitations on each of the three generalization levels. For the universal level, previous results have shown that no invariance can be obtained. Here we show that this limitation holds even when the assumptions made on the objects and the recognition functions are relaxed. We also extend the results to changes of illumination direction. For the class level, previous studies presented specific examples of classes of objects for which functions invariant to viewpoint exist. Here, we distinguish between classes that admit such invariance and classes that do not. We demonstrate that there is a tradeoff between the set of objects that can be discriminated by a given recognition function and the set of images from which the recognition function can recognize these objects. Furthermore, we demonstrate that although functions that are invariant to illumination direction do not exist at the universal level, when the objects are restricted to belong to a given class, a function invariant to illumination direction can be defined. A general conclusion of this study is that class-based processing, which has not been used extensively in the past, is often advantageous for dealing with variations due to viewpoint and illuminant changes.
- Preprint Article
- 10.32920/ryerson.14651541.v1
- May 23, 2021
Object recognition has become a central topic in computer vision applications such as image search, robotics and vehicle safety systems. However, it is a challenging task due to the limited discriminative power of low-level visual features in describing the considerably diverse range of high-level visual semantics of objects. This semantic gap between low-level visual features and high-level concepts is a bottleneck in most systems, and new content analysis models need to be developed to bridge it. In this thesis, algorithms based on conditional random fields (CRF) from the class of probabilistic graphical models are developed to tackle the problem of multiclass image labeling for object recognition. Image labeling assigns a specific semantic category from a predefined set of object classes to each pixel in the image. By capturing the spatial interactions of visual concepts, CRF modeling has proved to be a successful tool for image labeling. This thesis proposes novel approaches to strengthening CRF modeling for robust image labeling. Our primary contributions are twofold. To better represent the feature distributions of CRF potentials, new feature functions based on generalized Gaussian mixture models (GGMM) are designed and their efficacy is investigated. Owing to its shape parameter, the GGMM can properly fit the multi-modal and skewed distributions of data in natural images. The new model proves more successful than Gaussian and Laplacian mixture models, and it also outperforms a deep neural network model on the Corel image set by 1% in accuracy. Further, we apply scene-level contextual information to integrate the global visual semantics of the image with the pixel-wise dense inference of a fully connected CRF, preserving small foreground objects and making dense inference robust to initial misclassifications of the unary classifier. The proposed inference algorithm factorizes the joint probability of the labeling configuration and the image scene type to obtain update equations for labeling individual pixels and for predicting the overall scene type. The proposed context-based dense CRF model outperforms the conventional dense CRF model in labeling accuracy by about 2% on the MSRC image set and by 4% on the SIFT Flow image set, and it obtains the highest scene classification rate of 86% on the MSRC dataset.
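For concreteness, a sketch of the generalized Gaussian density underlying the proposed GGMM feature functions: the shape parameter recovers a Laplacian at beta = 1 and a Gaussian at beta = 2, which is what lets the mixture fit skewed, heavy-tailed feature statistics. Parameter names are generic, not the thesis's own notation.

```python
# Sketch: generalized Gaussian mixture model (GGMM) density.
# beta = 2 gives a Gaussian component, beta = 1 a Laplacian; other shape
# values trade off heavier or lighter tails.
import numpy as np
from scipy.special import gamma

def gg_pdf(x, mu, alpha, beta):
    """Generalized Gaussian density with mean mu, scale alpha, shape beta."""
    coef = beta / (2.0 * alpha * gamma(1.0 / beta))
    return coef * np.exp(-(np.abs(x - mu) / alpha) ** beta)

def ggmm_pdf(x, weights, mus, alphas, betas):
    """Mixture density: weighted sum of generalized Gaussian components."""
    return sum(w * gg_pdf(x, m, a, b)
               for w, m, a, b in zip(weights, mus, alphas, betas))

# Example: a two-component mixture with one heavy-tailed component.
x = np.linspace(-5, 5, 1001)
p = ggmm_pdf(x, weights=[0.6, 0.4], mus=[-1.0, 2.0],
             alphas=[1.0, 0.5], betas=[1.5, 2.0])
```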
- Research Article
5
- 10.1038/s42003-023-05565-9
- Nov 27, 2023
- Communications Biology
Visual object recognition has been traditionally conceptualised as a predominantly feedforward process through the ventral visual pathway. While feedforward artificial neural networks (ANNs) can achieve human-level classification on some image-labelling tasks, it is unclear whether computational models of vision alone can accurately capture the evolving spatiotemporal neural dynamics. Here, we probe these dynamics using a combination of representational similarity and connectivity analyses of fMRI and MEG data recorded during the recognition of familiar, unambiguous objects. Modelling the visual and semantic properties of our stimuli using an artificial neural network as well as a semantic feature model, we find that unique aspects of the neural architecture and connectivity dynamics relate to visual and semantic object properties. Critically, we show that recurrent processing between the anterior and posterior ventral temporal cortex relates to higher-level visual properties prior to semantic object properties, in addition to semantic-related feedback from the frontal lobe to the ventral temporal lobe between 250 and 500 ms after stimulus onset. These results demonstrate the distinct contributions made by semantic object properties in explaining neural activity and connectivity, highlighting semantic processing as a core part of object recognition not fully accounted for by current biologically inspired neural networks.
- Research Article
36
- 10.1016/j.cub.2023.10.015
- Nov 1, 2023
- Current Biology
Recent theoretical work has argued that in addition to the classical ventral (what) and dorsal (where/how) visual streams, there is a third visual stream on the lateral surface of the brain specialized for processing social information. Like visual representations in the ventral and dorsal streams, representations in the lateral stream are thought to be hierarchically organized. However, no prior studies have comprehensively investigated the organization of naturalistic, social visual content in the lateral stream. To address this question, we curated a naturalistic stimulus set of 250 3-s videos of two people engaged in everyday actions. Each clip was richly annotated for its low-level visual features, mid-level scene and object properties, visual social primitives (including the distance between people and the extent to which they were facing), and high-level information about social interactions and affective content. Using a condition-rich fMRI experiment and a within-subject encoding model approach, we found that low-level visual features are represented in early visual cortex (EVC) and middle temporal (MT) area, mid-level visual social features in extrastriate body area (EBA) and lateral occipital complex (LOC), and high-level social interaction information along the superior temporal sulcus (STS). Communicative interactions, in particular, explained unique variance in regions of the STS after accounting for variance explained by all other labeled features. Taken together, these results provide support for representation of increasingly abstract social visual content, consistent with hierarchical organization, along the lateral visual stream and suggest that recognizing communicative actions may be a key computational goal of the lateral visual pathway.
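A sketch of the unique-variance logic behind the encoding-model analysis: fit a cross-validated ridge model on all labeled features, refit without the feature set of interest, and take the drop in R^2. Data shapes, column indices, and the single-voxel framing are assumptions, not the study's implementation.

```python
# Sketch: voxelwise encoding model with a unique-variance
# (variance-partitioning) test for one feature set.
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
# Stand-in data: replace with the per-clip annotations and voxel responses.
all_feats = rng.standard_normal((250, 40))  # (clips, labeled features)
voxel_resp = rng.standard_normal(250)       # one voxel's response per clip
comm_cols = [38, 39]                        # hypothetical communicative columns

def encoding_r2(features, y):
    """Cross-validated R^2 of a ridge encoding model."""
    pred = cross_val_predict(RidgeCV(alphas=np.logspace(-2, 4, 7)),
                             features, y, cv=5)
    return 1.0 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)

# Unique variance: drop in R^2 when the feature set of interest is removed.
unique = (encoding_r2(all_feats, voxel_resp)
          - encoding_r2(np.delete(all_feats, comm_cols, axis=1), voxel_resp))
```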
- Research Article
56
- 10.1016/j.neucom.2020.06.078
- Jun 24, 2020
- Neurocomputing
A multi-path adaptive fusion network for multimodal brain tumor segmentation
- Research Article
159
- 10.1162/jocn_a_00924
- May 1, 2016
- Journal of Cognitive Neuroscience
Objects belonging to different categories evoke reliably different fMRI activity patterns in human occipitotemporal cortex, with the most prominent distinction being that between animate and inanimate objects. An unresolved question is whether these categorical distinctions reflect category-associated visual properties of objects or whether they genuinely reflect object category. Here, we addressed this question by measuring fMRI responses to animate and inanimate objects that were closely matched for shape and low-level visual features. Univariate contrasts revealed animate- and inanimate-preferring regions in ventral and lateral temporal cortex even for individually matched object pairs (e.g., snake-rope). Using representational similarity analysis, we mapped out brain regions in which the pairwise dissimilarity of multivoxel activity patterns (neural dissimilarity) was predicted by the objects' pairwise visual dissimilarity and/or their categorical dissimilarity. Visual dissimilarity was measured as the time it took participants to find a unique target among identical distractors in three visual search experiments, where we separately quantified overall dissimilarity, outline dissimilarity, and texture dissimilarity. All three visual dissimilarity structures predicted neural dissimilarity in regions of visual cortex. Interestingly, these analyses revealed several clusters in which categorical dissimilarity predicted neural dissimilarity after regressing out visual dissimilarity. Together, these results suggest that the animate-inanimate organization of human visual cortex is not fully explained by differences in the characteristic shape or texture properties of animals and inanimate objects. Instead, representations of visual object properties and object category may coexist in more anterior parts of the visual system.
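A sketch of the "regressing out" step of this representational similarity analysis: correlate the categorical model RDM with the neural RDM after both have been residualized against visual dissimilarity. All inputs are stand-ins for the measured patterns and search-time dissimilarities.

```python
# Sketch: does categorical dissimilarity predict neural dissimilarity
# after regressing out visual dissimilarity? (assumed RDM inputs)
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
# Stand-in data: replace with multivoxel patterns and model dissimilarities.
patterns = rng.standard_normal((48, 200))  # (objects, voxels)
n_pairs = 48 * 47 // 2
visual_rdm = rng.random(n_pairs)           # pairwise visual dissimilarity
category_rdm = rng.random(n_pairs)         # pairwise categorical dissimilarity

neural_rdm = pdist(patterns, metric="correlation")

def residualize(y, x):
    """Residuals of y after linear regression on x (intercept included)."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

# Category-neural correlation with visual dissimilarity partialed out.
rho, p = spearmanr(residualize(neural_rdm, visual_rdm),
                   residualize(category_rdm, visual_rdm))
```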
- Book Chapter
14
- 10.1016/bs.plm.2019.03.002
- Jan 1, 2019
Neural dynamics of visual and semantic object processing
- Research Article
16
- 10.1093/texcom/tgad003
- Jan 6, 2023
- Cerebral Cortex Communications
Despite their anatomical and functional distinctions, there is growing evidence that the dorsal and ventral visual pathways interact to support object recognition. However, the exact nature of these interactions remains poorly understood. Is the presence of identity-relevant object information in the dorsal pathway simply a byproduct of ventral input? Or, might the dorsal pathway be a source of input to the ventral pathway for object recognition? In the current study, we used high-density EEG, a technique with high temporal precision and spatial resolution sufficient to distinguish parietal and temporal lobes, to characterise the dynamics of dorsal and ventral pathways during object viewing. Using multivariate analyses, we found that category decoding in the dorsal pathway preceded that in the ventral pathway. Importantly, the dorsal pathway predicted the multivariate responses of the ventral pathway in a time-dependent manner, rather than the other way around. Together, these findings suggest that the dorsal pathway is a critical source of input to the ventral pathway for object recognition.
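A sketch of time-resolved decoding of the kind used to compare dorsal and ventral dynamics: decode object category at each time point, separately for parietal and temporal sensor selections, and compare onset latencies. Sensor indices and data are stand-in assumptions, not the study's montage or pipeline.

```python
# Sketch: time-resolved category decoding from EEG epochs.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# Stand-in epochs: replace with (trials, sensors, time points) EEG data.
epochs = rng.standard_normal((100, 64, 50))
labels = rng.integers(0, 2, 100)
parietal = list(range(0, 16))   # hypothetical dorsal-site sensor indices
temporal = list(range(32, 48))  # hypothetical ventral-site sensor indices

def decoding_timecourse(x, y):
    """Cross-validated decoding accuracy at each time point."""
    return np.array([cross_val_score(LinearSVC(), x[:, :, t], y, cv=5).mean()
                     for t in range(x.shape[-1])])

# Compare the onset of above-chance decoding between sensor groups.
dorsal_acc = decoding_timecourse(epochs[:, parietal, :], labels)
ventral_acc = decoding_timecourse(epochs[:, temporal, :], labels)
```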
- Research Article
32
- 10.1371/journal.pone.0209256
- Dec 14, 2018
- PLOS ONE
Dyscalculia, a specific learning disability that impacts arithmetical skills, has previously been associated with a deficit in the precision of the system that estimates the approximate number of objects in visual scenes (the so-called ‘number sense’ system). However, because in tasks involving numerosity comparisons dyscalculics’ judgements appear disproportionately affected by continuous quantitative dimensions (such as the size of the items), an alternative view linked dyscalculia to a domain-general difficulty in inhibiting task-irrelevant responses. To arbitrate between these views, we evaluated the degree of reciprocal interference between numerical and non-numerical quantitative dimensions in adult dyscalculics and matched controls. We used a novel stimulus set orthogonally varying in mean item size and numerosity, taking particular care to match the perceptual discriminability of both features. Participants compared those stimuli based on each of the two dimensions. While control subjects showed no significant size interference when judging numerosity, dyscalculics’ numerosity judgments were strongly biased by the unattended size dimension. Importantly, however, both groups showed the same degree of interference from the unattended dimension when judging mean size. Moreover, only the ability to discard the irrelevant size information when comparing numerosity (but not the reverse) significantly predicted calculation ability across subjects. Overall, our results show that numerosity discrimination is less prone to interference than discrimination of another quantitative feature (mean item size) when the perceptual discriminability of these features is matched, as here in control subjects. By quantifying, for the first time, dyscalculic subjects’ degree of interference on another orthogonal dimension of the same stimuli, we are able to exclude a domain-general inhibition deficit as an explanation for their poor or biased numerical judgement. We suggest that enhanced reliance on non-numerical cues during numerosity discrimination can represent a strategy to cope with a less precise number sense.
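A sketch of one way to quantify the size interference described here: fit psychometric functions to numerosity choices separately for size-congruent and size-incongruent trials and take the shift in the point of subjective equality. The simulated trial data and the cumulative-Gaussian model are stand-in assumptions, not the study's analysis.

```python
# Sketch: size interference on numerosity judgments as a PSE shift
# between size-congruent and size-incongruent trials.
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import norm

rng = np.random.default_rng(0)
# Stand-in trials: replace with per-trial log numerosity ratios, choices,
# and size-congruency flags from the comparison task.
log_ratio = rng.uniform(-0.5, 0.5, 400)
congruent = rng.integers(0, 2, 400).astype(bool)
chose_more = (rng.random(400) < norm.cdf(log_ratio / 0.2)).astype(float)

def psychometric(x, pse, sigma):
    """P(judge 'more numerous') as a cumulative Gaussian of log ratio."""
    return norm.cdf((x - pse) / sigma)

def fit_pse(x, y):
    (pse, _), _ = curve_fit(psychometric, x, y, p0=[0.0, 0.2])
    return pse

# Interference: PSE shift induced by the unattended size dimension.
interference = (fit_pse(log_ratio[~congruent], chose_more[~congruent])
                - fit_pse(log_ratio[congruent], chose_more[congruent]))
```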
- Research Article
15
- 10.1016/j.cub.2009.02.014
- Apr 1, 2009
- Current Biology
Visual Perception: Converging Mechanisms of Attention, Binding, and Segmentation?
- Book Chapter
5
- 10.1007/978-94-009-3833-5_4
- Jan 1, 1987
One of the most important tasks of the visual system in man is the recognition and identification of objects. An object within the field of view has place as well as form, color, size, texture, depth and motion. Many investigations have established the importance of shape for recognition. However, clinical and theoretical discussions of recognition of objects have tended to ignore the role of other visual properties in guiding recognition. This has been partly because shape information is usually more pertinent for manipulation purposes than any other object properties, but perhaps also because it is easier to think in terms of the geometry of spatial relations. Yet everyday visual experience tells one that other visual properties, such as texture, the pattern of an object’s material surface, can reliably guide recognition. For example, one can recognize a pineapple solely on the basis of the pattern of its skin, without needing to rely on additional information about its shape, size, or color. However, it is often harder to identify a lemon, an orange and even a cantaloupe only on the basis of their skin pattern. How does this come about?