Abstract
Visual encoding models are important computational models for understanding how information is processed along the visual stream. Many improved visual encoding models have been developed by refining the model architecture and the learning objective, but most of them rely on supervised learning. From the perspective of unsupervised learning mechanisms, this paper uses a pre-trained neural network to construct a visual encoding model based on contrastive self-supervised learning for the ventral visual stream measured by functional magnetic resonance imaging (fMRI). We first extracted features with a ResNet50 model pre-trained by contrastive self-supervised learning (the ResNet50-CSL model), then trained a linear regression model for each voxel, and finally calculated the prediction accuracy of each voxel. Compared with a ResNet50 model pre-trained on a supervised classification task, the ResNet50-CSL model achieved equal or even slightly better encoding performance in multiple visual cortical areas. Moreover, the ResNet50-CSL model forms hierarchical representations of input visual stimuli, similar to the hierarchical information processing of the human visual cortex. Our experimental results suggest that an encoding model based on contrastive self-supervised learning can compete with supervised encoding models, and that contrastive self-supervised learning is an effective way to extract brain-like representations.
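The pipeline described above (feature extraction, per-voxel linear regression, prediction accuracy) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the weight file name, the use of ridge regression as the regularized linear mapping, the train/test split sizes, and Pearson correlation as the accuracy metric are all assumptions made here.

```python
# Minimal sketch of the encoding pipeline (not the authors' implementation).
# Assumed inputs: `images` is an (N, 3, 224, 224) tensor of stimuli,
# `Y` is an (N, V) array of voxel responses, and "resnet50_csl.pth" is a
# hypothetical checkpoint holding contrastive self-supervised ResNet50 weights.
import numpy as np
import torch
import torchvision.models as models
from sklearn.linear_model import Ridge
from scipy.stats import pearsonr

# 1. Feature extractor: ResNet50 backbone with the classification head removed.
backbone = models.resnet50(weights=None)
backbone.load_state_dict(torch.load("resnet50_csl.pth"), strict=False)  # assumed CSL weights
backbone.fc = torch.nn.Identity()  # keep the 2048-d pooled features
backbone.eval()

with torch.no_grad():
    X = backbone(images).numpy()  # (N, 2048) stimulus features

# 2. A linear regression model per voxel (ridge used here as a stand-in
#    for a regularized linear mapping; split sizes are illustrative).
X_train, X_test = X[:1600], X[1600:]
Y_train, Y_test = Y[:1600], Y[1600:]
reg = Ridge(alpha=1.0).fit(X_train, Y_train)  # one weight vector per voxel
Y_pred = reg.predict(X_test)

# 3. Prediction accuracy: Pearson correlation between predicted and measured
#    responses, computed voxel by voxel.
acc = np.array([pearsonr(Y_pred[:, v], Y_test[:, v])[0] for v in range(Y.shape[1])])
print("mean prediction accuracy:", acc.mean())
```

The same features can then be grouped by ROI (V1-V4, LOC, PPA, FFA) to compare how well different network layers predict different stages of the ventral stream.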
Highlights
Understanding how the human brain functions is a subject that neuroscientists are constantly exploring, and the visual system is one of the most widely and deeply studied sensory systems [1]
The visual encoding model based on functional magnetic resonance imaging (fMRI) is a mathematical model that simulates the brain's visual information processing in order to predict fMRI activity for any visual input stimulus, based on a known or assumed visual perception mechanism; it describes the relationship between visual inputs and fMRI responses [5,6]
The fMRI data were divided into seven distinct visual regions of interest, including V1, V2, V3, V4, the lateral occipital complex (LOC), the parahippocampal place area (PPA), and the fusiform face area (FFA)
Summary
Understanding how the human brain functions is a subject that neuroscientists are constantly exploring, and the visual system is one of the most widely and deeply studied sensory systems [1]. The visual encoding model based on fMRI is a mathematical model that simulates the brain's visual information processing in order to predict fMRI activity for any visual input stimulus, based on a known or assumed visual perception mechanism; it describes the relationship between visual inputs and fMRI responses [5,6]. Visual information is processed by a cascade of neural computations [7,8]. This process is extremely complex, and the mapping from the input stimulus space to the brain activity space can be regarded as nonlinear. Because the mechanism of brain visual information processing remains unclear, it is difficult to construct a model that directly characterizes this nonlinear relationship, so a linearizing feature space is usually introduced to assist model building [9]. The construction of the feature space is the core of the linearizing encoding model and determines its encoding performance.
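Concretely, a linearizing encoding model is usually written in the following standard form (the notation below is introduced here for illustration and is not taken verbatim from the paper): a fixed nonlinear feature map absorbs the hard part of the stimulus-to-brain mapping, and only a per-voxel linear readout is learned from data.

```latex
% Sketch of a linearizing encoding model (notation assumed here):
%   \phi(s)   - nonlinear feature vector of stimulus s (e.g., ResNet50-CSL activations)
%   w_v, b_v  - linear regression weights and bias fitted for voxel v
%   r_v(s)    - predicted fMRI response of voxel v
r_v(s) \approx \mathbf{w}_v^{\top} \, \phi(s) + b_v
```

Under this formulation, swapping supervised for contrastive self-supervised pre-training changes only the feature map, which is why the choice of feature space determines the encoding performance.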