Introduction: Humans can rapidly categorize scenes (VanRullen & Thorpe, 2001), even using peripheral vision (Larson & Loschky, 2009). Several computational models explain rapid scene categorization in terms of low-level properties such as the spatial envelope (Oliva & Torralba, 2001) and texture summary statistics (TTM; Rosenholtz et al., 2012). Yet these models do not explicitly account for the foveated properties of the visual system or for the interaction between eye movements and the scene categorization task. We propose a model with a foveated visual system and eye movements that can predict how human categorization performance depends on the number of fixations. The model combines square pooling regions with the vision transformer architecture from computer vision (Dosovitskiy et al., 2020; Touvron et al., 2020) and makes multiple fixations to maximize classification performance using self-attention (Parikh et al., 2016; Bahdanau et al., 2015).
Methods: Twenty-two participants classified 360 images (Places365 database, places2.csail.mit.edu) into 30 classes. Images subtended 22.7 degrees of visual angle. A gaze-contingent display interrupted the image after a randomly selected 1, 2, 3, or 4 fixations, with an initial forced fixation at the bottom-center or top-center.
Results: We found no significant improvement in performance after the second fixation (Δ correct categorization = 0.015; p = 0.4729), unlike performance for object search (Koehler & Eckstein, 2017). The model correctly predicts modest classification improvements for free-viewing fixations (Δ = 0.016). The model-human correlation in classification choices was not significantly lower than the human-human correlation. Our findings suggest that human categorization of scenes within a single fixation can be explained by the spatially global distribution of the visual information in the scene and its availability even through the bottleneck of the visual periphery.
The proposed hybrid approach, which combines biologically based modeling with Transformers, can be flexibly applied to various naturalistic tasks and stimuli.