Invited Session IV: Top-down vs. bottom-up approaches to computational modeling of vision: Limits of prediction accuracy on randomly selected natural images for model evaluation

Mark Lescroart

doi:10.1167/jov.22.3.52

Abstract

Prediction accuracy on held-out data has become a critical analysis for quantitative model evaluation and hypothesis testing in computational cognitive neuroscience. In this talk, I will discuss the limits of prediction accuracy as a standalone metric, and highlight other considerations for model evaluation and interpretation. First, comparing two models on prediction accuracy alone does not reveal the degree to which the models have common underlying factors. I will advocate addressing this issue with variance partitioning, a form of commonality analysis, to reveal shared and unique variance explained by different models. Concretely, I will show how variance partitioning reveals representation of body parts and object boundaries in responses to multiple data sets of movie stimuli. Second, prediction accuracy is a metric for the variance explained by a given model. But for any experiment, the stimulus constrains the variance in the measured brain responses. Any given stimulus set runs the risk of excluding important sources of variation. A popular way to address this issue is to use photographs or movie clips as stimuli. Such naturalistic stimuli are typically sampled broadly from the world and thus have increased ecological validity, but random selection of natural stimuli often results in correlated features both within and between models. This often leads to ambiguous results, e.g. shared variance between models intended to capture different types of features. Furthermore, I will show that the same models (again of body parts and object boundaries) can yield quantitatively and in some cases qualitatively different results when applied to different data sets. This raises a critical question: if results for the same model vary across stimulus sets, which result provides a more solid basis for future work? 
Just as two clocks showing different times must be set against a reference clock, I will argue that we need broadly sampled sets of natural stimuli to serve as a baseline for what constitutes "natural" variation and covariation in various feature domains. I will describe our collaboration to create just such a dataset of human visual experience, in the form of hundreds of hours of first-person video.
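
As an illustration of the variance partitioning idea mentioned in the abstract, the following is a minimal sketch on simulated data, not the analysis used in the talk. It fits two feature spaces separately and jointly, scores each on held-out data, and derives unique and shared variance from the three R² values. All names (`heldout_r2`, `partition_variance`, the simulated features) are hypothetical, and ordinary least squares is an assumption made for brevity, since the abstract does not specify a regression method.

```python
import numpy as np


def heldout_r2(X_train, y_train, X_test, y_test):
    """Fit ordinary least squares on the training split; return R^2 on the test split."""
    # Prepend an intercept column to both design matrices.
    Xtr = np.column_stack([np.ones(len(X_train)), X_train])
    Xte = np.column_stack([np.ones(len(X_test)), X_test])
    weights, *_ = np.linalg.lstsq(Xtr, y_train, rcond=None)
    residual = y_test - Xte @ weights
    ss_res = np.sum(residual ** 2)
    ss_tot = np.sum((y_test - y_test.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def partition_variance(XA_tr, XB_tr, y_tr, XA_te, XB_te, y_te):
    """Two-model commonality analysis from held-out R^2 values."""
    r2_a = heldout_r2(XA_tr, y_tr, XA_te, y_te)
    r2_b = heldout_r2(XB_tr, y_tr, XB_te, y_te)
    # Joint model: concatenate the two feature spaces.
    r2_ab = heldout_r2(np.hstack([XA_tr, XB_tr]), y_tr,
                       np.hstack([XA_te, XB_te]), y_te)
    return {
        "unique_a": r2_ab - r2_b,       # explained only by model A
        "unique_b": r2_ab - r2_a,       # explained only by model B
        "shared": r2_a + r2_b - r2_ab,  # explained by either model
        "joint": r2_ab,
    }


# Simulated example (assumption): both feature spaces contain a noisy copy of a
# common factor, mimicking correlated features in natural stimuli.
rng = np.random.default_rng(0)
n = 2000
common = rng.standard_normal(n)
a_only = rng.standard_normal(n)
b_only = rng.standard_normal(n)
XA = np.column_stack([common + 0.1 * rng.standard_normal(n), a_only])
XB = np.column_stack([common + 0.1 * rng.standard_normal(n), b_only])
y = common + a_only + b_only + 0.5 * rng.standard_normal(n)

half = n // 2
parts = partition_variance(XA[:half], XB[:half], y[:half],
                           XA[half:], XB[half:], y[half:])
print(parts)
```

In this simulation the shared component is large because both feature spaces carry the common factor. That is the kind of ambiguity the abstract attributes to randomly selected natural stimuli, where features from different models tend to be correlated.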
