Humans easily make social evaluations from visual input, such as recognizing a social interaction or deciding who is a friend and who is a foe. Prior efforts to model this ability have suggested that visual features and motion cues alone cannot account for human performance in social tasks, such as distinguishing between helping and hindering interactions. On the other hand, generative Bayesian inference models, which make predictions based on simulations of agents’ social or physical goals, accurately predict human judgments, but are computationally very expensive and often impossible to apply to natural visual stimuli. Inspired by developmental work, we hypothesize that introducing inductive biases in visual models would allow them to make more human-like social judgments. Specifically, in this study, we investigate whether relational representations of visual stimuli, implemented with graph neural networks, can predict human social interaction judgments. We use the PHASE dataset, consisting of 2D animations of two agents and two objects resembling real-life social interactions, rated as friendly, neutral, or adversarial. We propose a graph neural network (GNN)-based architecture that takes in graph representations of a video and predicts the relationship between the agents. We collected human ratings for each of the 400 videos and found that our GNN model aligns with human judgments significantly better than a baseline visual model with the same visual/motion information but without the graph structure (79% vs. 62% prediction of human judgments). Intriguingly, explicitly adding relational information to the baseline model does not improve its performance, suggesting that graphical representations in particular are important to modeling human judgments.
Taken together, these results suggest that relational graphical representations of visual information can help artificial vision systems make more human-like social judgments without incurring the computational cost of Bayesian models, and provide insights into the computations that humans employ while making visual social judgments.
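To make the idea of a graph representation of a video frame concrete, the following is a minimal sketch, not the architecture described above. It assumes each frame is turned into a fully connected directed graph over the four entities (two agents, two objects), that each node carries position, velocity, and an agent flag, and that one round of message passing averages relative features from neighbors. The entity names, the feature layout, and the function names `build_frame_graph` and `message_passing_step` are all hypothetical, and the learned weights that a real GNN would apply are omitted.

```python
from itertools import permutations

# Hypothetical entity names: two agents and two objects per scene.
ENTITIES = ["agent_0", "agent_1", "object_0", "object_1"]


def build_frame_graph(positions, velocities):
    """Build node features and directed edges for a single frame.

    positions, velocities: dict mapping entity name -> (x, y) tuple.
    Assumed node feature layout: [x, y, vx, vy, is_agent].
    """
    nodes = {}
    for name in ENTITIES:
        x, y = positions[name]
        vx, vy = velocities[name]
        is_agent = 1.0 if name.startswith("agent") else 0.0
        nodes[name] = [x, y, vx, vy, is_agent]
    # Fully connected graph: every ordered pair (sender, receiver).
    edges = list(permutations(ENTITIES, 2))
    return nodes, edges


def message_passing_step(nodes, edges):
    """One unweighted message-passing round (no learned parameters).

    Each receiver averages (sender - receiver) feature differences over
    its incoming edges and appends that mean to its own features.
    """
    dim = len(next(iter(nodes.values())))
    sums = {name: [0.0] * dim for name in nodes}
    counts = {name: 0 for name in nodes}
    for sender, receiver in edges:
        for i in range(dim):
            sums[receiver][i] += nodes[sender][i] - nodes[receiver][i]
        counts[receiver] += 1
    updated = {}
    for name, feat in nodes.items():
        if counts[name]:
            mean_msg = [s / counts[name] for s in sums[name]]
        else:
            mean_msg = [0.0] * dim
        updated[name] = feat + mean_msg  # list concatenation
    return updated


# Toy frame: entities on the corners of a square, only agent_0 moving.
positions = {
    "agent_0": (0.0, 0.0),
    "agent_1": (2.0, 0.0),
    "object_0": (0.0, 2.0),
    "object_1": (2.0, 2.0),
}
velocities = {name: (0.0, 0.0) for name in ENTITIES}
velocities["agent_0"] = (1.0, 0.0)

nodes, edges = build_frame_graph(positions, velocities)
updated = message_passing_step(nodes, edges)
```

In a trained model, the difference and averaging operations would be replaced by learned edge and node update functions, and the per-frame node states would be pooled over time before a classifier predicts friendly, neutral, or adversarial. The sketch only illustrates how relations between entities enter the representation explicitly through edges.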