Understanding what others are doing is a fundamental aspect of social cognition and a skill that is arguably linked to visuospatial perspective taking (VPT), the ability to apprehend the spatial layout of a scene from another's perspective. Yet, with few notable exceptions, action understanding and VPT are rarely studied together. Participants (43 females, 37 males) made judgements about the spatial layout of objects in a scene from the perspective of an avatar who was positioned at 0°, 90°, 270° or 180° relative to the participant. In a variant of a traditional VPT task, the avatar either interacted with the objects in the scene, by pointing to or reaching for them, or was present but did not engage with the objects. Although the task was identical across all conditions (to say whether a target object was to the right or left of a control object), we show that the avatar's actions modulate performance. Specifically, participants were more accurate when the avatar engaged with the target object and, correspondingly, less accurate and slower when the avatar interacted with the control objects. As these effects were independent of the angular disparity between participant and avatar perspectives, we conclude that action understanding and VPT are likely linked via the rapid deployment of two separate cognitive mechanisms. All participants provided a measure of self-reported empathy, and we show that response times decrease with increasing empathy scores for female but not for male participants. However, within the range of ‘typical’ empathy scores, defined here as the interquartile range where 50% of the data lie, females were faster than males. These findings lend further insight into the relationship between spatial and social perspective taking.