Abstract

Predicting other people's upcoming actions is key to successful social interaction. Previous studies have started to disentangle the various sources of information that action observers exploit, including objects, movements, contextual cues, and features of the acting person's identity. Here, we focus on the role of static and dynamic inter-object spatial relations that change during an action. We designed a virtual reality setup and tested recognition speed for ten different manipulation actions. Importantly, all objects were abstracted by emulating them with cubes, so that participants could not infer an action from object identity. Instead, participants had to rely only on the limited information that comes from changes in the spatial relations between the cubes. Despite these constraints, participants were able to predict actions after observing, on average, less than 64% of an action's duration. Furthermore, we employed a computational model, the so-called enriched Semantic Event Chain (eSEC), which incorporates three types of spatial relations: (a) touching/untouching between objects, (b) static spatial relations between objects, and (c) dynamic spatial relations between objects during an action. Assuming the eSEC as an underlying model, we show with an information-theoretic analysis that humans mostly rely on a mixed-cue strategy when predicting actions, whereas machine-based action prediction can reach faster decisions based on individual cues. We argue that the human strategy, though slower, may be particularly beneficial for predicting natural and more complex actions with more variable or partial sources of information. Our findings contribute to the understanding of how individuals can infer the goals of observed actions even before those goals are fully accomplished, and they may open new avenues for building robots capable of conflict-free human-robot cooperation.
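To illustrate the three cue types named above, the following is a minimal sketch, not the authors' implementation, of how discrete relations of kinds (a)-(c) might be derived for a pair of tracked cubes; the thresholds, function names, and relation symbols ('T', 'Ab', 'MT', ...) are illustrative assumptions.

```python
import numpy as np

TOUCH_EPS = 0.01  # assumed contact tolerance in metres (hypothetical value)

def touching(p1, p2, size=0.1):
    """(a) Touching/untouching: contact if the gap between the faces of two
    axis-aligned cubes of edge length `size` falls below a small tolerance."""
    gap = np.max(np.abs(p1 - p2)) - size
    return "T" if gap < TOUCH_EPS else "N"

def static_relation(p1, p2):
    """(b) Static spatial relation of object 1 relative to object 2."""
    dz = p1[2] - p2[2]
    if dz > 0.05:
        return "Ab"  # above
    if dz < -0.05:
        return "Be"  # below
    return "Ar"      # around / beside

def dynamic_relation(p1_prev, p1, p2_prev, p2):
    """(c) Dynamic spatial relation: moving towards, apart, or stable."""
    d_prev = np.linalg.norm(p1_prev - p2_prev)
    d_now = np.linalg.norm(p1 - p2)
    if d_now < d_prev - 1e-3:
        return "MT"  # moving towards
    if d_now > d_prev + 1e-3:
        return "MA"  # moving apart
    return "S"       # distance stable

# Toy usage: one cube descending onto another.
p_prev, p = np.array([0.0, 0.0, 0.3]), np.array([0.0, 0.0, 0.2])
q = np.array([0.0, 0.0, 0.1])
print(touching(p, q), static_relation(p, q), dynamic_relation(p_prev, p, q, q))
```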

Highlights

  • Human beings excel at recognizing actions performed by others, and they do so even before the action goal has been effectively achieved [1, 2]

  • To model human action prediction based on enriched Semantic Event Chain (eSEC) matrices, we calculated the information gain associated with each eSEC column entry (see the sketch after this list)

  • In the human reaction-time experiments, response times that exceeded the length of the action video were treated as time-outs, and the corresponding trials (13 out of 14,700) were excluded from further analyses
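The following is a minimal sketch of the column-wise information-gain computation mentioned in the second highlight, assuming a uniform prior over candidate actions and treating each eSEC column as a tuple of discrete relation symbols. The action names, symbols, and toy matrices below are hypothetical and merely stand in for the study's ten manipulation actions.

```python
import math

# Toy eSEC matrices: each action is a list of columns; each column is a tuple
# of discrete relation symbols (touching, static, dynamic). Illustrative only.
ACTIONS = {
    "put_on":   [("N", "Ab", "MT"), ("T", "On", "S")],
    "take_off": [("T", "On", "S"), ("N", "Ab", "MA")],
    "push":     [("T", "Ar", "MT"), ("T", "Ar", "MT")],
}

def entropy(candidates):
    """Shannon entropy (bits) of a uniform belief over the candidate set."""
    n = len(candidates)
    return math.log2(n) if n > 0 else 0.0

def prefix_information_gain(actions, observed):
    """Bits gained by each successive observed eSEC column: H(before) minus
    H(after), pruning actions inconsistent with the observed prefix."""
    candidates = set(actions)
    gains = []
    for k, column in enumerate(observed):
        before = entropy(candidates)
        candidates = {a for a in candidates
                      if k < len(actions[a]) and actions[a][k] == column}
        gains.append(before - entropy(candidates))
    return gains

# The first column of "put_on" already singles it out among these toy actions.
print(prefix_information_gain(ACTIONS, ACTIONS["put_on"]))  # [~1.58, 0.0]
```

On this view, an action becomes predictable at the first column whose cumulative gain exhausts the initial uncertainty over the candidate set.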


Introduction

Human beings excel at recognizing actions performed by others, and they do so even before the action goal has been effectively achieved [1, 2]. A major aim of ongoing research is to disentangle the respective contribution and relevance of the sources of information feeding human action prediction. Since these sources are largely confounded even in simple instances of natural action, the experimental approach has to fully control, or altogether eliminate, all potentially confounding sources that are not in the focus of empirical testing. As the basis for spatial relation calculation, we use enriched semantic event chains (eSECs), introduced in our previous work on action recognition in computer vision [22]. This approach allows us to determine a sequence of discrete spatial relations between different objects in the scene throughout the manipulation. In Appendix 5, we provide the details of the machine algorithms.
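To make this concrete, here is a minimal sketch, not the authors' algorithm, of how a sequence of per-frame relation tuples could be compressed into such an event chain, where a new column is recorded only when a discrete relation changes; the `event_chain` helper and the relation symbols are illustrative assumptions.

```python
def event_chain(frames):
    """Compress a sequence of per-frame relation tuples into the sequence of
    distinct successive relation states (the columns of an eSEC-like matrix)."""
    chain = []
    for relations in frames:
        if not chain or chain[-1] != relations:
            chain.append(relations)
    return chain

# Illustrative per-frame relations for one object pair:
# 'N' = not touching, 'T' = touching, 'Ab' = above, 'On' = on top.
frames = [("N", "Ab"), ("N", "Ab"), ("T", "Ab"), ("T", "On"), ("T", "On")]
print(event_chain(frames))  # [('N', 'Ab'), ('T', 'Ab'), ('T', 'On')]
```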

Virtual reality videos
Behavioural study on action prediction
Machine prediction
Comparison of human and machine predictive performance
Information theoretical analysis
Results
Discussion
Limitations
Scenario recording
Stimuli
Details of machine action prediction
Mathematical definition of the spatial relations
Similarity measure between eSECs