Long term spatio-temporal modeling for action detection

Makarand Tapaswi, Vijay Kumar, Ivan Laptev

doi:10.1016/j.cviu.2021.103242

Open Access

https://doi.org/10.1016/j.cviu.2021.103242

Abstract

Modeling person interactions with their surroundings has proven to be effective for recognizing and localizing human actions in videos. While most recent works focus on learning short-term interactions, in this work, we consider long-term person interactions and jointly localize actions of multiple actors over an entire video shot. We construct a graph with nodes that correspond to keyframe actor instances and connect them with two edge types. Spatial edges connect actors within a keyframe, and temporal edges connect multiple instances of the same actor over a video shot. We propose a Graph Neural Network that explicitly models spatial and temporal states for each person instance and learns to effectively combine information from both modalities to make predictions at the same time. We conduct experiments on the AVA dataset and show that our graph-based model provides consistent improvements over several video descriptors, achieving state-of-the-art performance without any fine-tuning.
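
The graph structure described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `ActorInstance` type, the `build_edges` helper, and the toy shot are hypothetical, and the sketch assumes that every pair of instances of the same actor within a shot receives a temporal edge (the abstract does not say whether only consecutive instances are linked). It covers only the edge construction, not the Graph Neural Network itself.

```python
from itertools import combinations
from typing import NamedTuple


class ActorInstance(NamedTuple):
    """One graph node: an actor detected in one keyframe (hypothetical type)."""
    keyframe: int  # index of the keyframe within the video shot
    actor_id: int  # identity of the actor across keyframes (assumed to be given)


def build_edges(nodes):
    """Return (spatial_edges, temporal_edges) as lists of node-index pairs.

    Spatial edges link different actors in the same keyframe.
    Temporal edges link instances of the same actor in different keyframes
    (assumption: all such pairs are linked, not only consecutive ones).
    """
    spatial, temporal = [], []
    for (i, a), (j, b) in combinations(enumerate(nodes), 2):
        if a.keyframe == b.keyframe and a.actor_id != b.actor_id:
            spatial.append((i, j))
        elif a.actor_id == b.actor_id and a.keyframe != b.keyframe:
            temporal.append((i, j))
    return spatial, temporal


# Toy shot: actor 0 appears in three keyframes, actor 1 in the first two.
nodes = [
    ActorInstance(keyframe=0, actor_id=0),
    ActorInstance(keyframe=0, actor_id=1),
    ActorInstance(keyframe=1, actor_id=0),
    ActorInstance(keyframe=1, actor_id=1),
    ActorInstance(keyframe=2, actor_id=0),
]
spatial_edges, temporal_edges = build_edges(nodes)
print("spatial:", spatial_edges)
print("temporal:", temporal_edges)
```

Keeping the two edge types in separate lists mirrors the abstract's distinction between spatial and temporal connections, so a model could process each type with its own parameters before combining them.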

Full Text