Abstract

Video-based human-object interaction recognition is challenging because the states of objects, as well as the correlations among them, change constantly throughout a video. Existing methods mainly use 3D CNNs, or use separate components (e.g., GCN + RNN) to model spatial and temporal correlations independently, but they neither model spatio-temporal correlations jointly nor capture the long-term temporal dynamics of objects. In this paper, we propose a novel model, named Spatio-Temporal Interaction Graph Parsing Networks (STIGPN), for human-object interaction recognition in videos. STIGPN models spatial and temporal correlations simultaneously and can therefore capture intra-frame and inter-frame dependencies efficiently and effectively. To model the long-term temporal dynamics of objects, we introduce spatio-temporal feature enhancement, which improves the detection of salient human-object interaction pairs. We explore three types of spatio-temporal graph convolutions that capture spatio-temporal correlations simultaneously, and we assess the effectiveness of each as the basic building block of STIGPN. Extensive experiments on the CAD-120, Something-Else, and Charades datasets show that the proposed solution achieves results competitive with state-of-the-art methods. Code for STIGPN is available at: https://github.com/NingWang2049/STIGPN2.
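
To make the idea of capturing intra-frame and inter-frame dependencies in a single step more concrete, the following is a minimal NumPy sketch of one generic spatio-temporal graph convolution. It is an illustration under stated assumptions, not the STIGPN implementation: the graph layout (entities fully connected within a frame, each entity linked to itself in adjacent frames), the symmetric normalization, and the function names `build_st_adjacency`, `normalize_adjacency`, and `st_graph_conv` are all hypothetical choices made for this example, and the abstract does not specify which three convolution variants the paper explores.

```python
import numpy as np


def build_st_adjacency(num_frames, num_nodes):
    """Build one adjacency matrix over all (frame, entity) pairs.

    Assumed layout (not taken from the paper): entities in the same frame
    are fully connected (spatial edges, self-loops included), and each
    entity is linked to itself in the next frame (temporal edges).
    """
    size = num_frames * num_nodes
    adj = np.zeros((size, size), dtype=np.float64)
    idx = np.arange(num_nodes)
    for t in range(num_frames):
        start = t * num_nodes
        # Spatial edges: dense block for frame t, diagonal gives self-loops.
        adj[start:start + num_nodes, start:start + num_nodes] = 1.0
        if t + 1 < num_frames:
            nxt = start + num_nodes
            # Temporal edges: same entity in frames t and t + 1, both directions.
            adj[start + idx, nxt + idx] = 1.0
            adj[nxt + idx, start + idx] = 1.0
    return adj


def normalize_adjacency(adj):
    """Symmetric normalization D^{-1/2} A D^{-1/2}."""
    deg = adj.sum(axis=1)  # always >= 1 because of the self-loops
    inv_sqrt = 1.0 / np.sqrt(deg)
    return adj * inv_sqrt[:, None] * inv_sqrt[None, :]


def st_graph_conv(features, adj_norm, weight):
    """One layer: aggregate over spatial and temporal neighbors, project, ReLU."""
    return np.maximum(adj_norm @ features @ weight, 0.0)


# Usage with random placeholder features (4 frames, 3 entities per frame).
rng = np.random.default_rng(0)
T, N, C_in, C_out = 4, 3, 8, 16
x = rng.standard_normal((T * N, C_in))
w = rng.standard_normal((C_in, C_out))
adj = build_st_adjacency(T, N)
out = st_graph_conv(x, normalize_adjacency(adj), w)
print(out.shape)  # (12, 16)
```

Because spatial and temporal edges live in the same adjacency matrix, a single matrix product mixes information within and across frames, in contrast to a pipeline that applies a per-frame GCN and then a separate RNN over time.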
