Abstract

Action prediction in videos refers to inferring the action category label from an early observation of a video. Existing studies mainly focus on exploiting multiple visual cues to enhance the discriminative power of feature representations, while neglecting important structural information in videos, including the interactions and correlations between different object entities. In this paper, we focus on reasoning about the spatial–temporal relations between persons and contextual objects to interpret the observed part of a video and predict its action category. With this in mind, we propose a novel spatial–temporal relation reasoning approach that extracts the spatial relations between persons and objects in still frames and explores how these spatial relations change over time. Specifically, for spatial relation reasoning, we propose an improved gated graph neural network that performs relation reasoning between the visual objects in video frames. For temporal relation reasoning, we propose a long short-term graph network that models both the short-term and long-term dynamics of the spatial relations with multi-scale receptive fields. In this way, our approach can accurately recognize the video content in terms of fine-grained object relations in both the spatial and temporal domains to make prediction decisions. Moreover, to learn the latent correlations between spatial–temporal object relations and action categories in videos, a visual semantic relation loss is proposed to model the triple constraints between objects in the semantic domain via VTransE. Extensive experiments on five public video datasets (i.e., 20BN-something-something, CAD120, UCF101, BIT-Interaction and HMDB51) demonstrate the effectiveness of the proposed spatial–temporal relation reasoning for action prediction.
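To make the spatial relation reasoning step concrete, the sketch below illustrates generic gated graph message passing over the detected entities (persons and contextual objects) of a single frame, in the spirit of a gated graph neural network. This is a minimal PyTorch sketch under assumed interfaces: the class name `SpatialRelationGGNN`, the feature dimension, the number of propagation steps, and the soft adjacency construction are illustrative assumptions, not the paper's improved GGNN.

```python
import torch
import torch.nn as nn


class SpatialRelationGGNN(nn.Module):
    """Minimal gated graph message passing over the entity nodes of one frame.
    A hypothetical sketch, not the authors' improved GGNN."""

    def __init__(self, node_dim: int, num_steps: int = 3):
        super().__init__()
        self.num_steps = num_steps
        self.message_fc = nn.Linear(node_dim, node_dim)   # transforms neighbor features into messages
        self.update_gru = nn.GRUCell(node_dim, node_dim)  # gated update of each node state

    def forward(self, nodes: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # nodes: (N, node_dim) appearance features of the N detected entities in a frame
        # adj:   (N, N) pairwise spatial-relation weights between those entities
        h = nodes
        for _ in range(self.num_steps):
            messages = adj @ self.message_fc(h)  # aggregate messages from related entities
            h = self.update_gru(messages, h)     # gate decides how much relation evidence to absorb
        return h  # relation-aware node states, e.g. to be fed to temporal reasoning


# Toy usage: one person plus four contextual objects with 256-d features.
nodes = torch.randn(5, 256)
adj = torch.softmax(torch.randn(5, 5), dim=-1)  # assumed soft adjacency over entity pairs
refined = SpatialRelationGGNN(node_dim=256)(nodes, adj)
print(refined.shape)  # torch.Size([5, 256])
```

In this reading, the gating lets each entity selectively absorb evidence from the entities it interacts with, yielding relation-aware node states per frame; the paper's temporal component would then model how these states evolve across frames.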
