Abstract

Human action analysis is a critical yet challenging task for understanding diverse video content. Recently, to enable explainable reasoning about video actions, a spatio-temporal video graph structure was proposed to represent video state changes at the semantic level. However, its requirement of tedious manual annotation of every video frame is a serious limitation; the approach would be far more widely applicable if the video graph generation process could be automated. In this paper, a One-Shot Video Graph (OSVG) generation approach is proposed for more effective explainable action reasoning, which requires only a one-time annotation of the objects in the starting frame of the video. We first localize the predefined relevant objects across the temporal dimension with a proposed one-shot target-aware tracking strategy, which obtains the object locations and links the objects across all video frames simultaneously. The scene graph of each video frame is then constructed by an attribute detector and a relationship detector based on the estimated object locations. In addition, to further improve the accuracy of action reasoning, a video graph smoothing mechanism is designed with a fully-connected Conditional Random Field (CRF). By sequentially examining every state transition (including attributes and relationships) in the smoothed video graph, the occurring actions can be recognized using predefined rules. Experiments on the CAD-120++ dataset and a newly collected NTU RGBD++ dataset verify that the proposed OSVG outperforms other state-of-the-art video action reasoning strategies in both state recognition and action recognition accuracy.
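
As a rough illustration of the rule-based reasoning step described above, the sketch below compares two consecutive (smoothed) frame-level graph states and emits an action label whenever an attribute or relationship transition matches a predefined rule. The data layout, rule format, and all object, attribute, and action names are hypothetical placeholders for exposition only, not the paper's actual implementation.

```python
# Minimal sketch of rule-based action reasoning over state transitions
# between consecutive frame-level scene graphs. All keys, values, and
# rules below are illustrative assumptions, not the authors' code.

FrameState = dict  # e.g. {"attributes": {...}, "relationships": {...}}

# Illustrative rules: an action is recognized when an attribute or a
# relationship changes between two consecutive smoothed frame graphs.
RULES = [
    {"type": "attribute", "key": ("microwave", "door"),
     "from": "closed", "to": "open", "action": "opening"},
    {"type": "relationship", "key": ("hand", "cup"),
     "from": "not_contacting", "to": "holding", "action": "picking_up"},
]


def detect_actions(prev: FrameState, curr: FrameState) -> list:
    """Compare two consecutive frame states and return triggered actions."""
    actions = []
    for rule in RULES:
        field = "attributes" if rule["type"] == "attribute" else "relationships"
        before = prev[field].get(rule["key"])
        after = curr[field].get(rule["key"])
        if before == rule["from"] and after == rule["to"]:
            actions.append(rule["action"])
    return actions


# Toy usage: two consecutive smoothed frame states from the video graph.
frame_t = {"attributes": {("microwave", "door"): "closed"},
           "relationships": {("hand", "cup"): "not_contacting"}}
frame_t1 = {"attributes": {("microwave", "door"): "open"},
            "relationships": {("hand", "cup"): "holding"}}

print(detect_actions(frame_t, frame_t1))  # ['opening', 'picking_up']
```

In the full pipeline, the same pairwise check would be applied to every pair of consecutive frames of the smoothed video graph, so that each recognized action is grounded in an explicit, inspectable state transition.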
