Abstract

This paper proposes a novel architecture for spatio-temporal action localization in videos. The architecture first employs a two-stream 3D convolutional neural network (3D-CNN) to produce initial action detections. Next, a new Hierarchical Self-Attention Network (HiSAN), the core of the architecture, is developed to learn the spatio-temporal relationships of key actors. Spatial Gaussian priors (SGP) are also incorporated into the bidirectional self-attention to enhance HiSAN's modelling of the relationships among neighboring actors. This combination of the 3D-CNN and the SGP-augmented HiSAN allows us to effectively capture both spatial context information and long-term temporal dependencies, improving action localization accuracy. Afterwards, a new fusion strategy is employed, which first re-scores the bounding boxes to resolve inconsistent detection scores caused by background clutter or occlusion, and then aggregates the motion and appearance information from the two-stream network with motion saliency to alleviate the impact of camera movement. Finally, a tube association network based on the self-similarity of actors' appearance and spatial information across frames is introduced to efficiently construct the action tubes. Experiments on four widely used datasets demonstrate the efficacy of the proposed approach.
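To make the SGP mechanism concrete, the sketch below shows one plausible way a spatial Gaussian prior can bias self-attention over detected actors: the attention logits between two actors are augmented by a log-Gaussian of the distance between their box centers, so that nearby actors attend to each other more strongly. This is a minimal illustration, not the authors' implementation; the function name, the single-head formulation, and the scalar bandwidth `sigma` are all assumptions.

```python
# Minimal sketch (not the paper's code) of self-attention over actor
# features augmented with a spatial Gaussian prior on attention logits.
import torch
import torch.nn.functional as F

def sgp_self_attention(actor_feats, centers, w_q, w_k, w_v, sigma=1.0):
    """actor_feats: (N, D) features of N detected actors in a frame.
    centers: (N, 2) normalized bounding-box centers of those actors.
    w_q, w_k, w_v: (D, D) query/key/value projection matrices.
    sigma: bandwidth of the spatial Gaussian prior (assumed scalar)."""
    q, k, v = actor_feats @ w_q, actor_feats @ w_k, actor_feats @ w_v
    d = q.size(-1)
    logits = (q @ k.t()) / d ** 0.5               # content-based affinity
    dist2 = torch.cdist(centers, centers).pow(2)  # pairwise squared distances
    prior = -dist2 / (2.0 * sigma ** 2)           # log-Gaussian spatial bias
    attn = F.softmax(logits + prior, dim=-1)      # prior-augmented attention
    return attn @ v

# Toy usage: 4 actors with 64-dimensional features.
N, D = 4, 64
feats, ctrs = torch.randn(N, D), torch.rand(N, 2)
w = [torch.randn(D, D) / D ** 0.5 for _ in range(3)]
out = sgp_self_attention(feats, ctrs, *w)
print(out.shape)  # torch.Size([4, 64])
```

Adding the prior in log-space before the softmax, as here, is the usual way such biases are folded into attention; actors far apart receive exponentially down-weighted attention while content affinity still governs nearby pairs.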
