Abstract

Action recognition is an important yet challenging problem. Attention mechanisms are widely used to extract informative features for action recognition, but most previous attention models treat spatial attention and temporal attention as independent. In this paper, we propose a novel nesting spatiotemporal attention network (NST) model in which spatial attention and temporal attention are closely nested and interact with each other. A nesting spatiotemporal attention block contains a spatial attention module and a nested temporal attention module. The spatial attention module learns features that assign different weights to the spatial areas in each video frame. Based on the spatially attended areas, the temporal attention module assigns different weights to different frames. In this way, the most informative regions in each frame and the key frames in the video sequence are jointly mined and enhanced. The overall architecture is constructed by inserting nesting spatiotemporal attention blocks into base networks for action recognition. The proposed model was tested on challenging datasets, and the experimental results show that our method outperforms the compared methods.
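The nesting idea described above can be illustrated with a minimal sketch. This is not the NST implementation: the scoring vectors `w_s` and `w_t`, the softmax weighting, the region-based feature layout, and the function name `nested_attention` are all assumptions chosen for illustration. The sketch only shows the dependency the abstract describes, namely that frame weights are computed from spatially attended frame features rather than independently of them.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def weighted_sum(weights, vectors):
    """Weighted sum of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]

def nested_attention(video, w_s, w_t):
    """video: list of T frames, each a list of R region feature vectors (length C).

    w_s and w_t are hypothetical scoring vectors standing in for learned parameters.
    """
    spatial_weights = []
    frame_feats = []
    for regions in video:
        # Spatial attention: one weight per region within this frame.
        alpha = softmax([dot(w_s, r) for r in regions])
        spatial_weights.append(alpha)
        frame_feats.append(weighted_sum(alpha, regions))
    # Temporal attention is nested: it scores the spatially attended frame features.
    beta = softmax([dot(w_t, f) for f in frame_feats])
    video_feat = weighted_sum(beta, frame_feats)
    return spatial_weights, beta, video_feat

# Toy input: T=4 frames, R=6 regions per frame, C=8 feature channels.
random.seed(0)
T, R, C = 4, 6, 8
video = [[[random.gauss(0, 1) for _ in range(C)] for _ in range(R)] for _ in range(T)]
w_s = [random.gauss(0, 1) for _ in range(C)]
w_t = [random.gauss(0, 1) for _ in range(C)]
spatial_weights, beta, video_feat = nested_attention(video, w_s, w_t)
```

Here `spatial_weights[t]` sums to 1 over the regions of frame `t`, and `beta` sums to 1 over frames, so the final `video_feat` is a convex combination of region features. Changing `w_s` changes `beta` as well, which is the sense in which the temporal weights depend on the spatial ones in this sketch.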
