Abstract

Action recognition is an important yet challenging problem. While attention mechanisms are widely used to extract informative features for action recognition, most previous attention models treat spatial attention and temporal attention as independent. In this paper, we propose a novel nesting spatiotemporal attention network (NST) in which spatial attention and temporal attention are closely nested and interact with each other. A nesting spatiotemporal attention block contains a spatial attention module and a nested temporal attention module. The spatial attention module learns to assign different weights to the spatial areas in each video frame. Based on the spatially attended areas, the temporal attention module assigns different weights to different frames. In this way, the most informative regions in each frame and the key frames in the video sequence are jointly mined and enhanced. The overall architecture is constructed by inserting nesting spatiotemporal attention blocks into base networks for action recognition. The proposed model was tested on challenging datasets, and the experimental results show that it outperforms the compared methods.
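
The nesting idea described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `nested_attention`, the linear scoring vectors `w_spatial` and `w_temporal`, the softmax normalization, and the weighted-sum pooling are all assumptions chosen for illustration. The sketch only shows the ordering of the two steps: spatial weights are computed within each frame first, and the temporal weights are then computed from the spatially attended frame descriptors.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def nested_attention(feats, w_spatial, w_temporal):
    """Hypothetical nested spatial-then-temporal attention.

    feats:      (T, C, H, W) per-frame feature maps from some base network
    w_spatial:  (C,) assumed linear scoring vector for spatial locations
    w_temporal: (C,) assumed linear scoring vector for frames
    """
    T, C, H, W = feats.shape

    # Spatial attention: one score per location, normalized within each frame.
    s_scores = np.einsum("tchw,c->thw", feats, w_spatial)
    s_weights = softmax(s_scores.reshape(T, H * W), axis=1).reshape(T, H, W)

    # Spatially attended frame descriptors, shape (T, C).
    frame_desc = np.einsum("tchw,thw->tc", feats, s_weights)

    # Temporal attention is nested: it scores the attended descriptors,
    # not the raw feature maps, and is normalized across frames.
    t_scores = frame_desc @ w_temporal
    t_weights = softmax(t_scores, axis=0)

    # Video-level descriptor: temporally weighted sum of frame descriptors.
    video_desc = t_weights @ frame_desc
    return s_weights, t_weights, video_desc

# Example with random features (shapes are arbitrary, for illustration only).
rng = np.random.default_rng(0)
feats = rng.standard_normal((8, 16, 7, 7))
w_spatial = rng.standard_normal(16)
w_temporal = rng.standard_normal(16)
s_weights, t_weights, video_desc = nested_attention(feats, w_spatial, w_temporal)
```

In a trained network the scoring vectors would be learned parameters and the block would be inserted between layers of a base network; here they are random so that the data flow can be inspected in isolation.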
