Abstract
Attention mechanisms play a crucial role in improving action recognition performance. A video, as a type of 3D data, can be explored effectively with attention along the temporal, spatial, and channel dimensions. However, existing methods based on 2D CNNs tend to handle complex spatiotemporal information along only one or two of these dimensions, which limits their overall performance. In this paper, we propose a novel Comprehensive Attention Network (CANet) that adaptively models spatiotemporal information in all three dimensions. CANet is composed of three core plug-and-play components: the Global Guided Short-term Motion Module (GG-SMM), the Second-order Guided Long-term Motion Module (SG-LMM), and the Spatial Motion Adaptive Module (SMAM). Specifically, (1) the GG-SMM represents local motion cues in the short-term temporal dimension to improve the classification accuracy of fast-tempo actions; (2) the SG-LMM jointly enhances fine-grained motion information in the long-term temporal and channel dimensions, thereby facilitating the discrimination of long-term motions; and (3) the SMAM represents motion-sensitive regions in the spatial dimension by learning spatial object offsets. Extensive experiments are conducted on four widely used action recognition benchmarks: Something-Something V1, Kinetics-400, UCF-101, and HMDB-51. The experimental results demonstrate that the proposed CANet performs favorably against other state-of-the-art methods.
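To make the notion of attending along the temporal, channel, and spatial dimensions concrete, the following is a minimal, parameter-free sketch. It is not the GG-SMM, SG-LMM, or SMAM described above, whose internals are not specified here. It assumes a feature tensor stored as nested lists indexed `x[t][c][h][w]`, and the helper names `axis_gates` and `apply_gates` are hypothetical. Each gate is simply the sigmoid of the mean activation over the remaining axes, standing in for the learned attention a real module would compute.

```python
import math


def sigmoid(v):
    """Logistic function mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))


def axis_gates(x):
    """Compute one gate per frame, per channel, and per spatial location.

    x is a nested list indexed x[t][c][h][w]. Each gate is the sigmoid of
    the mean over all other axes (an illustrative assumption, not the
    learned attention used by CANet).
    """
    T, C, H, W = len(x), len(x[0]), len(x[0][0]), len(x[0][0][0])
    # Temporal gate: average each frame over channels and space.
    t_gate = [
        sigmoid(sum(x[t][c][h][w] for c in range(C)
                    for h in range(H) for w in range(W)) / (C * H * W))
        for t in range(T)
    ]
    # Channel gate: average each channel over time and space.
    c_gate = [
        sigmoid(sum(x[t][c][h][w] for t in range(T)
                    for h in range(H) for w in range(W)) / (T * H * W))
        for c in range(C)
    ]
    # Spatial gate: average each location over time and channels.
    s_gate = [
        [sigmoid(sum(x[t][c][h][w] for t in range(T)
                     for c in range(C)) / (T * C))
         for w in range(W)]
        for h in range(H)
    ]
    return t_gate, c_gate, s_gate


def apply_gates(x, t_gate, c_gate, s_gate):
    """Rescale every activation by its temporal, channel, and spatial gates."""
    return [
        [[[v * t_gate[t] * c_gate[c] * s_gate[h][w]
           for w, v in enumerate(row)]
          for h, row in enumerate(plane)]
         for c, plane in enumerate(frame)]
        for t, frame in enumerate(x)
    ]


if __name__ == "__main__":
    # Toy clip: 2 frames, 2 channels, 2x2 spatial grid, all activations 1.0.
    clip = [[[[1.0] * 2 for _ in range(2)] for _ in range(2)] for _ in range(2)]
    gates = axis_gates(clip)
    reweighted = apply_gates(clip, *gates)
    print(len(reweighted), len(reweighted[0]), reweighted[0][0][0][0])
```

The output keeps the input shape, so a block of this kind can in principle be dropped between existing layers, which is the sense in which such components are plug-and-play.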