Abstract

Attention mechanisms play a crucial role in improving action recognition performance. A video is a form of 3D data that attention mechanisms can explore along the temporal, spatial, and channel dimensions. However, existing 2D-CNN-based methods tend to handle complex spatiotemporal information along only one or two of these dimensions, which ultimately limits their performance. In this paper, we propose a novel Comprehensive Attention Network (CANet) that adaptively models spatiotemporal information in all three dimensions. CANet comprises three core plug-and-play components: the Global Guided Short-term Motion Module (GG-SMM), the Second-order Guided Long-term Motion Module (SG-LMM), and the Spatial Motion Adaptive Module (SMAM). Specifically, (1) the GG-SMM captures local motion cues in the short-term temporal dimension to improve classification accuracy on fast-tempo actions; (2) the SG-LMM jointly excites fine-grained motion information in the long-term temporal and channel dimensions, facilitating the discrimination of long-term motions; and (3) the SMAM identifies motion-sensitive regions in the spatial dimension by learning spatial object offsets. Extensive experiments on four widely used action recognition benchmarks, namely Something-Something V1, Kinetics-400, UCF-101, and HMDB-51, demonstrate that the proposed CANet achieves excellent performance compared with other state-of-the-art methods.
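To make the three-dimensional attention idea concrete, the sketch below gates a video tensor along its channel, temporal, and spatial axes. This is not the CANet architecture itself (the abstract does not specify the modules' internals); the gating functions here are hypothetical stand-ins chosen only to illustrate that each dimension receives its own attention weights, with the temporal gate driven by frame-to-frame differences in the spirit of a short-term motion module.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def three_dim_attention(x):
    """Toy illustration of attending over temporal, spatial, and channel
    dimensions of a video clip x with shape (T, C, H, W).

    NOTE: not the paper's CANet -- a minimal, hypothetical sketch of the
    general principle that each dimension gets its own gate.
    """
    T, C, H, W = x.shape

    # Channel gate: global average pool over (T, H, W), squashed to (0, 1).
    chan = sigmoid(x.mean(axis=(0, 2, 3)))                  # shape (C,)

    # Short-term temporal gate: magnitude of frame-to-frame differences,
    # padded so every frame receives a weight.
    diff = np.abs(np.diff(x, axis=0)).mean(axis=(1, 2, 3))  # shape (T-1,)
    temp = sigmoid(np.concatenate([[diff.mean()], diff]))   # shape (T,)

    # Spatial gate: channel-averaged activation map per frame.
    spat = sigmoid(x.mean(axis=1, keepdims=True))           # (T, 1, H, W)

    # Apply all three gates; broadcasting fills the remaining axes.
    return (x
            * chan[None, :, None, None]
            * temp[:, None, None, None]
            * spat)

clip = np.random.rand(8, 16, 7, 7)   # (frames, channels, height, width)
out = three_dim_attention(clip)
print(out.shape)  # (8, 16, 7, 7)
```

Because every gate lies in (0, 1) and the result keeps the input's shape, such a module is plug-and-play: it can be dropped between the layers of an existing 2D CNN backbone without changing the rest of the network.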
