Abstract

The channel attention mechanism has attracted strong interest and shown great potential for enhancing the performance of deep CNNs. However, when applied to video-based human action recognition, most existing methods learn channel attention at the frame level, which ignores temporal dependencies and may limit recognition performance. In this paper, we propose a novel multi-level channel attention excitation (MCAE) module to model temporal-related channel attention at both the frame and video levels. Specifically, based on video convolutional feature maps, frame-level channel attention (FCA) is generated by exploring time-channel correlations, and video-level channel attention (VCA) is generated by aggregating global motion variations. MCAE first recalibrates video feature responses with frame-wise FCA, and then activates motion-sensitive channel features with motion-aware VCA. The MCAE module learns channel discriminability at multiple levels and can serve as a guide for efficient spatiotemporal feature modeling in the activated motion-sensitive channels. It can be flexibly embedded into 2D networks at very limited extra computational cost to construct MCAE-Net, which effectively enhances the spatiotemporal representations of 2D models for video action recognition. Extensive experiments on five human action datasets show that our method achieves superior or highly competitive performance compared with the state of the art, demonstrating its effectiveness for improving human action recognition.
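The two-stage excitation described above (frame-wise FCA recalibration followed by motion-aware VCA activation) can be illustrated with a minimal NumPy sketch. This is an assumption-laden toy, not the paper's implementation: the learned parameters `w_fca` and `w_vca` are hypothetical linear maps standing in for whatever sub-networks MCAE actually uses, FCA is computed here from per-frame pooled descriptors, and "global motion variation" is approximated as the mean difference of pooled features between consecutive frames.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mcae_excite(x, w_fca, w_vca):
    """Toy MCAE-style excitation on video features x of shape (T, C, H, W).

    w_fca, w_vca: hypothetical (C, C) learned weights (placeholders for the
    module's actual attention sub-networks, which the abstract does not detail).
    """
    t, c, h, w = x.shape
    # Per-frame channel descriptors via global average pooling -> (T, C)
    pooled = x.mean(axis=(2, 3))
    # FCA: frame-level channel attention from the pooled descriptors
    fca = sigmoid(pooled @ w_fca)                      # (T, C), values in (0, 1)
    x = x * fca[:, :, None, None]                      # frame-wise recalibration
    # VCA: aggregate motion variation as consecutive-frame differences,
    # averaged over time to give one video-level channel attention vector
    motion = np.diff(pooled, axis=0).mean(axis=0)      # (C,)
    vca = sigmoid(motion @ w_vca)                      # (C,), values in (0, 1)
    # Activate motion-sensitive channels across the whole clip
    return x * vca[None, :, None, None]

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16, 7, 7))        # 8 frames, 16 channels, 7x7 maps
w_fca = 0.1 * rng.standard_normal((16, 16))
w_vca = 0.1 * rng.standard_normal((16, 16))
y = mcae_excite(x, w_fca, w_vca)
print(y.shape)  # same shape as the input features
```

Because both attention maps pass through a sigmoid, each stage only rescales channel responses in (0, 1), so the excitation re-weights features rather than changing their spatial or temporal resolution.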
