Abstract

Human action recognition in videos is a fundamental topic in computer vision, and modeling the spatial–temporal dynamics of a video is crucial for action classification. In this paper, a novel attention module named the Channel-wise Non-local Attention Module (CNAM) is proposed to highlight important features both spatially and temporally. In addition, a second attention module, the Channel-wise Attention Recalibration Module (CARM), is developed to capture discriminative features at the channel level. Building on these two attention modules, a novel convolutional neural network named the Residual Attention Fusion Network (RAFN) is proposed to model long-range temporal structure while capturing more discriminative action features. Specifically, a sparse temporal sampling strategy is first adopted to uniformly sample frames along the temporal dimension as input to RAFN. Second, the CNAM and CARM attention modules are plugged into a residual network to highlight important action regions around the actors. Finally, the classification scores of the four RAFN streams are combined by late fusion. Experimental results on HMDB51 and UCF101 demonstrate the effectiveness and strong recognition performance of the proposed method.
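The abstract describes two of the pipeline's key operations: a channel-wise non-local attention block applied to residual-network features, and late fusion of per-stream classification scores. The sketch below illustrates both under loudly stated assumptions, since the abstract gives no formulas: the exact CNAM formulation (here, a softmax-normalized C×C channel affinity matrix with a residual connection) and the equal-weight fusion are plausible readings, not the authors' published equations.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def channel_nonlocal_attention(x):
    """Hypothetical sketch of a channel-wise non-local block (CNAM-style).

    x: features of shape (C, N), where the N = T*H*W spatial-temporal
    positions have been flattened. A C x C affinity matrix relates every
    channel to every other channel; its softmax-normalized rows reweight
    the channels, and a residual connection preserves the input features.
    """
    affinity = x @ x.T                    # (C, C) channel-to-channel similarity
    attention = softmax(affinity, axis=-1)
    return x + attention @ x              # residual connection


def late_fusion(stream_scores, weights=None):
    """Combine per-stream class scores by a (weighted) average.

    stream_scores: list of (num_classes,) score vectors, one per stream
    (the paper uses four streams); equal weights are assumed here.
    """
    scores = np.stack(stream_scores)      # (num_streams, num_classes)
    if weights is None:
        weights = np.full(len(stream_scores), 1.0 / len(stream_scores))
    return np.tensordot(weights, scores, axes=1)
```

For example, fusing two streams' softmax scores with `late_fusion([s1, s2])` averages them class by class; the attention block leaves the feature shape unchanged, so it can be inserted between residual stages without altering the backbone.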
