A Spatio-Temporal Attention Convolution Block for Action Recognition

Junjie Wang, Xueyan Wen

doi:10.1088/1742-6596/1651/1/012193

Abstract

We propose STAT, a simple and effective 3D convolution module with embedded spatio-temporal attention for action recognition. Given an intermediate feature map, the module sequentially infers attention distributions along the spatial and temporal dimensions and applies them to the current feature map in residual form, adaptively generating the feature map for the next stage. STAT is a general-purpose module that combines 3D convolution with attention: it is compatible with any 3D convolutional network and can easily replace a standard 3D convolution kernel. The additional overhead it introduces is negligible, and it can be trained end-to-end together with an ordinary 3D CNN. Experiments with currently popular 3D networks on the UCF101 and HMDB51 datasets show that STAT improves most of them, which suggests that the module is broadly applicable.
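The attention step described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not specify how the attention maps are computed, so the sketch assumes parameter-free average pooling followed by a sigmoid, applies spatial attention before temporal attention, and uses a hypothetical function name `stat_like_block`. A real module would most likely use learned convolutions in place of the pooling.

```python
import numpy as np


def sigmoid(v):
    """Element-wise logistic function, mapping values into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-v))


def spatial_attention(x):
    """Assumed spatial attention: pool over channels and time.

    x has shape (C, T, H, W); the result has shape (1, 1, H, W)
    so that it broadcasts against x.
    """
    pooled = x.mean(axis=(0, 1))            # (H, W)
    return sigmoid(pooled)[None, None, :, :]


def temporal_attention(x):
    """Assumed temporal attention: pool over channels and space.

    x has shape (C, T, H, W); the result has shape (1, T, 1, 1).
    """
    pooled = x.mean(axis=(0, 2, 3))         # (T,)
    return sigmoid(pooled)[None, :, None, None]


def stat_like_block(x):
    """Apply spatial then temporal attention, then add a residual.

    Hypothetical stand-in for the attention part of a STAT-style block;
    the 3D convolution itself is omitted.
    """
    y = x * spatial_attention(x)            # reweight each (h, w) location
    z = y * temporal_attention(y)           # reweight each frame
    return x + z                            # residual connection


# Example: a feature map with 4 channels, 8 frames, 6x6 spatial size.
x = np.random.default_rng(0).random((4, 8, 6, 6))
out = stat_like_block(x)
print(out.shape)
```

Because both attention maps lie in (0, 1), the attended branch only rescales the input, and the residual connection keeps the original features intact, so the output has the same shape as the input.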

Full Text