Abstract

Effective feature learning is a key component of human action recognition algorithms. A three-dimensional convolutional neural network (3D CNN) can extract spatio-temporal features directly; however, it is insufficient for capturing the most discriminative parts of an action video. Redundant spatial regions within and between temporal frames weaken the descriptive ability of the 3D CNN model. To address this problem, we propose a lightweight spatio-temporal attention module (ST-AM), composed of a spatial attention module (SAM) and a temporal attention module (TAM). Together, SAM and TAM encode the semantically informative spatial areas and suppress redundant temporal frames, which reduces misclassification. The proposed SAM and TAM have complementary effects and can be easily embedded into existing 3D CNN action recognition models. Experiments on the UCF-101 and HMDB-51 datasets show that the ST-AM-embedded model achieves impressive performance on the action recognition task.
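
To make the idea of reweighting a 3D CNN feature map concrete, the following is a minimal NumPy sketch of one possible spatial-then-temporal attention step. It is not the ST-AM architecture, whose layers are not specified in this abstract. The function names, the `(C, T, H, W)` feature layout, the channel pooling, the sigmoid gating, and the scalar parameters `w_avg`, `w_max`, and `scale` are all assumptions made for illustration, and a real module would use learned layers in place of the fixed scalars.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def spatial_attention(x, w_avg=1.0, w_max=1.0):
    """Hypothetical spatial gate for features x of shape (C, T, H, W).

    Pools over channels, then scales every location by a mask in (0, 1).
    w_avg and w_max stand in for learned weights (assumption).
    """
    avg = x.mean(axis=0)                      # (T, H, W)
    mx = x.max(axis=0)                        # (T, H, W)
    mask = sigmoid(w_avg * avg + w_max * mx)  # (T, H, W)
    return x * mask[None], mask


def temporal_attention(x, scale=1.0):
    """Hypothetical temporal gate for features x of shape (C, T, H, W).

    Summarizes each frame by its mean activation, then scales every frame
    by a weight in (0, 1). scale stands in for a learned layer (assumption).
    """
    desc = x.mean(axis=(0, 2, 3))                    # (T,)
    weights = sigmoid(scale * (desc - desc.mean()))  # (T,)
    return x * weights[None, :, None, None], weights


def st_attention(x):
    """Apply the spatial gate, then the temporal gate (ordering is an assumption)."""
    x, s_mask = spatial_attention(x)
    x, t_weights = temporal_attention(x)
    return x, s_mask, t_weights


# Toy feature map: 8 channels, 4 frames, 5x5 spatial grid.
rng = np.random.default_rng(0)
features = rng.standard_normal((8, 4, 5, 5))
out, s_mask, t_weights = st_attention(features)
```

Because the output has the same shape as the input, a block of this kind could in principle be placed between existing 3D convolutional layers without changing the rest of the network, which is consistent with the embedding property the abstract describes.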
