
Effectively modeling spatio-temporal information in videos is key to improving action recognition performance. In this work, we propose 3D residual networks with channel and spatial attention modules for action recognition. The proposed architecture extracts spatio-temporal features directly from the video. The channel and spatial attention modules help the network learn what and where to emphasize or suppress, at a virtually negligible increase in computational cost. Specifically, we sequentially apply the channel attention module and then the spatial attention module to each slice tensor of the intermediate feature map, producing channel and spatial attention maps. The attention maps are then multiplied with the input feature map to reweight important features. We validate our network through extensive experiments and visualizations on the HMDB-51 and UCF-101 datasets.
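The sequential reweighting described above can be illustrated with a minimal NumPy sketch. The abstract does not give the internals of the two modules, so this sketch assumes a CBAM-style design: average and max pooling, a shared two-layer MLP for channel attention, and a single 7x7 convolution for spatial attention. It also assumes that a "slice tensor" is one temporal slice of a (C, T, H, W) feature map. All names (`channel_attention`, `spatial_attention`, `attend`, `w1`, `w2`, `kernel`), the reduction ratio, and the random weights are hypothetical and stand in for learned parameters; this is not the authors' implementation.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def channel_attention(slice_chw, w1, w2):
    """Per-channel weights in (0, 1), shape (C,). Assumed CBAM-style."""
    avg = slice_chw.mean(axis=(1, 2))  # (C,) average-pooled descriptor
    mx = slice_chw.max(axis=(1, 2))    # (C,) max-pooled descriptor

    def mlp(v):
        # Shared two-layer MLP with ReLU; w1 reduces C to C // r.
        return w2 @ np.maximum(w1 @ v, 0.0)

    return sigmoid(mlp(avg) + mlp(mx))


def spatial_attention(slice_chw, kernel):
    """Per-location weights in (0, 1), shape (H, W). Assumed CBAM-style."""
    # Pool across channels, then stack into a 2-channel map.
    pooled = np.stack([slice_chw.mean(axis=0), slice_chw.max(axis=0)])
    k = kernel.shape[-1]               # odd kernel size assumed
    pad = k // 2
    padded = np.pad(pooled, ((0, 0), (pad, pad), (pad, pad)))  # zero padding
    h, w = slice_chw.shape[1:]
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            # Same-size 2D convolution (cross-correlation) over both channels.
            out[i, j] = np.sum(padded[:, i:i + k, j:j + k] * kernel)
    return sigmoid(out)


def attend(feature_cthw, w1, w2, kernel):
    """Apply channel then spatial attention to every temporal slice."""
    out = np.empty_like(feature_cthw)
    for t in range(feature_cthw.shape[1]):
        s = feature_cthw[:, t]                                # (C, H, W)
        s = s * channel_attention(s, w1, w2)[:, None, None]   # reweight channels
        s = s * spatial_attention(s, kernel)[None, :, :]      # reweight locations
        out[:, t] = s
    return out


# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
C, T, H, W, r = 8, 4, 6, 6, 2
feature = rng.standard_normal((C, T, H, W))
w1 = rng.standard_normal((C // r, C)) * 0.1
w2 = rng.standard_normal((C, C // r)) * 0.1
kernel = rng.standard_normal((2, 7, 7)) * 0.1
refined = attend(feature, w1, w2, kernel)
```

Because both attention maps pass through a sigmoid, every weight lies strictly between 0 and 1, so the output keeps the input's shape and can only scale features down. In this sketch the added parameters are the small MLP and one convolution kernel, shared across slices, which is consistent with the claim of a small overhead relative to the 3D convolutions of the backbone.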
