Abstract

3D CNNs are powerful tools for action recognition that can extract spatio-temporal features directly from raw videos. However, most existing 3D CNNs do not fully account for the adverse effects of the background motion that frequently appears in videos. Background motion is often mistaken for part of the human action, which can undermine the modeling of the action's dynamic pattern. In this paper, we propose the residual attention unit (RAU) to address this problem. RAU aims to suppress background motion by upweighting the values associated with the foreground region in the feature maps. Specifically, RAU contains two parallel submodules: spatial attention and channel-wise attention. Given an intermediate feature map, the spatial attention works in a bottom-up top-down manner to generate an attention mask, while the channel-wise attention automatically recalibrates the feature responses of all channels. Because applying the attention mechanism directly to the input features may discard discriminative information, we design a bypass that preserves the integrity of the original features through a shortcut connection between the input and output of the attention module. Notably, RAU can be embedded easily into 3D CNNs and trained end-to-end along with the network. Experimental results on UCF101 and HMDB51 demonstrate the effectiveness of RAU.
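
To make the structure described above concrete, the following is a minimal NumPy sketch of a residual attention unit's forward pass, not the implementation used in the paper. Several details are assumptions made for illustration: the spatial branch is reduced to a single channel projection followed by a sigmoid (a stand-in for the bottom-up top-down mask generator), the channel branch is assumed to take a squeeze-and-excitation-style form, and the two parallel branches are assumed to be combined by summation. All function and parameter names (`spatial_attention`, `channel_attention`, `residual_attention_unit`, `w_spatial`, `w1`, `w2`) are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def spatial_attention(x, w_spatial):
    """Return a mask of shape (1, T, H, W) with values in (0, 1).

    Assumption: a single projection across channels stands in for the
    bottom-up top-down mask generator mentioned in the abstract.
    x: feature map of shape (C, T, H, W); w_spatial: shape (C,).
    """
    logits = np.tensordot(w_spatial, x, axes=([0], [0]))  # (T, H, W)
    return sigmoid(logits)[None, ...]


def channel_attention(x, w1, w2):
    """Return one gate per channel, shape (C, 1, 1, 1), values in (0, 1).

    Assumption: squeeze-and-excitation-style recalibration.
    w1: shape (C // r, C); w2: shape (C, C // r).
    """
    squeezed = x.mean(axis=(1, 2, 3))        # global average pool -> (C,)
    hidden = np.maximum(w1 @ squeezed, 0.0)  # ReLU bottleneck -> (C // r,)
    gate = sigmoid(w2 @ hidden)              # (C,)
    return gate[:, None, None, None]


def residual_attention_unit(x, w_spatial, w1, w2):
    """Apply both attention branches in parallel and add the shortcut.

    Assumption: the branch outputs are summed before the shortcut is added.
    """
    attended = x * spatial_attention(x, w_spatial) + x * channel_attention(x, w1, w2)
    return x + attended  # shortcut keeps the original features intact


# Usage: a random feature map with 8 channels, 4 frames, 6x6 spatial size.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4, 6, 6))
out = residual_attention_unit(
    x,
    w_spatial=rng.standard_normal(8),
    w1=rng.standard_normal((2, 8)),
    w2=rng.standard_normal((8, 2)),
)
```

Because the shortcut adds the input back to the attended features, the output always retains the original feature values, however strongly either mask suppresses a region, which is the role the abstract assigns to the bypass.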
