Abstract

The irregularity of human actions poses great challenges in video action recognition. Recently, 3D ConvNet methods have shown promising performance at modelling the motion and appearance information. However, the fixed geometric structure of 3D convolution filters largely limits the learning capacity for video action recognition. To address this problem, this paper proposes a spatio-temporal deformable ConvNet module with an attention mechanism, which takes into consideration the mutual correlations in both temporal and spatial domains, to effectively capture the long-range and long-distance dependencies in the video actions. Our attention based deformable module, as a generic module for 3D ConvNets, can adaptively learn more accurate spatio-temporal offsets to model the action irregularity. The experiments on two popular datasets (UCF-101 and HMDB-51) demonstrate that our module significantly outperforms the state-of-the-art methods.

Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.