Abstract

We propose a method for action recognition in videos that uses a two-stream 3D convolutional network to capture rich spatial and temporal information and then processes the resulting features with an attention module to model long- and short-term dependencies. By taking advantage of 3D convolutions, the network captures not only spatial information but also motion in the video as temporal information. Long-term temporal dependencies matter because an action is often identifiable only across an extended span of frames. The bidirectional self-attention network uses forward and backward masks to encode temporal order and applies attention over the sequence of 3D convolutional features. Experimental results indicate that the proposed method is competitive with state-of-the-art work on the HMDB-51 dataset while using a less complex pipeline.
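
To make the described architecture concrete, the following is a minimal illustrative sketch in PyTorch, not the authors' implementation: a two-stream 3D-convolutional backbone (RGB and optical flow) whose per-frame features pass through forward- and backward-masked self-attention. The backbone layers, feature dimension, fusion by addition, and mask construction are all our assumptions; only the overall two-stream-plus-bidirectional-attention structure comes from the abstract.

```python
import torch
import torch.nn as nn

class Conv3DStream(nn.Module):
    """One 3D-convolution stream (e.g. RGB or optical flow); layers are illustrative."""
    def __init__(self, in_channels, feat_dim=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv3d(in_channels, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
            nn.Conv3d(64, feat_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),  # keep the temporal axis, pool space away
        )

    def forward(self, x):                         # x: (B, C, T, H, W)
        f = self.backbone(x)                      # (B, D, T, 1, 1)
        return f.squeeze(-1).squeeze(-1).transpose(1, 2)  # (B, T, D)

def directional_mask(t, forward, device):
    """Forward mask lets step i attend to steps <= i; backward mask is the transpose."""
    allowed = torch.tril(torch.ones(t, t, dtype=torch.bool, device=device))
    return allowed if forward else allowed.T

class BidirectionalSelfAttention(nn.Module):
    """Two masked self-attention passes encode temporal order in each direction."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.fwd = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.bwd = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                         # x: (B, T, D)
        t = x.size(1)
        # attn_mask convention: True marks positions that may NOT be attended to
        fm = ~directional_mask(t, True, x.device)
        bm = ~directional_mask(t, False, x.device)
        out_f, _ = self.fwd(x, x, x, attn_mask=fm)
        out_b, _ = self.bwd(x, x, x, attn_mask=bm)
        return out_f + out_b

class TwoStreamActionModel(nn.Module):
    def __init__(self, num_classes=51, feat_dim=256):  # 51 classes for HMDB-51
        super().__init__()
        self.rgb = Conv3DStream(3, feat_dim)      # RGB frames
        self.flow = Conv3DStream(2, feat_dim)     # x/y optical-flow channels (assumed)
        self.attn = BidirectionalSelfAttention(feat_dim)
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, rgb_clip, flow_clip):
        feats = self.rgb(rgb_clip) + self.flow(flow_clip)  # fuse streams by addition (assumed)
        feats = self.attn(feats)                           # (B, T, D)
        return self.head(feats.mean(dim=1))                # average over time -> clip logits

# Example usage with random tensors (batch=2, 16 frames, 112x112 crops):
model = TwoStreamActionModel()
logits = model(torch.randn(2, 3, 16, 112, 112), torch.randn(2, 2, 16, 112, 112))
print(logits.shape)  # torch.Size([2, 51])
```

Summing the two directional attention outputs is one simple way to combine forward and backward context; the paper's actual fusion and pooling choices may differ.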
