Abstract

We propose a new method for recognizing actions in videos that uses a two-stream 3D convolutional network to capture rich spatial and temporal information, then processes the resulting features with an attention module to capture long- and short-term dependencies. By taking advantage of 3D convolutions, the network captures not only spatial information but also motion information as temporal features. Long-term temporal dependencies are especially important for identifying actions in videos, which motivates the attention module. The bidirectional self-attention network uses forward and backward masks to encode temporal order information and applies attention over the sequence of 3D convolutional features. The experimental results indicate that the proposed method achieves performance comparable to state-of-the-art work on the HMDB-51 dataset while using a less complex pipeline.
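Since the full text is not included here, the following minimal sketch illustrates the idea of forward/backward-masked self-attention applied over a sequence of clip-level 3D CNN features. The class names, single-head design, and dimensions are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedSelfAttention(nn.Module):
    """Single-head scaled dot-product self-attention with a directional
    mask: 'forward' lets each position attend only to earlier positions,
    'backward' only to later ones (a sketch, not the paper's exact design)."""
    def __init__(self, dim, direction):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.direction = direction  # "forward" or "backward"
        self.scale = dim ** -0.5

    def forward(self, x):                              # x: (batch, time, dim)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        scores = q @ k.transpose(-2, -1) * self.scale  # (batch, time, time)
        t = x.size(1)
        mask = torch.ones(t, t, dtype=torch.bool, device=x.device)
        # Forward mask keeps the lower triangle (j <= i); backward the upper.
        mask = torch.tril(mask) if self.direction == "forward" else torch.triu(mask)
        scores = scores.masked_fill(~mask, float("-inf"))
        return F.softmax(scores, dim=-1) @ v

class BidirectionalSelfAttention(nn.Module):
    """Fuses forward- and backward-masked attention so temporal order
    is encoded in both directions."""
    def __init__(self, dim):
        super().__init__()
        self.fwd = MaskedSelfAttention(dim, "forward")
        self.bwd = MaskedSelfAttention(dim, "backward")
        self.out = nn.Linear(2 * dim, dim)

    def forward(self, x):
        return self.out(torch.cat([self.fwd(x), self.bwd(x)], dim=-1))

# Hypothetical usage: a batch of 2 videos, each summarized as 16 clip-level
# feature vectors of size 512 produced by a 3D CNN backbone.
features = torch.randn(2, 16, 512)
attended = BidirectionalSelfAttention(512)(features)
print(attended.shape)  # torch.Size([2, 16, 512])
```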
