Abstract

Joint spatio-temporal feature learning is the key to video-based action recognition. Off-the-shelf techniques mostly rely on two-stream networks that either simply fuse the classification scores or integrate only the high-level features; as a result, they cannot learn inter-modality relationships well. We propose a joint attentive (JA) adaptive feature fusion (AFF) network, a three-stream network that improves inter-modality fusion by exploiting the complementary and interactive information of two modalities, RGB and optical flow. Specifically, we design an AFF block that performs layer-wise fusion across both modality channels and feature levels, so that spatio-temporal representations from different modalities and at different levels can be fused effectively. To capture the three-dimensional interaction of spatio-temporal features, we devise a JA module that incorporates the inter-dependencies learned by a spatial-channel attention mechanism and combines multi-scale attention to refine fine-grained features. Extensive experiments on three public action recognition benchmark datasets demonstrate that our method achieves competitive results.
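
The abstract gives only a high-level description of the two components. The following is a minimal sketch, assuming PyTorch tensors of shape (N, C, H, W), of how a gated two-modality fusion block and a joint spatial-channel attention module could look. The class names (AFFBlock, JointAttention), the per-channel gating scheme, and the CBAM-style attention are illustrative assumptions, not the paper's actual implementation.

import torch
import torch.nn as nn

class JointAttention(nn.Module):
    """Joint spatial-channel attention (illustrative; the paper's exact JA design is not given here)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        # Channel attention: squeeze spatial dims, learn per-channel weights.
        self.channel_mlp = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Spatial attention: squeeze the channel dim, learn a per-location map.
        self.spatial_conv = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, x):
        x = x * self.channel_mlp(x)                # channel-wise reweighting
        avg_map = x.mean(dim=1, keepdim=True)      # (N, 1, H, W)
        max_map = x.amax(dim=1, keepdim=True)      # (N, 1, H, W)
        return x * self.spatial_conv(torch.cat([avg_map, max_map], dim=1))

class AFFBlock(nn.Module):
    """Adaptive fusion of RGB and optical-flow features at one layer (illustrative sketch)."""
    def __init__(self, channels):
        super().__init__()
        # Learn a per-channel gate that balances the two modalities.
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        self.attend = JointAttention(channels)

    def forward(self, rgb_feat, flow_feat):
        g = self.gate(torch.cat([rgb_feat, flow_feat], dim=1))  # (N, C, 1, 1)
        fused = g * rgb_feat + (1 - g) * flow_feat              # convex per-channel blend
        return self.attend(fused)

In this sketch, the convex per-channel gate lets the network decide how much each modality contributes at a given layer before the joint attention refines the fused map; applying such a block at several feature levels would approximate the layer-wise fusion the abstract describes.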
