Abstract

Two-stream networks have achieved strong results on action recognition benchmarks by modeling the interdependence of different motions. However, previous two-stream networks focus on action modeling while ignoring the differences in importance among short-term actions, which limits their ability to model those actions. We therefore propose a Short-term Action Differentiated Attention (SADA) module built on a two-stream structure whose inputs have different temporal resolutions, and we embed the SADA module in a novel two-stream transformer architecture called the Fast-Slow Transformer (FSformer). The SADA module explicitly attends to the differences in importance between short-term actions. It can: (i) apply attention over the video frames to learn how the features of different short-term actions differ in their importance for action recognition, and (ii) fuse this importance knowledge with context information through a novel Fast-Slow Attention mechanism. Overall, the SADA module focuses on the differing importance of short-term actions and improves action recognition performance. We evaluate our method on three challenging densely labeled action datasets and achieve results that surpass the state of the art.
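The abstract gives no implementation details, so as a rough illustration of the kind of fusion a Fast-Slow Attention mechanism could perform, the sketch below shows cross-attention between a fast (densely sampled) and a slow (sparsely sampled) token stream in PyTorch. The class name FastSlowAttention, the tensor shapes, and the use of nn.MultiheadAttention are all assumptions for illustration, not the authors' actual SADA/FSformer implementation.

```python
import torch
import torch.nn as nn

class FastSlowAttention(nn.Module):
    """Hypothetical cross-attention between a high-frame-rate (fast) stream
    and a low-frame-rate (slow) stream; all dimensions are illustrative."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, fast_tokens: torch.Tensor, slow_tokens: torch.Tensor):
        # Slow (context-rich) tokens query the fast (fine-grained motion)
        # tokens, so each short-term segment is weighted by learned attention.
        fused, weights = self.attn(query=slow_tokens,
                                   key=fast_tokens,
                                   value=fast_tokens)
        return fused, weights  # weights: per-segment importance over fast tokens

# Example with illustrative shapes: 32 densely sampled vs. 8 sparsely
# sampled frame tokens, 256-dim embeddings, batch of 2 clips.
B, T_fast, T_slow, D = 2, 32, 8, 256
fast = torch.randn(B, T_fast, D)
slow = torch.randn(B, T_slow, D)
fused, w = FastSlowAttention(D)(fast, slow)  # fused: (B, T_slow, D)
```

Under these assumptions, the attention weights give each slow-stream token a learned distribution over the fast-stream tokens, which is one plausible way to realize "differentiated" importance across short-term actions while fusing it with longer-range context.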
