Abstract

Human action recognition is the task of classifying the actions in a video. Recently, the Vision Transformer (ViT) has been applied to action recognition. However, ViT is unsuitable for high-resolution input videos because of its computational cost: it splits frames into fixed-size patch embeddings (i.e., tokens) with absolute-position information and models the relationships among these tokens with a pure Transformer encoder. To address this issue, we propose a relative-position embedding based spatially and temporally decoupled Transformer (RPE-STDT) for action recognition, which captures spatial-temporal information with stacked self-attention layers. The proposed RPE-STDT model consists of two separate series of Transformer encoders. The first series, the spatial Transformer encoders, models interactions between tokens extracted from the same temporal index. The second series, the temporal Transformer encoders, models interactions across the time dimension with a subsampling strategy. Furthermore, we replace the absolute-position embeddings of the Vision Transformer encoders with the proposed relative-position embeddings, which capture the order of the embedded tokens while reducing computational costs. Finally, we conduct thorough ablation studies. RPE-STDT achieves state-of-the-art results on multiple action recognition datasets, surpassing prior convolutional and Transformer-based networks.
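
To make the decoupled design concrete, the following is a minimal sketch (not the authors' released code) of one factorized block: self-attention over the patch tokens of each frame, followed by self-attention over the time dimension with a simple stride-2 subsampling, where both stages add a learnable relative-position bias to the attention logits instead of absolute-position embeddings. All layer sizes, the subsampling stride, and the module names are illustrative assumptions, not values taken from the paper.

```python
# Minimal sketch of a spatially/temporally factorized Transformer block with a
# learnable relative-position bias. Shapes and hyperparameters are illustrative.
import torch
import torch.nn as nn


class RelPosSelfAttention(nn.Module):
    """Multi-head self-attention with a learnable 1-D relative-position bias."""

    def __init__(self, dim, num_heads, max_len):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # One bias per head and per relative offset in [-(max_len-1), max_len-1].
        self.rel_bias = nn.Parameter(torch.zeros(num_heads, 2 * max_len - 1))

    def forward(self, x):                      # x: (B, N, C)
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)   # each: (B, heads, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        # Relative offset between query i and key j, shifted to be non-negative.
        idx = torch.arange(N, device=x.device)
        rel = idx[None, :] - idx[:, None] + (N - 1)            # (N, N)
        attn = attn + self.rel_bias[:, rel]                     # broadcast over B
        x = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(x)


class FactorizedBlock(nn.Module):
    """Spatial attention within each frame, then temporal attention per patch."""

    def __init__(self, dim=192, num_heads=3, num_patches=196, num_frames=16):
        super().__init__()
        self.norm_s = nn.LayerNorm(dim)
        self.attn_s = RelPosSelfAttention(dim, num_heads, num_patches)
        self.norm_t = nn.LayerNorm(dim)
        self.attn_t = RelPosSelfAttention(dim, num_heads, num_frames)

    def forward(self, x):                      # x: (B, T, P, C)
        B, T, P, C = x.shape
        # Spatial encoder: attend over the patch tokens of the same temporal index.
        s = x.reshape(B * T, P, C)
        s = s + self.attn_s(self.norm_s(s))
        # Temporal encoder: attend over time, here with an illustrative stride-2
        # subsampling of the frames.
        t = s.reshape(B, T, P, C).permute(0, 2, 1, 3).reshape(B * P, T, C)
        t_sub = t[:, ::2]
        t_sub = t_sub + self.attn_t(self.norm_t(t_sub))
        return t_sub.reshape(B, P, -1, C).permute(0, 2, 1, 3)   # (B, T/2, P, C)


if __name__ == "__main__":
    tokens = torch.randn(2, 16, 196, 192)      # (batch, frames, patches, dim)
    print(FactorizedBlock()(tokens).shape)     # torch.Size([2, 8, 196, 192])
```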
