An Efficient Spatio-Temporal Pyramid Transformer for Action Detection

Yuetian Weng, Bohan Zhuang, Mingfei Han, Zizheng Pan, Xiaojun Chang

doi:10.1007/978-3-031-19830-4_21

Abstract

The task of action detection is to infer both the category and the start and end times of each action instance in a long, untrimmed video. While vision Transformers have driven recent advances in video understanding, designing an efficient architecture for action detection is non-trivial because self-attention over a long sequence of video clips is prohibitively expensive. To this end, we present an efficient hierarchical Spatio-Temporal Pyramid Transformer (STPT) for action detection, building on the observation that the early self-attention layers in Transformers still focus on local patterns. Specifically, we propose to use local window attention to encode rich local spatio-temporal representations in the early stages, while applying global attention modules to capture long-term space-time dependencies in the later stages. In this way, STPT encodes both locality and long-range dependency with largely reduced redundancy, delivering a promising trade-off between accuracy and efficiency. For example, with only RGB input, the proposed STPT achieves 53.6% mAP on THUMOS14, surpassing the I3D+AFSD RGB model by over 10% and performing favorably against the state-of-the-art AFSD, which uses additional flow features, with 31% fewer GFLOPs. STPT thus serves as an effective and efficient end-to-end Transformer-based framework for action detection. Code is available at https://github.com/ziplab/STPT.

Keywords: Action detection, Efficient video transformers
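The contrast between local window attention and global attention described in the abstract can be illustrated with a minimal NumPy sketch. This is not the STPT implementation: it assumes a flat 1D token sequence, non-overlapping windows, and no learned query/key/value projections (queries, keys, and values are the tokens themselves). The function names `global_attention`, `local_window_attention`, and `attention_pairs` are hypothetical and chosen only for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def global_attention(tokens):
    """Self-attention over all N tokens at once.

    Simplifying assumption: Q = K = V = tokens (no learned projections).
    """
    d = tokens.shape[-1]
    scores = tokens @ tokens.T / np.sqrt(d)  # (N, N) pairwise scores
    return softmax(scores) @ tokens          # (N, d)


def local_window_attention(tokens, window):
    """Self-attention restricted to non-overlapping windows of `window` tokens."""
    n, d = tokens.shape
    assert n % window == 0, "sequence length must be divisible by the window size"
    blocks = tokens.reshape(n // window, window, d)           # (N/w, w, d)
    scores = blocks @ blocks.transpose(0, 2, 1) / np.sqrt(d)  # (N/w, w, w)
    out = softmax(scores) @ blocks                            # (N/w, w, d)
    return out.reshape(n, d)


def attention_pairs(n, window=None):
    """Number of query-key pairs scored: N*N globally, N*w with windows."""
    return n * n if window is None else n * window


rng = np.random.default_rng(0)
x = rng.standard_normal((1024, 32))
local_out = local_window_attention(x, window=16)
global_out = global_attention(x)
print(attention_pairs(1024), attention_pairs(1024, window=16))
```

Under these assumptions, the windowed variant scores N·w query-key pairs instead of N², which is why restricting the early, high-resolution stages to local windows and reserving global attention for the later, shorter sequences can reduce cost.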

