A Multitemporal Scale and Spatial–Temporal Transformer Network for Temporal Action Localization

Zan Gao, An-An Liu, Xinglei Cui, Meng Wang, Tao Zhuo, Zhiyong Cheng, Shenyong Chen

doi:10.1109/thms.2023.3266037

Abstract

Temporal action localization, which aims to localize and classify actions in untrimmed videos, plays an important role in video analysis. Previous methods often predict actions in a feature space of a single temporal scale. However, temporal features at a low-level scale lack sufficient semantics for action classification, while features at a high-level scale cannot provide the rich detail needed to locate action boundaries. In addition, the long-range dependencies between video frames are often ignored. To address these issues, a novel multitemporal-scale spatial–temporal transformer (MSST) network is proposed for temporal action localization, which predicts actions in a feature space of multiple temporal scales. Specifically, we first use refined feature pyramids of different scales to pass semantics from high-level scales to low-level scales. Second, to establish the long temporal scale of the entire video, we use a spatial–temporal transformer encoder to capture the long-range dependencies between video frames. Then, the refined features with long-range dependencies are fed into a classifier for coarse action prediction. Finally, to further improve prediction accuracy, we propose a frame-level self-attention module that refines the classification and boundaries of each action instance. Most importantly, these three modules are jointly explored in a unified framework, and MSST has an anchor-free, end-to-end architecture. Extensive experiments show that the proposed method outperforms state-of-the-art approaches on the THUMOS14 dataset and achieves comparable performance on the ActivityNet1.3 dataset. Compared with A2Net (TIP20, Avg{0.3:0.7}), Sub-Action (CSVT2022, Avg{0.1:0.5}), and AFSD (CVPR21, Avg{0.3:0.7}) on the THUMOS14 dataset, the proposed method achieves improvements of 12.6%, 17.4%, and 2.2%, respectively.
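The abstract reports results as Avg{0.3:0.7} and Avg{0.1:0.5} without defining the notation. A likely reading, assumed here rather than stated in the abstract, is mean average precision averaged over a range of temporal IoU (tIoU) thresholds. The sketch below shows only the building blocks: the tIoU between a predicted and a ground-truth segment, and the threshold list such a range would expand to. The function names and the 0.1 step are illustrative assumptions, and the full mAP computation is not shown.

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) segments, e.g. in seconds."""
    # Overlap is clipped at zero when the segments do not intersect.
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def tiou_thresholds(low, high, step=0.1):
    """Expand a range such as 0.3:0.7 into a threshold list.

    The 0.1 step is an assumption; the abstract does not state it.
    """
    count = int(round((high - low) / step)) + 1
    return [round(low + i * step, 2) for i in range(count)]


def hits_per_threshold(pred, gt, thresholds):
    """For each threshold, report whether the prediction counts as a match."""
    iou = temporal_iou(pred, gt)
    return {t: iou >= t for t in thresholds}


if __name__ == "__main__":
    prediction = (2.0, 8.0)      # hypothetical predicted action segment
    ground_truth = (3.0, 9.0)    # hypothetical annotated action segment
    print(temporal_iou(prediction, ground_truth))
    print(hits_per_threshold(prediction, ground_truth, tiou_thresholds(0.3, 0.7)))
```

Under this reading, a prediction that is only roughly aligned with the ground truth counts at the loose thresholds but not at the strict ones, which is why averaging over the range rewards precise boundaries.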

Journal: IEEE Transactions on Systems, Man and Cybernetics, Part C (Applications and Reviews)
Publication Date: Jun 1, 2023
Citations: 3

Similar Papers

A Temporal-Aware Relation and Attention Network for Temporal Action Localization
Yibo Zhao ... Jie Nie
IEEE Transactions on Image Processing | VOL. 31
01 Jan 2021

End-to-End Temporal Action Detection Using Bag of Discriminant Snippets
Fiza Murtaza ... Yu Qian
IEEE Signal Processing Letters | VOL. 26
01 Feb 2019

Spatiotemporal Multi-Task Network for Human Activity Understanding
Yao Liu ... Xian-Sheng Hua
23 Oct 2017

Action Unit Memory Network for Weakly Supervised Temporal Action Localization
Wang Luo ... Wenfei Yang
01 Jun 2021
