Video representation learning for temporal action detection using global-local attention

Yiping Tang, Yang Zheng, Chen Wei, Kaitai Guo, Haihong Hu, Jimin Liang

doi:10.1016/j.patcog.2022.109135

Abstract

Video representation is of significant importance for temporal action detection. The two sub-tasks of temporal action detection, i.e., action classification and action localization, place different requirements on the video representation. Action classification requires highly discriminative representations, so that action features and background features are as dissimilar as possible. Action localization requires information about both the action itself and its surrounding context in order to predict action boundaries accurately. However, previous methods obtain the representations for both sub-tasks in a similar way and therefore fail to extract the optimal representation for either. In this paper, a Global-Local Attention (GLA) mechanism is proposed to produce a more powerful video representation for temporal action detection without introducing additional parameters. The global attention mechanism predicts each action category by integrating features from the entire video that are similar to the action while suppressing other features, thus enhancing the discriminability of the video representation during training. The local attention mechanism uses a Gaussian weighting function to integrate each action with its surrounding contextual information, thereby enabling precise localization of the action. The effectiveness of GLA is demonstrated on THUMOS’14 and ActivityNet-1.3 with a simple one-stage action detection network, which achieves state-of-the-art performance among methods that use only RGB images as input. The inference speed of the proposed model reaches 1373 FPS on a single Nvidia Titan Xp GPU. The generalizability of GLA to other detection architectures is verified using R-C3D and Decouple-SSAD, both of which achieve consistent improvements. The experimental results demonstrate that designing representations with different properties for the two sub-tasks leads to better temporal action detection performance than obtaining both representations in a similar way.
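
The two attention mechanisms described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything below is an assumption made for illustration: the function names, the use of cosine similarity followed by a softmax for the global branch, and the choice of a Gaussian standard deviation proportional to the action width (`sigma_scale`) for the local branch. It is not the authors' implementation. Both functions are parameter-free in the sense that they contain no learned weights.

```python
import math


def gaussian_local_attention(features, center, width, sigma_scale=0.5):
    """Pool a T x C feature sequence with Gaussian weights centred on `center`.

    Assumption: sigma = sigma_scale * width, so wider actions gather more context.
    """
    sigma = max(sigma_scale * width, 1e-6)
    weights = [
        math.exp(-((t - center) ** 2) / (2.0 * sigma ** 2))
        for t in range(len(features))
    ]
    total = sum(weights)
    weights = [w / total for w in weights]  # normalise so the weights sum to 1
    channels = len(features[0])
    pooled = [sum(w * f[c] for w, f in zip(weights, features)) for c in range(channels)]
    return pooled, weights


def global_similarity_attention(features, query):
    """Pool the whole sequence, emphasising time steps similar to `query`.

    Assumption: similarity is cosine similarity, turned into weights by a softmax.
    """
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return dot / (norm_a * norm_b + 1e-8)

    sims = [cosine(f, query) for f in features]
    peak = max(sims)  # subtract the maximum for a numerically stable softmax
    exps = [math.exp(s - peak) for s in sims]
    total = sum(exps)
    weights = [e / total for e in exps]
    channels = len(features[0])
    pooled = [sum(w * f[c] for w, f in zip(weights, features)) for c in range(channels)]
    return pooled, weights


# Toy sequence: the first two time steps look like the "action", the last two like background.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
local_vec, local_w = gaussian_local_attention(features, center=1.0, width=2.0)
global_vec, global_w = global_similarity_attention(features, query=features[0])
```

In this toy example the local weights peak at the action centre and decay over the neighbouring context, while the global weights favour time steps that resemble the query feature wherever they occur in the video.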

Journal: Pattern Recognition
Publication Date: Oct 28, 2022
Citations: 6

Similar Papers

3D hierarchical dual-attention fully convolutional networks with hybrid losses for diverse glioma segmentation
Deting Kong ... Jie Xue
Knowledge-Based Systems | VOL. 237
14 Nov 2021

English-Chinese Machine Translation Model Based on Bidirectional Neural Network with Attention Mechanism
Li Yonglan ... He Wenjia
Journal of Sensors | VOL. 2022
17 Mar 2022

Capsule Boundary Network With 3D Convolutional Dynamic Routing for Temporal Action Detection
Yaosen Chen ... Bing Guo
IEEE Transactions on Circuits and Systems for Video Technology | VOL. 32
01 May 2022

Temporal Action Detection by Joint Identification-Verification
Wen Wang ... Jian Cheng
01 Aug 2018
