An empirical study on temporal modeling for online action detection

Wen Wang, Jian Cheng, Yu Qiao, Xiaojiang Peng

doi:10.1007/s40747-021-00534-3

Abstract

Online action detection (OAD) is a practical yet challenging task that has attracted increasing attention in recent years. A typical OAD system consists of three modules: a frame-level feature extractor, usually based on pre-trained deep Convolutional Neural Networks (CNNs); a temporal modeling module; and an action classifier. Among them, the temporal modeling module is crucial because it aggregates discriminative information from historical and current features. Although many temporal modeling methods have been developed for OAD and other tasks, their effects on OAD have not been fairly investigated. This paper provides an empirical study of temporal modeling for OAD covering four meta types of methods, i.e., temporal pooling, temporal convolution, recurrent neural networks, and temporal attention, and uncovers some good practices for building a state-of-the-art OAD system. Many of these methods are explored in OAD for the first time and are extensively evaluated with various hyperparameters. Furthermore, based on our empirical study, we present several hybrid temporal modeling methods. Our best networks, i.e., the hybridization of DCC, LSTM and M-NL and the hybridization of DCC and M-NL, outperform previously published results by sizable margins on the THUMOS-14 dataset (48.6% vs. 47.2%) and the TVSeries dataset (84.3% vs. 83.7%).

Highlights

Online action detection (OAD) is an important problem in computer vision, with a wide range of applications such as visual surveillance, human–computer interaction, and intelligent robot navigation.
Our study focuses on the temporal online action detection problem; for convenience, we omit ‘temporal’ in the rest of the paper.
– We provide a fair empirical study of eleven temporal modeling methods for online action detection. Many of these methods are introduced into OAD for the first time, including temporal convolution (TC), pyramid dilated temporal convolution (PDC), dilated causal convolution (DCC), and non-local.
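
To make the idea of a dilated causal convolution concrete, the following is a minimal single-channel sketch in plain Python. It is not the authors' implementation: the function name `dilated_causal_conv`, the zero padding for frames before the start of the stream, and the toy weights are all assumptions made for illustration. The point it shows is that the output at time t depends only on inputs at times t, t - d, t - 2d, and so on, never on future frames.

```python
from typing import List


def dilated_causal_conv(
    x: List[float], weights: List[float], dilation: int = 1
) -> List[float]:
    """Hypothetical helper: y[t] = sum_k weights[k] * x[t - k * dilation].

    Inputs before the start of the sequence are treated as zero, so the
    output at time t never looks at any frame later than t.
    """
    if dilation < 1:
        raise ValueError("dilation must be >= 1")
    out: List[float] = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation  # index of a current or past frame
            if idx >= 0:
                acc += w * x[idx]
        out.append(acc)
    return out


# Toy per-frame scores; a real system would use CNN feature vectors.
signal = [1.0, 2.0, 3.0, 4.0, 5.0]
y = dilated_causal_conv(signal, weights=[1.0, 1.0], dilation=2)
print(y)  # [1.0, 2.0, 4.0, 6.0, 8.0]
```

Increasing `dilation` widens the span of history that each output covers without adding weights, which is the usual motivation for stacking such layers.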

Summary

Introduction

Online action detection (OAD) is an important problem in computer vision, with a wide range of applications such as visual surveillance, human–computer interaction, and intelligent robot navigation. Unlike traditional action recognition and offline action detection, which aim to recognize actions from fully observed videos, online action detection aims to detect an action as it happens, ideally even before the action is fully completed. It is a very challenging problem: in addition to the difficulties of traditional action recognition in untrimmed video streams, it is restricted to using only historical and current information. Our study focuses on the temporal online action detection problem; for convenience, we omit ‘temporal’ in the rest of the paper.
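
The online restriction described above can be illustrated with a small streaming sketch of temporal average pooling, the simplest of the temporal modeling families named in the abstract. Everything here is an assumption for illustration rather than the paper's code: the class name `OnlineAveragePooler`, the window size, and the toy two-dimensional features. At each step the pooler sees only the current frame feature and a bounded buffer of past ones.

```python
from collections import deque
from typing import Deque, List


class OnlineAveragePooler:
    """Hypothetical helper: mean of the most recent `window` frame features."""

    def __init__(self, window: int) -> None:
        if window < 1:
            raise ValueError("window must be >= 1")
        # deque(maxlen=...) silently drops the oldest feature when full.
        self.buffer: Deque[List[float]] = deque(maxlen=window)

    def update(self, feature: List[float]) -> List[float]:
        """Add the current frame feature and return the pooled feature."""
        self.buffer.append(feature)
        n = len(self.buffer)
        # zip(*buffer) groups values by feature dimension.
        return [sum(column) / n for column in zip(*self.buffer)]


# Toy stream of 2-D frame features arriving one at a time.
stream = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]
pooler = OnlineAveragePooler(window=2)
pooled = [pooler.update(frame) for frame in stream]
print(pooled)  # [[1.0, 0.0], [2.0, 1.0], [4.0, 3.0]]
```

In a full system the pooled feature at each step would be passed to the action classifier, so a prediction is available as soon as each frame arrives.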

Objectives

Methods

Results

Conclusion