Abstract

Online Action Detection (OAD) has attracted increasing attention in recent years. A network for OAD generally consists of three parts: a frame-level feature extractor, a temporal modeling module, and an action classifier. Most recent OAD networks use a single-channel Recurrent Neural Network (RNN) to capture long-term history information, with spatial and temporal features concatenated as network input. In OAD, spatial features describe object appearance and scene configuration within each frame, while temporal features capture motion cues over time. Fusing both spatial and temporal features effectively is therefore crucial. In this paper, we propose a new framework named TwinLSTM, based on a two-channel Long Short-Term Memory (LSTM) network for OAD, in which one channel extracts and processes spatial features and the other temporal features. To fuse the two feature streams more effectively, we design a prediction fusion module (PFM) that uses the hidden states of both channels to obtain richer action context, including cross-channel information interaction and future context prediction. We evaluate TwinLSTM on two challenging datasets: THUMOS14 and HDD. Experiments show that TwinLSTM outperforms existing single-channel models by a significant margin. We also demonstrate the effectiveness of PFM through comprehensive ablation studies.
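To make the two-channel design concrete, the following is a minimal PyTorch sketch of the architecture described above: one LSTM per feature stream, with the per-step hidden states of both channels combined before classification. All dimensions, layer names, and the simple concatenation-plus-MLP fusion are illustrative assumptions; the paper's actual PFM additionally performs future context prediction, which is omitted here.

```python
import torch
import torch.nn as nn

class TwinLSTMSketch(nn.Module):
    """Illustrative two-channel LSTM: one channel for spatial (appearance)
    features, one for temporal (motion) features, fused per time step.
    Not the authors' implementation; dimensions are assumptions."""

    def __init__(self, spatial_dim=2048, temporal_dim=1024,
                 hidden_dim=512, num_classes=22):
        super().__init__()
        # One LSTM per feature stream (channel).
        self.spatial_lstm = nn.LSTM(spatial_dim, hidden_dim, batch_first=True)
        self.temporal_lstm = nn.LSTM(temporal_dim, hidden_dim, batch_first=True)
        # Stand-in for the paper's prediction fusion module (PFM):
        # mixes the hidden states of both channels at each step.
        self.fusion = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.ReLU(),
        )
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, spatial_feats, temporal_feats):
        # spatial_feats:  (batch, time, spatial_dim)
        # temporal_feats: (batch, time, temporal_dim)
        h_s, _ = self.spatial_lstm(spatial_feats)
        h_t, _ = self.temporal_lstm(temporal_feats)
        fused = self.fusion(torch.cat([h_s, h_t], dim=-1))
        # Per-frame action scores, as required for online detection.
        return self.classifier(fused)

# Usage: 2 clips of 64 frames with assumed feature sizes.
model = TwinLSTMSketch()
logits = model(torch.randn(2, 64, 2048), torch.randn(2, 64, 1024))
print(logits.shape)  # torch.Size([2, 64, 22])
```

The key design point the sketch reflects is that each modality keeps its own recurrent state, so appearance and motion histories are modeled separately and only interact in the fusion step, rather than being concatenated at the input as in single-channel RNN baselines.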
