Abstract

Temporal action detection is a challenging task in video understanding, due to the complexity of the background and rich action content impacting high-quality temporal proposals generation in untrimmed videos. Capsule networks can avoid some limitations of the invariance caused by pooling and inability from convolutional neural networks, which can better understand the temporal relations for temporal action detection. However, because of the extremely computationally expensive procedure, capsule network is difficult to be applied to the task of temporal action detection. To address this issue, this paper proposes a novel U-shaped capsule network framework with a k-Nearest Neighbor (k-NN) mechanism of 3D convolutional dynamic routing, which we named U-BlockConvCaps. Furthermore, we build a Capsules Boundary Network (CapsBoundNet) based on U-BlockConvCaps for dense temporal action proposal generation. Specifically, the first module is one 1D convolutional layer for fusing the two-stream with RGB and optical flow video features. The sampling module further processes the fused features to generate the 2D start-end action proposal feature maps. Then, the multi-scale U-Block convolutional capsule module with 3D convolutional dynamic routing is used to process the proposal feature map. Finally, the feature maps generated from the CapsBoundNet are used to predict starting, ending, action classification, and action regression score maps, which help to capture the boundary and intersection over union features. Our work innovatively improves the dynamic routing algorithm of capsule networks and extends the use of capsule networks to the temporal action detection task for the first time in the literature. The experimental results on benchmarks THUMOS14 show that the performance of CapsBoundNet is obviously beyond the state-of-the-art methods, e.g., the mAP@tIoU = 0.3, 0.4, 0.5 on THUMOS14 are improved from 63.6% to 70.0%, 57.8% to 63.1%, 51.3% to 52.9%, respectively. We also got competitive results on the action detection dataset of ActivityNet1.3.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call