Abstract

Video-based action recognition, which must handle temporal motion and spatial cues simultaneously, remains a challenging task. In this paper, our motivation is to address this issue by fully exploiting temporal information. Specifically, we propose a novel lightweight Voting-based Temporal Correlation (VTC) module to enhance temporal information. The module contains multiple branches with different temporal sampling intervals, which are regarded as voters: the final classification result is “voted” on by these branches together. The VTC module integrates a sparse temporal sampling strategy into feature sequences, mitigating the effect of redundant information and focusing more on temporal modeling. Additionally, we propose a simple and intuitive Similarity Loss (SL) to guide the training of the VTC module and the backbone network. By intentionally introducing confusion into the predicted vector, SL eases intra-class variation, encouraging the network to discover class-specific common motion patterns rather than sample-specific discriminative information. SL neither requires excessive parameter tuning during training nor adds significant computational overhead at test time. By combining the VTC module and SL with complementary advances in the field, we clearly outperform state-of-the-art results, achieving 83.0%, 98.4%, 49.6%, and 77.8% accuracy on HMDB51, UCF101, Something-Something-V1, and Kinetics, respectively.
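The voting idea described above can be illustrated with a minimal NumPy sketch. This is a hypothetical toy version, not the paper's implementation: each "branch" sub-samples a per-frame feature sequence with a different temporal interval, produces class scores through a shared linear classifier (a stand-in for the real branch networks), and the branch scores are averaged into the final "voted" prediction. All names and shapes here are illustrative assumptions.

```python
import numpy as np

def vtc_vote(features, weights, intervals=(1, 2, 4)):
    """Toy sketch of voting over temporal-sampling branches.

    features:  (T, D) per-frame feature sequence
    weights:   (D, C) shared linear classifier (illustrative stand-in)
    intervals: temporal sampling strides, one per branch ("voter")
    """
    votes = []
    for s in intervals:
        sampled = features[::s]        # sparse temporal sampling at stride s
        pooled = sampled.mean(axis=0)  # temporal average pooling
        votes.append(pooled @ weights) # this branch's class scores
    # the final prediction is "voted" by averaging the branch scores
    return np.mean(votes, axis=0)

rng = np.random.default_rng(0)
feats = rng.standard_normal((16, 8))   # 16 frames, 8-dim features
W = rng.standard_normal((8, 5))        # 5 action classes
scores = vtc_vote(feats, W)
print(scores.shape)
```

Because each branch sees the sequence at a different stride, short- and long-range temporal structure both contribute to the averaged score, which is the intuition behind treating the branches as voters.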
