Abstract

Modeling and recognizing activities in videos are key parts of many promising applications such as visual surveillance, human–computer interaction, and video summarization. However, current approaches mainly suffer from two issues: (a) short-term local and global spatial features are not well represented, since spatial redundancy and dependency have not been adequately considered in CNN-based action recognition, which further increases both memory and computation cost; and (b) long-term temporal consistency is not well captured, since action consistency across multiple clips is ignored by video-level action recognition approaches. To address these two issues, we propose a Self-Attentive Octave ResNet with Temporal Consistency (SOR-TC) for compressed video action recognition, which better captures short-term and long-term features in video and improves both the efficiency and the effectiveness of action recognition. In addition, this paper introduces a consistency hypothesis that adjacent clips should predict similar actions; accordingly, a consistency loss function is designed to learn the correlation between clips. Finally, extensive experimental results on two benchmark datasets, HMDB-51 and UCF-101, verify the effectiveness of the proposed method.
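
The abstract does not spell out the consistency loss, but the stated hypothesis (adjacent clips should predict similar actions) suggests a penalty on the divergence between neighboring clip predictions. The following is a minimal PyTorch sketch under that assumption; the function name `clip_consistency_loss`, the weight `lambda_c`, and the use of a mean-squared-error penalty are illustrative choices, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def clip_consistency_loss(clip_logits: torch.Tensor) -> torch.Tensor:
    """Penalize divergence between predictions of adjacent clips.

    clip_logits: tensor of shape (num_clips, num_classes) holding the
    per-clip classification logits of one video (num_clips >= 2).
    """
    # Convert logits to class probability distributions per clip.
    probs = F.softmax(clip_logits, dim=-1)
    # Compare each clip's prediction with that of the next clip.
    # MSE is one possible similarity penalty; the paper's actual
    # loss may use a different distance.
    return F.mse_loss(probs[:-1], probs[1:])

# Usage sketch: combined with a standard classification objective,
# where lambda_c is a hypothetical weighting hyperparameter.
# logits = model(clips)                      # (num_clips, num_classes)
# loss = ce_loss + lambda_c * clip_consistency_loss(logits)
```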
