Abstract

Visual context is fundamental to understanding human actions in videos. However, the discriminative temporal information in videos is usually sparse: most frames are redundant or mixed with a large amount of interfering information, which may lead to redundant computation and recognition failure. An important question is therefore how to employ temporal context information efficiently. In this paper, we propose a learnable temporal attention mechanism that automatically selects important time points from action sequences. We design an unsupervised Recurrent Temporal Sparse Autoencoder (RTSAE) network, which learns to extract sparse keyframes that sharpen discriminative capability while retaining descriptive capability and shielding against interfering information. By applying this technique to a dual-stream convolutional neural network, we significantly improve performance in both accuracy and efficiency. Experiments demonstrate that, with the help of the RTSAE, our method achieves results competitive with the state of the art on the UCF101 and HMDB51 datasets.
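To make the idea concrete, the sketch below illustrates one plausible reading of a recurrent temporal sparse autoencoder with a learnable attention head over frames: a recurrent encoder scores each frame, the sequence is reconstructed from the attention-gated encoding, and an L1 penalty on the scores encourages only a few keyframes to survive. This is a minimal, hypothetical illustration under assumed feature dimensions and loss weights, not the authors' released implementation.

```python
# Hypothetical sketch of an RTSAE-style keyframe selector (assumed details,
# not the paper's code): GRU encoder, per-frame attention scores, GRU decoder
# reconstructing frame features, and an L1 sparsity term on the scores.
import torch
import torch.nn as nn


class RTSAESketch(nn.Module):
    def __init__(self, feat_dim=2048, hidden_dim=512):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.attn = nn.Linear(hidden_dim, 1)        # per-frame importance score
        self.decoder = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, feat_dim)  # reconstruct frame features

    def forward(self, frames):
        # frames: (batch, time, feat_dim) pre-extracted per-frame CNN features
        enc, _ = self.encoder(frames)               # (batch, time, hidden)
        scores = torch.sigmoid(self.attn(enc))      # (batch, time, 1), in [0, 1]
        gated = enc * scores                        # suppress low-importance frames
        dec, _ = self.decoder(gated)
        recon = self.out(dec)                       # (batch, time, feat_dim)
        return recon, scores.squeeze(-1)


def rtsae_loss(recon, frames, scores, sparsity_weight=0.01):
    # Reconstruction keeps the encoding descriptive; the L1 term drives most
    # frame scores toward zero, leaving a sparse set of keyframes.
    recon_loss = nn.functional.mse_loss(recon, frames)
    return recon_loss + sparsity_weight * scores.abs().mean()


if __name__ == "__main__":
    model = RTSAESketch()
    clip = torch.randn(2, 32, 2048)                 # 2 clips, 32 frames each
    recon, scores = model(clip)
    loss = rtsae_loss(recon, clip, scores)
    keyframes = scores.topk(k=8, dim=1).indices     # keep 8 keyframes per clip
    print(loss.item(), keyframes.shape)
```

In such a setup, only the selected keyframes would be passed on to the dual-stream CNN, which is what yields the efficiency gain the abstract describes.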
