Abstract

In this paper, we propose a novel temporal spiking recurrent neural network (TSRNN) to perform robust action recognition in videos. The proposed TSRNN employs a novel spiking architecture which uses the local discriminative features from high-confidence, reliable frames as spiking signals. The conventional CNN-RNNs typically used for this problem treat all frames as equally important, which makes them prone to errors caused by noisy frames. The TSRNN solves this problem with a temporal pooling architecture that helps the RNN select sparse, reliable frames and enhances its capability in modeling long-range temporal information. In addition, a message-passing bridge is added between the spiking signals and the recurrent unit. In this way, the spiking signals can guide the RNN to protect its long-term memory across multiple frames from contamination caused by noisy frames with distracting factors (e.g., occlusion, rapid scene transition). With these two novel components, TSRNN achieves competitive performance compared with state-of-the-art CNN-RNN architectures on two large-scale public benchmarks, UCF101 and HMDB51.
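To make the two components concrete, the following is a minimal sketch of the idea in PyTorch: a confidence head selects sparse, reliable frames as spiking signals, and a message-passing bridge injects them into the recurrent memory. This is a hedged illustration under assumed shapes and names (`confidence`, `bridge`, `top_k`, the additive correction), not the authors' implementation.

```python
import torch
import torch.nn as nn

class TSRNNSketch(nn.Module):
    """Hypothetical sketch of the two components described above:
    a confidence head that picks sparse, reliable frames as spiking
    signals, and a message-passing bridge that feeds them into the
    recurrent memory. Not the authors' code."""

    def __init__(self, feat_dim: int, hidden_dim: int, top_k: int = 5):
        super().__init__()
        self.top_k = top_k
        self.confidence = nn.Linear(feat_dim, 1)       # per-frame reliability score
        self.rnn_cell = nn.GRUCell(feat_dim, hidden_dim)
        self.bridge = nn.Linear(feat_dim, hidden_dim)  # message-passing bridge

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (T, feat_dim) CNN features, one row per video frame
        scores = self.confidence(frames).squeeze(-1)   # (T,)
        k = min(self.top_k, frames.size(0))
        spike_mask = torch.zeros_like(scores, dtype=torch.bool)
        spike_mask[scores.topk(k).indices] = True      # sparse, reliable frames

        h = frames.new_zeros(1, self.rnn_cell.hidden_size)
        for t in range(frames.size(0)):
            h = self.rnn_cell(frames[t:t + 1], h)      # ordinary recurrent update
            if spike_mask[t]:
                # Bridge: let the reliable frame's features correct the
                # long-term memory accumulated over possibly noisy frames.
                h = h + torch.tanh(self.bridge(frames[t:t + 1]))
        return h.squeeze(0)                            # final clip representation
```

For example, `TSRNNSketch(512, 256)(torch.randn(16, 512))` would produce a 256-dim clip representation in which only the five highest-confidence frames have passed through the bridge.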

Highlights

  • Human action recognition in videos has drawn growing attention in computer vision, owing to its broad practical applications in many areas such as visual surveillance, behavior analysis, and virtual reality [1]–[5]

  • To recognize actions more robustly even in the presence of noisy frames, we propose a novel temporal spiking recurrent neural network (TSRNN)

  • Our contribution can be summarized as follows: (i) We propose a novel temporal spiking recurrent neural network (TSRNN) where the pooling operation is implemented at the frame level instead of the pixel level (see the sketch below)
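The frame-level versus pixel-level distinction in contribution (i) can be illustrated as follows. The shapes and the stand-in `frame_scores` are assumptions for illustration, not the paper's specification.

```python
import torch

# A clip of T frames, each a C-channel CNN feature map of size H x W
# (illustrative shapes, not from the paper).
T, C, H, W = 16, 512, 7, 7
clip = torch.randn(T, C, H, W)

# Pixel-level pooling (conventional): collapse spatial positions within
# every frame; all T frames survive, noisy or not.
per_frame = clip.amax(dim=(2, 3))                   # (T, C)

# Frame-level pooling (as proposed): score whole frames and keep only
# the k most reliable ones along the temporal axis.
# 'frame_scores' stands in for a learned confidence head.
frame_scores = torch.randn(T)
k = 4
keep = frame_scores.topk(k).indices.sort().values   # preserve temporal order
sparse_frames = per_frame[keep]                     # (k, C): reliable frames only
```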

Summary

INTRODUCTION

Human action recognition in videos has drawn growing attention in computer vision, owing to its broad practical applications in many areas such as visual surveillance, behavior analysis, and virtual reality [1]–[5]. CNN-based methods, however, only learn the local visual appearance of each frame and are limited in modeling long-term cross-frame motion and other dynamics from a global view, leading to inferior performance. To address this issue, some works [8], [9], [18], [19] propose to build recurrent neural networks (RNNs) upon CNNs to capture long-term information. These methods treat the information from all frames as equally important, which inevitably introduces noise from "bad" frames caused by occlusion, fast motion, or rapid scene transition. Such noise contaminates the representations learned by the RNN in an accumulative way, which may bring irreparable damage to the final action recognition result.

RELATED WORK
KEY-FRAME BRANCH
TEMPORAL CONTEXT BRANCH
ACCUMULATIVE LOSS FUNCTION
FUSION OF RGB-TSRNN AND OF-TSRNN
Findings
CONCLUSION