Abstract

In this work, we address the problem of human action recognition in videos. We propose and analyze a multi-stream architecture composed of image-based networks pre-trained on the large-scale ImageNet dataset. Different image representations are extracted from the videos to feed the streams, providing complementary information to the system. Here, we propose new streams based on visual rhythm, which encodes longer-term information than still frames and optical flow. Our main contribution is a stream based on a new variant of visual rhythm, the Learnable Visual Rhythm (LVR), formed from the outputs of a deep network. The features are collected at multiple depths to enable the analysis of different abstraction levels. This strategy significantly outperforms the handcrafted version on the UCF101 and HMDB51 datasets. We also investigate many combinations of the streams to identify the modalities that best complement each other. Experiments conducted on the two datasets show that our multi-stream network achieves competitive results compared with state-of-the-art approaches.
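To make the LVR idea concrete, the sketch below illustrates one plausible way to build such a representation: per-frame activations are collected at several depths of an ImageNet-pretrained CNN, pooled, and stacked over time into a 2D, image-like map that an image-based stream can consume. The backbone (ResNet-18), the tapped layers, and the pooling are illustrative assumptions, not the configuration reported in the paper.

```python
import torch
from torchvision import models

# Assumed backbone: an ImageNet-pretrained ResNet-18 (illustrative choice).
backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
backbone.eval()

# Layers at different depths whose outputs are collected (assumed choice).
tap_layers = [backbone.layer2, backbone.layer3, backbone.layer4]

def frame_descriptor(frame):
    """Global-average-pool the activations of each tapped layer and
    concatenate them into a single 1D descriptor for one frame."""
    feats, hooks = [], []

    def make_hook(store):
        def hook(module, inp, out):
            store.append(out.mean(dim=(2, 3)))   # (1, C) pooled features
        return hook

    for layer in tap_layers:
        hooks.append(layer.register_forward_hook(make_hook(feats)))
    with torch.no_grad():
        backbone(frame.unsqueeze(0))             # frame: (3, H, W)
    for h in hooks:
        h.remove()
    return torch.cat(feats, dim=1).squeeze(0)    # (C2 + C3 + C4,)

def learnable_visual_rhythm(frames):
    """Stack per-frame descriptors over time: rows index time, columns
    index features, yielding a 2D map usable as an image-based stream."""
    return torch.stack([frame_descriptor(f) for f in frames], dim=0)

# Example with a dummy clip of 16 RGB frames (224x224).
video = [torch.randn(3, 224, 224) for _ in range(16)]
lvr = learnable_visual_rhythm(video)
print(lvr.shape)  # e.g., torch.Size([16, 896]) for ResNet-18 taps
```

In this sketch, collecting features from several depths is what exposes different abstraction levels to the rhythm image; the resulting map can then be classified by a separate image-based network and fused with the other streams.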
