Abstract

Recently, deep convolutional networks (ConvNets) have achieved remarkable progress in action recognition in videos. Most existing deep frameworks treat a video as an unordered sequence of frames and make a prediction by averaging the outputs computed from individual RGB frames or stacked optical flow fields. However, a complex action in a video may consist of several atomic actions carried out sequentially over its temporal range. To address this issue, we propose a deep learning framework, sequential segment networks (SSN), to model video-level temporal structure. We extract several short snippets from a video with a sparse temporal sampling strategy, concatenate the outputs of ConvNets applied to these snippets, and feed the concatenated consensus vector into a fully connected layer that learns the temporal structure. The sparse sampling strategy and video-level modeling enable an efficient and effective training process for SSNs. Extensive empirical studies demonstrate that action recognition performance can be significantly improved by mining temporal structures, and our approach achieves state-of-the-art performance on the UCF101 and HMDB51 datasets.
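The following is a minimal sketch of the snippet-sampling-and-concatenation pipeline described above, assuming a PyTorch setting. The class name, the number of segments, the backbone, and the feature dimensions are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch of an SSN-style forward pass (hypothetical, for illustration).
import torch
import torch.nn as nn


class SequentialSegmentNetwork(nn.Module):
    def __init__(self, num_classes, num_segments=3, snippet_feat_dim=512):
        super().__init__()
        self.num_segments = num_segments
        # Shared snippet-level ConvNet; in practice this would be a deeper
        # backbone applied to an RGB frame or a stacked optical flow field.
        self.snippet_net = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=7, stride=2, padding=3),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(32, snippet_feat_dim),
        )
        # Fully connected layer over the concatenated (ordered) snippet
        # features, intended to capture video-level temporal structure.
        self.temporal_fc = nn.Linear(num_segments * snippet_feat_dim, num_classes)

    def forward(self, snippets):
        # snippets: (batch, num_segments, 3, H, W), one frame sparsely
        # sampled from each temporal segment of the video.
        b, k, c, h, w = snippets.shape
        feats = self.snippet_net(snippets.view(b * k, c, h, w))  # (b*k, d)
        consensus = feats.view(b, -1)       # concatenate in temporal order
        return self.temporal_fc(consensus)  # video-level class scores


# Usage: sparsely sample 3 snippets per video and classify.
model = SequentialSegmentNetwork(num_classes=101, num_segments=3)
video_snippets = torch.randn(2, 3, 3, 224, 224)
print(model(video_snippets).shape)  # torch.Size([2, 101])
```

Concatenating the snippet features in temporal order, rather than averaging them, is what allows the fully connected consensus layer to exploit the ordering of atomic actions within the video.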
