Abstract

Recognizing continuous human actions is a fundamental task in many real-world computer vision applications, including video surveillance, video retrieval, and human-computer interaction. It requires recognizing each action performed, as well as its segmentation boundaries, in a continuous sequence. Previous works have reported great progress on single-action recognition using deep convolutional networks. To further improve performance on continuous action recognition, in this paper we introduce a discriminative approach consisting of three modules. The first, a feature extraction module, uses a two-stream Convolutional Neural Network to capture appearance and short-term motion information from the raw video input. Based on the obtained features, the second, a classification module, performs spatial and temporal recognition and then fuses the scores from the two feature streams. In the final segmentation module, a semi-Markov Conditional Random Field model, capable of handling long-term action interactions, is built to partition the action sequence. As the experimental results show, our approach obtains state-of-the-art performance on public datasets including 50Salads, Breakfast, and MERL Shopping. We also visualize the continuous action segmentation results for a more insightful discussion in the paper.
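The abstract describes the segmentation module only at a high level. As an illustration of how a semi-Markov model can partition an action sequence, the following is a minimal sketch, not the paper's actual model: segmental Viterbi decoding over fused frame-wise class scores, where a hypothetical constant per-segment penalty stands in for the learned interaction terms of a real semi-Markov CRF.

```python
import numpy as np

def semi_markov_decode(frame_scores, max_len, seg_penalty):
    """Sketch of semi-Markov (segmental) Viterbi decoding.

    frame_scores: (T, C) array of fused per-frame class scores
                  (e.g., combined spatial- and temporal-stream outputs).
    max_len:      maximum allowed segment duration.
    seg_penalty:  hypothetical cost charged per segment, discouraging
                  over-fragmented partitions (a stand-in for the learned
                  long-term interaction terms of an actual semi-Markov CRF).
    Returns a list of (start, end, label) segments covering frames [0, T).
    """
    T, C = frame_scores.shape
    # cum[t, c] = sum of scores for class c over frames [0, t)
    cum = np.vstack([np.zeros((1, C)), np.cumsum(frame_scores, axis=0)])
    best = np.full(T + 1, -np.inf)  # best[t]: score of best partition of [0, t)
    best[0] = 0.0
    back = [None] * (T + 1)         # back[t]: (segment start, segment label)
    for t in range(1, T + 1):
        for d in range(1, min(max_len, t) + 1):
            seg = cum[t] - cum[t - d]     # per-class score of frames [t-d, t)
            c = int(np.argmax(seg))       # best label for this candidate segment
            score = best[t - d] + seg[c] - seg_penalty
            if score > best[t]:
                best[t] = score
                back[t] = (t - d, c)
    # Backtrack from t = T to recover the optimal segmentation.
    segments, t = [], T
    while t > 0:
        s, c = back[t]
        segments.append((s, t, c))
        t = s
    return segments[::-1]

# Toy example: 6 frames, 2 classes; frames 0-2 favor class 0, frames 3-5 class 1.
scores = np.array([[1.0, 0.0]] * 3 + [[0.0, 1.0]] * 3)
print(semi_markov_decode(scores, max_len=6, seg_penalty=0.1))
# -> [(0, 3, 0), (3, 6, 1)]
```

Because segments are scored as wholes, this dynamic program considers every admissible segment duration at each frame, which is what lets a semi-Markov model capture longer-range structure than a per-frame classifier.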
