Abstract

Dense trajectories and low-level local features have recently been widely used in action recognition. However, most of these methods ignore the motion parts of an action, which are the key factor in distinguishing different human actions. This paper proposes a new two-layer representation for action recognition that describes a video with both low-level features and a mid-level motion part model. First, we encode trajectory-based local features computed on camera-motion-compensated flow (w-flow) with Fisher Vectors (FV) to retain the low-level characteristics of motion. Then, motion parts are extracted by clustering similar trajectories according to the spatiotemporal distance between them. Finally, the video representation is the concatenation of the low-level descriptor encoding vector and the motion part encoding vector, which is fed to LibSVM for action recognition. Experimental results demonstrate improvements on the J-HMDB and YouTube datasets, reaching 67.4% and 87.6% accuracy, respectively.
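To make the pipeline concrete, the sketch below clusters trajectories into motion parts by their pairwise spatiotemporal distance and builds the final two-layer representation by concatenation. This is a minimal sketch only: the distance measure (mean point-wise L2 over overlapping frames), the average-linkage clustering, and the number of parts (`n_parts`) are illustrative assumptions, not the paper's exact formulation, and any off-the-shelf SVM can stand in for LibSVM at the classification step.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def spatiotemporal_distance(t1, t2):
    # Each trajectory is an (L, 3) array of (x, y, frame) points.
    # Mean point-wise L2 distance over the overlapping frames -- a
    # simplified stand-in for the paper's spatiotemporal measure.
    n = min(len(t1), len(t2))
    return float(np.linalg.norm(t1[:n] - t2[:n], axis=1).mean())

def extract_motion_parts(trajectories, n_parts=10):
    # Group similar trajectories into motion parts via average-linkage
    # hierarchical clustering on the pairwise distance matrix.
    n = len(trajectories)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = spatiotemporal_distance(
                trajectories[i], trajectories[j])
    labels = fcluster(linkage(squareform(dist), method="average"),
                      t=n_parts, criterion="maxclust")
    return [np.where(labels == k)[0] for k in range(1, n_parts + 1)]

def video_representation(fv_low_level, motion_part_vector):
    # Final two-layer representation: concatenation of the low-level FV
    # encoding and the motion part encoding, then fed to an SVM.
    return np.concatenate([fv_low_level, motion_part_vector])
```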

Highlights

  • Human action recognition has become a prominent topic in computer vision

  • The state-of-the-art approach is the popular Fisher Vector (FV) [5] encoding model built on spatiotemporal local features

  • These methods remain imperfect because they consider only low-level spatiotemporal features around interest points and ignore the higher-level features of motion parts


Summary

Introduction

Human action recognition has become a prominent topic in computer vision, with practical applications in video surveillance, interactive gaming, and video annotation. Much existing research on human action recognition extracts features from whole 3D video volumes using spatiotemporal interest points (STIP) [4]. Local trajectory-based features are then pooled and normalized into a single vector that serves as the global video representation. The state-of-the-art method is the popular Fisher Vector (FV) [5] encoding model built on spatiotemporal local features. Since both low-level local feature encoding and a mid-level motion part model are key factors in distinguishing different human actions, we propose a new representation for action (depicted in Figure 2). We represent the video by combining the low-level trajectory-based feature encoding model with the mid-level motion part model.
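As a concrete illustration of the FV encoding step, the following sketch computes a Fisher Vector for a set of local descriptors under a diagonal-covariance Gaussian mixture model, using the standard gradients of the log-likelihood with respect to the means and variances; the GMM size (64 components) and the power/L2 normalization are common defaults assumed here, not details drawn from this paper.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector(descriptors, gmm):
    # Fisher Vector of local descriptors (N, D) under a diagonal-covariance
    # GMM: gradients of the log-likelihood w.r.t. the means and variances.
    q = gmm.predict_proba(descriptors)                       # (N, K) soft assignments
    N, _ = descriptors.shape
    mu, var, w = gmm.means_, gmm.covariances_, gmm.weights_  # (K, D), (K, D), (K,)
    diff = (descriptors[:, None, :] - mu[None]) / np.sqrt(var)[None]  # (N, K, D)
    g_mu = (q[..., None] * diff).sum(axis=0) / (N * np.sqrt(w)[:, None])
    g_var = (q[..., None] * (diff ** 2 - 1)).sum(axis=0) / (N * np.sqrt(2 * w)[:, None])
    fv = np.hstack([g_mu.ravel(), g_var.ravel()])
    fv = np.sign(fv) * np.sqrt(np.abs(fv))      # power normalization
    return fv / (np.linalg.norm(fv) + 1e-12)    # L2 normalization

# Usage: fit the GMM on training descriptors, then encode each video.
# gmm = GaussianMixture(n_components=64, covariance_type="diag").fit(train_descriptors)
# video_fv = fisher_vector(video_descriptors, gmm)
```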

