Abstract

The advent of cost-effectiveness and easy-operation depth cameras has facilitated a variety of visual recognition tasks including human activity recognition. This paper presents a novel framework for recognizing human activities from video sequences captured by depth cameras. We extend the surface normal to polynormal by assembling local neighboring hypersurface normals from a depth sequence to jointly characterize local motion and shape information. We then propose a general scheme of super normal vector (SNV) to aggregate the low-level polynormals into a discriminative representation, which can be viewed as a simplified version of the Fisher kernel representation. In order to globally capture the spatial layout and temporal order, an adaptive spatio-temporal pyramid is introduced to subdivide a depth video into a set of space-time cells. In the extensive experiments, the proposed approach achieves superior performance to the state-of-the-art methods on the four public benchmark datasets, i.e., MSRAction3D, MSRDailyActivity3D, MSRGesture3D, and MSRActionPairs3D.

Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call