Abstract

This paper presents an effective multiscale energy-based global ternary image (E-GTI) representation for action recognition from depth sequences. The distinctive property of our representation is that it accounts for both spatiotemporal discrimination and variations in action speed, aiming to address the problems of distinguishing similar actions and recognizing actions performed at different speeds within a single framework. The method proceeds in two stages. In the first stage, consecutive depth frames are used to generate global ternary image (GTI) features, which implicitly capture both inter-frame motion regions and motion directions. Specifically, each pixel in the GTI takes one of three states, positive, negative, or neutral, indicating an increased, decreased, or unchanged depth value, respectively. To cope with speed variations, an energy-based sampling method is applied, yielding multiscale E-GTI features; the multiscale scheme efficiently captures the temporal relationships among frames. In the second stage, all E-GTI features are converted into robust descriptors by the Radon transform (RT) and then aggregated into a compact representation with the bag-of-visual-words model. Extensive experiments on benchmark data sets show that our representation outperforms state-of-the-art approaches because it captures discriminative spatiotemporal information about actions. Owing to the merits of energy-based sampling and the RT, our representation is also robust to speed variations, depth noise, and partial occlusion.
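As an informal illustration of the first-stage GTI encoding and the RT descriptor, the Python sketch below encodes the per-pixel depth change between two consecutive frames as a ternary image and projects it with the Radon transform from scikit-image. The function names, the noise threshold tau, and the use of scikit-image are illustrative assumptions; the abstract does not specify these implementation details, nor the energy-based sampling or bag-of-visual-words steps.

import numpy as np
from skimage.transform import radon

def global_ternary_image(depth_prev, depth_curr, tau=20):
    # Encode the per-pixel depth change between two consecutive frames:
    # +1 where depth increased by more than tau, -1 where it decreased by
    # more than tau, 0 (neutral) otherwise. tau is a hypothetical noise
    # threshold, not taken from the paper.
    diff = depth_curr.astype(np.int32) - depth_prev.astype(np.int32)
    gti = np.zeros(diff.shape, dtype=np.int8)
    gti[diff > tau] = 1
    gti[diff < -tau] = -1
    return gti

def radon_descriptor(gti, n_angles=180):
    # Project the ternary image along n_angles directions and flatten the
    # resulting sinogram into a descriptor vector.
    theta = np.linspace(0.0, 180.0, n_angles, endpoint=False)
    sinogram = radon(gti.astype(np.float64), theta=theta, circle=False)
    return sinogram.ravel()

# Toy example on synthetic 16-bit depth frames.
rng = np.random.default_rng(0)
frame0 = rng.integers(500, 4000, size=(240, 320)).astype(np.uint16)
frame1 = frame0.copy()
frame1[100:140, 150:200] += 80   # a region whose depth increases between frames
gti = global_ternary_image(frame0, frame1, tau=20)
descriptor = radon_descriptor(gti)
print(gti.shape, descriptor.shape)

In a full pipeline, such RT descriptors computed from multiscale E-GTI features would then be quantized and pooled, for example with a bag-of-visual-words model, to form the final action representation.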
