Abstract
Human action recognition based on 3D data is attracting increasing attention because it can provide richer spatial and temporal information than RGB videos. The key challenge for depth-map-based methods is to capture the cues linking spatial appearance and temporal motion. In this paper, we propose a straightforward and efficient framework for modeling human actions from depth map sequences, considering both short-term and long-term dependencies. A frame-level feature, termed the depth-oriented gradient vector (DOGV), is developed to capture appearance and motion over a short-term duration. For long-term dependencies, we construct a convolutional neural network (CNN) backbone to aggregate frame-level features across space and time. The proposed method is comprehensively evaluated on four public benchmark datasets: NTU RGB+D, NTU RGB+D 120, PKU-MMD and UOW LSC. The experimental results demonstrate that the proposed approach solves the problem of 3D human action recognition efficiently and achieves state-of-the-art performance.
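The abstract does not give the exact DOGV formulation, but the idea of a frame-level descriptor built from depth gradients can be sketched as follows. This is a minimal illustration, not the authors' method: it assumes spatial appearance is summarized by a histogram of spatial depth-gradient orientations, and short-term motion by the temporal difference to the next frame. The function name and the histogram/pooling choices are hypothetical.

```python
import numpy as np

def frame_gradient_descriptor(depth_seq, n_bins=8):
    """Illustrative per-frame descriptor from a depth map sequence.

    depth_seq: array of shape (T, H, W), one depth map per frame.
    For each frame, spatial gradients (gy, gx) capture appearance;
    the difference to the next frame captures short-term motion.
    Returns an array of shape (T, n_bins + 1).
    """
    depth_seq = depth_seq.astype(np.float64)
    T = depth_seq.shape[0]
    feats = []
    for t in range(T):
        # Spatial depth gradients: orientation weighted by magnitude.
        gy, gx = np.gradient(depth_seq[t])
        mag = np.hypot(gx, gy)
        ang = np.arctan2(gy, gx)  # orientations in [-pi, pi]
        hist, _ = np.histogram(ang, bins=n_bins,
                               range=(-np.pi, np.pi), weights=mag)
        norm = np.linalg.norm(hist)
        if norm > 0:
            hist = hist / norm
        # Temporal gradient to the next frame (last frame reuses itself,
        # giving zero motion there).
        gt = depth_seq[min(t + 1, T - 1)] - depth_seq[t]
        feats.append(np.concatenate([hist, [np.abs(gt).mean()]]))
    return np.stack(feats)
```

In the paper's framework, per-frame features of this kind would then be stacked along the time axis and fed to a CNN backbone that aggregates them into a sequence-level representation for classification.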