Abstract
Human action recognition based on 3D data is attracting increasing attention because it can provide richer spatial and temporal information than RGB videos. The key challenge for depth-map-based methods is to capture the cues linking spatial appearance and temporal motion. In this paper, we propose a straightforward and efficient framework for modeling human actions from depth map sequences, considering both short-term and long-term dependencies. A frame-level feature, termed the depth-oriented gradient vector (DOGV), is developed to capture appearance and motion over a short-term duration. For long-term dependencies, we construct a convolutional neural network (CNN) backbone to aggregate frame-level features across space and time. The proposed method is comprehensively evaluated on four public benchmark datasets: NTU RGB+D, NTU RGB+D 120, PKU-MMD and UOW LSC. The experimental results demonstrate that the proposed approach solves the problem of 3D human action recognition efficiently and achieves state-of-the-art performance.
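The abstract does not give the exact DOGV formulation, but the idea of a frame-level descriptor built from depth gradients can be sketched as follows. This is a minimal illustration, not the authors' method: it assumes spatial appearance is summarized by a histogram of spatial depth-gradient orientations, and short-term motion by the temporal difference to the next frame. The function name and the histogram/pooling choices are hypothetical.

```python
import numpy as np

def frame_gradient_descriptor(depth_seq, n_bins=8):
    """Illustrative per-frame descriptor from a depth map sequence.

    depth_seq: array of shape (T, H, W), one depth map per frame.
    For each frame, spatial gradients (gy, gx) capture appearance;
    the difference to the next frame captures short-term motion.
    Returns an array of shape (T, n_bins + 1).
    """
    depth_seq = depth_seq.astype(np.float64)
    T = depth_seq.shape[0]
    feats = []
    for t in range(T):
        # Spatial depth gradients: orientation weighted by magnitude.
        gy, gx = np.gradient(depth_seq[t])
        mag = np.hypot(gx, gy)
        ang = np.arctan2(gy, gx)  # orientations in [-pi, pi]
        hist, _ = np.histogram(ang, bins=n_bins,
                               range=(-np.pi, np.pi), weights=mag)
        norm = np.linalg.norm(hist)
        if norm > 0:
            hist = hist / norm
        # Temporal gradient to the next frame (last frame reuses itself,
        # giving zero motion there).
        gt = depth_seq[min(t + 1, T - 1)] - depth_seq[t]
        feats.append(np.concatenate([hist, [np.abs(gt).mean()]]))
    return np.stack(feats)
```

In the paper's framework, per-frame features of this kind would then be stacked along the time axis and fed to a CNN backbone that aggregates them into a sequence-level representation for classification.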