Abstract

Temporal information plays a significant role in video-based human action recognition, and effectively extracting the spatial–temporal characteristics of actions in videos remains a challenging problem. Most existing methods acquire spatial and temporal cues in videos separately. In this article, we propose a new, effective representation for depth video sequences, called hierarchical dynamic depth projected difference images, that aggregates the spatial and temporal information of actions simultaneously at different temporal scales. We first project depth video sequences onto three orthogonal Cartesian views to capture the 3D shape and motion information of human actions. Hierarchical dynamic depth projected difference images are then constructed with rank pooling in each projected view to hierarchically encode the spatial–temporal motion dynamics in depth videos. Convolutional neural networks can automatically learn discriminative features from images and have been extended to video classification because of their superior performance. To verify the effectiveness of the hierarchical dynamic depth projected difference images representation, we construct a recognition framework in which the hierarchical dynamic depth projected difference images from the three views are fed into three identical pretrained convolutional neural networks independently for fine-tuning. We design three classification schemes in the framework; each scheme utilizes different convolutional neural network layers so that their effects on action recognition can be compared, and the three views are combined in every scheme to describe the actions more comprehensively. The proposed framework is evaluated on three challenging public human action data sets. Experiments indicate that our method outperforms competing approaches and provides discriminative spatial–temporal information for human action recognition in depth videos.

Highlights

  • Human action recognition has attracted increasing attention throughout the computer vision community over the past years

  • Because the original depth map sequences are sampled progressively along the time axis, dynamic depth projected difference images (DDPDIs) are dynamically progressive along the temporal scale for human actions. The DDPDIs at different temporal scales in each projected view of a depth video are collectively named hierarchical dynamic depth projected difference images (HDDPDI), which serve as an effective representation of the video

  • We propose the HDDPDI representation for a depth video to describe the spatial–temporal dynamics of human actions from different temporal scales
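The hierarchical construction described in the highlights can be sketched in code. The snippet below is a minimal illustration, not the authors' implementation: it assumes the closed-form "approximate rank pooling" coefficients of Bilen et al. for the dynamic image, and it assumes (for illustration only) that the temporal scales are progressively growing prefixes of the sequence; the paper's exact scale-splitting scheme may differ.

```python
import numpy as np

def dynamic_image(frames):
    """Approximate rank pooling: weight each frame by a closed-form
    coefficient (2t - T - 1) and sum. The coefficients sum to zero,
    so a static sequence yields an all-zero dynamic image."""
    frames = np.asarray(frames, dtype=float)
    T = frames.shape[0]
    alpha = 2 * np.arange(1, T + 1) - T - 1
    return np.tensordot(alpha, frames, axes=(0, 0))

def hierarchical_dynamic_images(frames, num_scales=3):
    """One dynamic image per temporal scale: scale s pools the first
    s/num_scales fraction of the sequence (an assumed, illustrative
    split that grows progressively along time)."""
    frames = np.asarray(frames, dtype=float)
    T = frames.shape[0]
    images = []
    for s in range(1, num_scales + 1):
        end = max(2, round(T * s / num_scales))
        images.append(dynamic_image(frames[:end]))
    return images
```

Each level of the resulting list summarizes the motion up to a growing temporal extent, which matches the "dynamically progressive" behavior the highlight describes.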


Summary

Introduction

Human action recognition has attracted increasing attention throughout the computer vision community over the past years. We construct a CNN-based action recognition framework with the proposed hierarchical dynamic depth projected difference images (HDDPDI). Bilen et al.[30] applied rank pooling[31], an effective temporal pooling method, to the raw pixels of an RGB video sequence to produce the RGB dynamic image. Inspired by this idea, we extend the dynamic image to depth data and propose using rank pooling to encode a DPDI sequence into a dynamic depth projected difference image (DDPDI). To capture the temporal information effectively, we apply rank pooling[31] to the DPDI sequences in each projected view to obtain DDPDIs that encode the spatial–temporal variations of the whole video. With the help of rank pooling and the dynamic image, this method overcomes the drawback of the original DMM[3], which ignores video temporal information, and improves the discrimination of human action recognition.
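The DPDI-to-DDPDI step described above can be sketched as follows. This is a hedged sketch, not the paper's exact formulation: it assumes DPDIs are absolute differences between consecutive projected depth maps, and it uses the closed-form approximate rank pooling coefficients of Bilen et al. in place of a learned ranking function.

```python
import numpy as np

def dpdi_sequence(proj_maps):
    """DPDIs computed as absolute differences between consecutive
    projected depth maps (one plausible interpretation; the paper's
    exact definition may differ)."""
    proj_maps = np.asarray(proj_maps, dtype=float)
    return np.abs(proj_maps[1:] - proj_maps[:-1])

def ddpdi(proj_maps):
    """Rank-pool a DPDI sequence into one dynamic image using the
    closed-form approximate rank pooling coefficients (2t - T - 1)."""
    dpdis = dpdi_sequence(proj_maps)
    T = dpdis.shape[0]
    alpha = 2 * np.arange(1, T + 1) - T - 1
    return np.tensordot(alpha, dpdis, axes=(0, 0))
```

Because the coefficients grow with the frame index, later motion contributes positively and earlier motion negatively, so the single output image preserves the temporal ordering of the motion, which is what the original DMM-style accumulation discards.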

Related works
Experiments and discussions
Method
Findings
Conclusion and future work
