Most human action recognition methods based on 3D skeleton features are sensitive to changes in viewpoint, motion scale, and human scale. In addition, acquiring depth information from real outdoor scenes yields poor precision or incurs high computational cost. To address these drawbacks, in this study, we propose a new action recognition method based on RGB video and 2D skeletons that combines a local joint trajectory volume representation with feature coding. First, a video is transformed into a set of volumes, called local joint trajectory volumes. Then, hand-crafted features and convolutional networks are used to compute features for the RGB image sequence of each volume. Unlike most works, which use convolutional networks to learn global video features, this paper examines how a convolutional network can represent local video regions. Finally, the feature set of each joint is encoded into a Fisher vector that serves as the action feature, and a classifier is trained with a linear SVM. Experimental results show that skeleton-joint-based features yield a more compact and effective action representation than competing approaches.
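The abstract gives no implementation details, so the following is only a minimal sketch of the final two stages it names: Fisher vector encoding of a video's local descriptors and linear SVM classification. All names, dimensions, and the toy data are illustrative assumptions, not taken from the paper; the GMM codebook and the per-volume descriptors stand in for the paper's hand-crafted and convolutional features.

```python
# Hypothetical sketch: Fisher vector encoding + linear SVM, assuming local
# descriptors per video have already been extracted (toy random data below).
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import LinearSVC

def fisher_vector(descriptors, gmm):
    """Encode local descriptors (N x D) into a Fisher vector using first-
    and second-order statistics of a fitted diagonal-covariance GMM."""
    q = gmm.predict_proba(descriptors)                     # N x K posteriors
    n, _ = descriptors.shape
    diff = (descriptors[:, None, :] - gmm.means_[None]) / np.sqrt(gmm.covariances_[None])
    # Gradients w.r.t. GMM means (first order) and variances (second order)
    fv_mu = (q[:, :, None] * diff).sum(0) / (n * np.sqrt(gmm.weights_)[:, None])
    fv_sigma = (q[:, :, None] * (diff**2 - 1)).sum(0) / (n * np.sqrt(2 * gmm.weights_)[:, None])
    fv = np.hstack([fv_mu.ravel(), fv_sigma.ravel()])
    # Standard power- and L2-normalization for Fisher vectors
    fv = np.sign(fv) * np.sqrt(np.abs(fv))
    return fv / (np.linalg.norm(fv) + 1e-12)

# Toy data: 40 videos, each a variable-size set of 64-D local descriptors
rng = np.random.default_rng(0)
videos = [rng.normal(size=(rng.integers(50, 100), 64)) for _ in range(40)]
labels = rng.integers(0, 4, size=40)

gmm = GaussianMixture(n_components=8, covariance_type="diag", random_state=0)
gmm.fit(np.vstack(videos))                                 # codebook over all descriptors

X = np.stack([fisher_vector(v, gmm) for v in videos])      # one FV per video
clf = LinearSVC(C=1.0).fit(X, labels)                      # linear SVM classifier
print("train accuracy:", clf.score(X, labels))
```

In practice one Fisher vector would be built per joint, as the abstract describes, and the per-joint vectors concatenated or pooled before classification; the single-vector version above keeps the sketch short.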