Abstract
In recent years, skeleton-based human action recognition (HAR) approaches using convolutional neural network (CNN) models have made tremendous progress in computer vision applications. However, depicting human actions with relative features, while preventing overfitting when the CNN model is trained on only a few samples, remains a challenge. In this paper, a new motion image is introduced to transform spatial-temporal motion information into image-based representations. For each skeleton sequence, three relative features are extracted to describe human actions: relative coordinates, immediate displacement, and immediate motion orientation. In particular, the relative coordinates introduced in this paper not only depict the spatial relations of human skeleton joints but also provide long-term temporal information. To address the problem of small sample sizes, a data augmentation strategy consisting of three simple but effective methods is proposed to expand the training samples. Because the generated color images are small, a shallow CNN model is sufficient to extract the deep features of the generated motion images. Two small-scale but challenging skeleton datasets were used to evaluate the method, which scored 96.59% on the Florence 3D Actions dataset and 97.48% on the UTKinect-Action 3D dataset. The results show that the proposed method achieves competitive performance compared with state-of-the-art methods. Furthermore, the proposed augmentation strategy effectively alleviates the overfitting problem and can be widely adopted in skeleton-based action recognition.
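As an illustration of the three relative features, the following sketch computes them from a raw joint sequence. It is a minimal example, assuming the joints are given as a (T, J, 3) array and that the relative coordinates are taken with respect to a single reference joint; the function and parameter names are illustrative and are not taken from the paper.

import numpy as np

def relative_features(joints, ref_joint=0):
    """Sketch of the three per-frame relative features described above.

    joints: array of shape (T, J, 3) -- T frames, J skeleton joints, 3D coordinates.
    ref_joint: index of a reference joint (e.g. the hip/torso centre); an illustrative choice.
    """
    # Relative coordinates: each joint expressed relative to the reference joint.
    rel_coords = joints - joints[:, ref_joint:ref_joint + 1, :]

    # Immediate displacement: per-joint difference between consecutive frames.
    displacement = joints[1:] - joints[:-1]

    # Immediate motion orientation: unit direction of the displacement vectors.
    norm = np.linalg.norm(displacement, axis=-1, keepdims=True)
    orientation = displacement / np.maximum(norm, 1e-8)

    return rel_coords, displacement, orientation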
Introduction
In recent years, human action recognition (HAR) has received increasing attention in the field of computer vision because it has a wide range of industrial applications, such as human-computer interaction, smart video surveillance, and health care [1].
According to [24,25], because the motion color image is small in size and simple in structure, a shallow convolutional neural network (CNN) framework is sufficient to extract the deep features of the generated motion images for HAR.
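To make the idea of a shallow CNN on small motion color images concrete, the following PyTorch sketch shows a two-convolution network. The layer widths, kernel sizes, input resolution, and number of classes are assumptions chosen for illustration, not the exact architecture used in the cited works.

import torch
import torch.nn as nn

class ShallowMotionCNN(nn.Module):
    """Illustrative shallow CNN for small motion color images.

    The channel counts, kernel sizes, input resolution (3x32x32) and the
    number of classes are assumptions for this sketch, not the paper's network.
    """
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),  # motion image has 3 color channels
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                             # 32x32 -> 16x16
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                             # 16x16 -> 8x8
        )
        self.classifier = nn.Linear(64 * 8 * 8, num_classes)

    def forward(self, x):
        x = self.features(x)
        return self.classifier(torch.flatten(x, 1))

# Example: a batch of four 32x32 motion images.
logits = ShallowMotionCNN(num_classes=10)(torch.randn(4, 3, 32, 32))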
Summary
HAR has received increasing attention in the field of computer vision. HAR using handcrafted features has two stages, feature extraction and feature representation, which together form the final feature descriptor. In the feature extraction stage, various kinds of motion features have been proposed, such as relative coordinates and the angles between joints. To address the issues mentioned above, we propose a novel skeleton-based action recognition method using a CNN model. The spatial-temporal motion data of human actions are encoded into an image-based representation. To cope with varying-length skeleton sequences, an effective skeleton sequence refinement strategy is used to align action sequences; in this way, the generated motion images have a consistent spatial size. Because the proposed skeleton-based motion images are small, a shallow CNN model is sufficient to efficiently extract deep features.
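The sequence alignment step can be illustrated with a simple temporal resampling, shown below. This is only a sketch: it assumes linear interpolation along the time axis and a fixed target length of 32 frames; the paper's actual refinement strategy may differ.

import numpy as np

def align_sequence(joints, target_len=32):
    """Resample a skeleton sequence to a fixed number of frames.

    joints: array of shape (T, J, 3) with a varying number of frames T.
    target_len: fixed frame count; the value 32 is an assumption for this sketch.
    """
    T = joints.shape[0]
    src = np.linspace(0.0, T - 1.0, num=target_len)   # fractional source frame indices
    lo = np.floor(src).astype(int)
    hi = np.minimum(lo + 1, T - 1)
    w = (src - lo)[:, None, None]                     # interpolation weights per target frame
    # Linear interpolation between the two nearest source frames.
    return (1.0 - w) * joints[lo] + w * joints[hi]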