Hierarchical joint contrastive learning with knowledge distillation for self-supervised 3D skeleton-based action recognition
- Research Article
165
- 10.1016/j.jvcir.2021.103055
- Mar 1, 2021
- Journal of Visual Communication and Image Representation
Human pose estimation and its application to action recognition: A survey
- Research Article
125
- 10.34133/cbsystems.0100
- Jan 1, 2024
- Cyborg and Bionic Systems
Three-dimensional skeleton-based action recognition (3D SAR) has gained significant attention within the computer vision community, owing to the inherent advantages offered by skeleton data. As a result, a plethora of impressive works, including those based on conventional handcrafted features and learned feature extraction methods, have been conducted over the years. However, prior surveys on action recognition have primarily focused on video or red-green-blue (RGB) data-dominated approaches, with limited coverage of skeleton data. Furthermore, despite the extensive application of deep learning methods in this field, there has been a notable absence of research that provides an introductory or comprehensive review from the perspective of deep learning architectures. To address these limitations, this survey first underscores the importance of action recognition and emphasizes the significance of 3D skeleton data as a valuable modality. Subsequently, we provide a comprehensive introduction to mainstream action recognition techniques based on 4 fundamental deep architectures, i.e., recurrent neural networks, convolutional neural networks, graph convolutional networks, and Transformers. All methods with the corresponding architectures are then presented in a data-driven manner with detailed discussion. Finally, we offer insights into the current largest 3D skeleton dataset, NTU-RGB+D, and its new edition, NTU-RGB+D 120, along with an overview of several top-performing algorithms on these datasets. To the best of our knowledge, this research represents the first comprehensive discussion of deep learning-based action recognition using 3D skeleton data.
- Research Article
234
- 10.1109/tpami.2021.3053765
- Jan 22, 2021
- IEEE Transactions on Pattern Analysis and Machine Intelligence
3D skeleton-based action recognition and motion prediction are two essential problems of human activity understanding. In many previous works: 1) they studied two tasks separately, neglecting internal correlations; and 2) they did not capture sufficient relations inside the body. To address these issues, we propose a symbiotic model to handle two tasks jointly; and we propose two scales of graphs to explicitly capture relations among body-joints and body-parts. Together, we propose symbiotic graph neural networks, which contain a backbone, an action-recognition head, and a motion-prediction head. Two heads are trained jointly and enhance each other. For the backbone, we propose multi-branch multiscale graph convolution networks to extract spatial and temporal features. The multiscale graph convolution networks are based on joint-scale and part-scale graphs. The joint-scale graphs contain actional graphs, capturing action-based relations, and structural graphs, capturing physical constraints. The part-scale graphs integrate body-joints to form specific parts, representing high-level relations. Moreover, dual bone-based graphs and networks are proposed to learn complementary features. We conduct extensive experiments for skeleton-based action recognition and motion prediction with four datasets, NTU-RGB+D, Kinetics, Human3.6M, and CMU Mocap. Experiments show that our symbiotic graph neural networks achieve better performances on both tasks compared to the state-of-the-art methods.
- Research Article
31
- 10.1016/j.neucom.2022.10.016
- Oct 7, 2022
- Neurocomputing
AFE-CNN: 3D Skeleton-based Action Recognition with Action Feature Enhancement
- Research Article
2
- 10.1016/j.eswa.2025.128043
- Aug 1, 2025
- Expert Systems with Applications
• Novel study to classify simulated Movements of Interest (MOIs) by action recognition.
• Acquired a novel 7-class simulated seizure MOI dataset acted by 8 epileptologists.
• Image-based vs. skeleton-based action recognition are compared for MOI classification.
• Highlights benefits of skeleton-based action recognition with transfer learning.
• Future work should integrate skeleton-based methods with hand gesture recognition.
Epileptic seizure classification based on seizure semiology requires automated, quantitative approaches to support the diagnosis of epilepsy, which affects 1% of the world’s population. Current approaches address the problem on a seizure level, neglecting the detailed evaluation of the classification of the underlying action features, also known as Movements of Interest (MOIs), which are critical for epileptologists in determining their classifications. Moreover, it hinders objective comparison of these approaches and attribution of performance differences due to datasets, intra-dataset MOI distribution, or architecture variations. Objective evaluation of action recognition techniques is crucial, with MOIs serving as foundational elements of semiology for clinical in-bed applications to facilitate epileptic seizure classification. However, until now, there were no MOI datasets available nor benchmarks comparing different action recognition approaches for this clinical problem. Therefore, as a pilot, we introduced a novel, simulated seizure semiology dataset carried out by 8 experienced epileptologists in an EMU bed, consisting of 7 MOI classes. We compare several computer vision methods for MOI classification, two image-based (I3D and Uniformerv2), and two skeleton-based (ST-GCN++ and PoseC3D) action recognition approaches.
This study emphasizes the advantages of a 2-stage skeleton-based action recognition approach in a transfer learning setting (4 classes) and the multi-scale challenge of MOI classification (7 classes), advocating for the integration of skeleton-based methods with hand gesture recognition technologies in the future. The study’s controlled MOI simulation dataset provides us with the opportunity to advance the development of automated epileptic seizure classification systems, paving the way for enhancing their performance and having the potential to contribute to improved patient care.
- Conference Article
15
- 10.1109/cvprw56347.2022.00460
- Jun 1, 2022
In this work, we study self-supervised representation learning for 3D skeleton-based action recognition. We extend Bootstrap Your Own Latent (BYOL) for representation learning on skeleton sequence data and propose a new data augmentation strategy including two asymmetric transformation pipelines. We also introduce a multi-viewpoint sampling method that leverages multiple viewing angles of the same action captured by different cameras. In the semi-supervised setting, we show that the performance can be further improved by knowledge distillation from wider networks, leveraging once more the unlabeled samples. We conduct extensive experiments on the NTU-60, NTU-120 and PKU-MMD datasets to demonstrate the performance of our proposed method. Our method consistently outperforms the current state of the art on linear evaluation, semi-supervised and transfer learning benchmarks.
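The two core ingredients of BYOL mentioned in the abstract above can be sketched as follows. This is a minimal illustration of the general BYOL recipe (a normalized regression loss between the online prediction and the target projection, plus an exponential-moving-average target update), not the authors' skeleton-specific pipeline. The function names and the decay value `tau=0.99` are assumptions chosen for illustration.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length (assumes it is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def byol_loss(online_prediction, target_projection):
    """Squared distance between unit vectors, equal to 2 - 2 * cosine similarity."""
    q = l2_normalize(online_prediction)
    z = l2_normalize(target_projection)
    cosine = sum(a * b for a, b in zip(q, z))
    return 2.0 - 2.0 * cosine


def ema_update(target_params, online_params, tau=0.99):
    """Move each target parameter slowly toward the online parameter.

    tau is a hypothetical decay rate; the target network receives no gradients.
    """
    return [tau * t + (1.0 - tau) * o for t, o in zip(target_params, online_params)]
```

In a skeleton setting, the two inputs to `byol_loss` would come from two differently augmented views of the same sequence, which is where the asymmetric transformation pipelines described above would enter.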
- Research Article
26
- 10.1016/j.neunet.2023.07.051
- Aug 22, 2023
- Neural Networks
Glimpse and focus: Global and local-scale graph convolution network for skeleton-based action recognition
- Conference Article
19
- 10.1109/mapr.2019.8743545
- May 1, 2019
Activity recognition based on skeletons has drawn a lot of attention due to its wide applications in human-computer interaction and surveillance systems. Compared with image data, a skeleton is robust to background changes and computationally efficient due to its low-dimensional representation. With the rise of deep neural networks, many works have applied both CNN and LSTM networks to this problem. In this paper, we propose a framework for action recognition using skeleton data and evaluate it with different network architectures. We first modify the feature representation by adding motion information to a skeleton image, which gives useful information to the networks. After that, different network architectures are employed and evaluated to give insight into how well they perform on this kind of data. Finally, we evaluate the system on two public datasets, NTU-RGB+D and CMDFall, to show the efficiency and feasibility of the system. The proposed method achieves 76.8% and 45.23% on NTU-RGB+D and CMDFall, respectively, which are competitive results.
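One plausible reading of "adding motion information to a skeleton image" is to append frame-to-frame joint displacements as extra channels. The abstract does not specify the exact encoding, so the sketch below is an assumption: the function name `add_motion_channels` and the choice of zero motion for the first frame are hypothetical.

```python
def add_motion_channels(sequence):
    """Append per-joint displacement (current minus previous frame) to each joint.

    sequence: list of frames, each frame a list of (x, y, z) joint tuples.
    Returns frames whose joints are (x, y, z, dx, dy, dz).
    Assumption: the first frame has zero motion.
    """
    augmented = []
    for t, frame in enumerate(sequence):
        previous = sequence[t - 1] if t > 0 else frame
        row = []
        for joint, previous_joint in zip(frame, previous):
            motion = tuple(c - p for c, p in zip(joint, previous_joint))
            row.append(tuple(joint) + motion)
        augmented.append(row)
    return augmented
```

The result keeps the frames-by-joints layout, so it can still be treated as an image with six channels instead of three.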
- Conference Article
17
- 10.1109/icpr.2016.7899764
- Dec 1, 2016
Action recognition based on human skeleton structure is nowadays a thriving research field. This is mainly due to the recent advances in terms of capture technologies and skeleton extraction algorithms. In this context, we observed that 3D skeleton-based actions share several properties with handwritten symbols since they both result from a human performance. We accordingly hypothesize that the action recognition problem can take advantage of trial and error already carried out on handwritten patterns. Therefore, inspired by one of the most efficient and compact handwriting feature sets, we propose in this paper a skeleton descriptor referred to as Handwriting-Inspired Features (HIF3D). First, data preprocessing is applied to joint trajectories in order to handle the variability among actors' morphologies. Then we extract the HIF3D features from the processed joint locations according to a time partitioning scheme so as to additionally encode the temporal information over the sequence. Finally, we select the Support Vector Machine (SVM) for the classification step. Evaluations conducted on two challenging datasets, namely HDM05 and UTKinect, attest to the soundness of our approach, as the obtained results outperform the state-of-the-art algorithms that rely on skeleton data.
- Research Article
10
- 10.1016/j.sigpro.2024.109486
- Mar 30, 2024
- Signal Processing
Enhancing action recognition from low-quality skeleton data via part-level knowledge distillation
- Dissertation
- 10.32657/10356/138384
- Jan 1, 2020
Understanding human behavior from videos is a very important task in computer vision community. It is a significant sub-branch of video analysis. Human behavior analysis is widely applied in many application scenarios like human-computer interaction, video surveillance, video retrieval and autonomous driving. Thanks to the development of commodity depth cameras, skeleton-based human behavior analysis has drawn considerable attention in the computer vision community recently. Skeleton action sequences are extracted from depth camera based on pose estimation algorithms or directly detected from motion capture devices. Compared with RGB-based action sequences, skeleton-based action instances are more simplified and semantic. However, the limitation is that there is less appearance and few scene information provided in skeletal data. How to design suitable and stable models to understand human behavior, both body action and hand gesture, using skeletal data is an interesting and challenging topic. To well understand human behavior through action sequence, two tasks are very important, namely the action recognition and the action prediction. In this thesis, four different models are proposed to handle these distinctive tasks.
Due to the success of deep learning models in image recognition, most of the state-of-the-arts choose to utilize deep learning as the tool for skeleton-based action recognition. However, compared with images and videos which are composed of millions or billions of pixels, the skeleton is composed by only tens of joints which is thus of much less complexity than images and videos. For such a light-weight data, non-parametric models like Naive-Bayes Nearest Neighbor (NBNN) may be more suitable than the deep learning models with high complexity. In the first two works of this thesis, two robust NBNN-based models, ST-NBNN and ST-NBMIM, are proposed to characterize skeleton sequences.
Besides, to better understand skeleton-based actions, the bilinear classifiers are adopted to identify both key temporal stages as well as spatial joints for action classification. Although only using a linear classifier, experiments on five benchmark datasets show that by combining the strength of both non-parametric and parametric models, ST-NBNN and ST-NBMIM can achieve competitive performance compared with state-of-the-art results using sophisticated models such as deep learning. Moreover, by identifying key skeleton joints and temporal stages for each action class, the two NBNN-based models can capture the essential spatio-temporal patterns that play key roles of recognizing actions, which is not always achievable by using end-to-end models.
When facing the large-scale skeleton data, the non-parametric model reaches its limitation, and the deep-learning-based models demonstrate their superior performance on dataset with large size. Meanwhile, human body movements exhibit spatial patterns among pose joints. It is thus of great importance to identify those motion patterns and avoid the non-informative joints, via identifying the key combinations of joints that matter for the recognition. Although key spatio-temporal patterns discovery has been explored previously for skeleton-based action recognition, the temporal dynamics modeling of key joint combinations is not well researched in the community. In the third work of this thesis, a CNN model is proposed to adaptively search key pose joints for each action sequence. The work utilizes the deep-learning technique to train a deformable CNN model to discover sample-related key spatio-temporal patterns for action recognition. This deformable convolution better utilizes the contextual joints for action and gesture recognition and is more robust to noisy joints.
The proposed model is evaluated on three benchmark datasets and the experimental results show the effectiveness of introducing temporal dynamics modeling of key joint combinations into the skeleton-based action recognition.
The goal of early action recognition is to predict action label when the sequence is partially observed. The existing methods treat the early action recognition task as sequential classification problems on different observation ratios of an action sequence. Since these models are trained by differentiating positive categories from all negative classes, the diverse information of different negative categories is ignored, which we believe can be collected to help improve the recognition performance. In the last work of this thesis, a new direction, introducing category exclusion to early action recognition, is explored. The category exclusion is modeled as a mask operation on the classification probability output of a pre-trained early action recognition classifier. Specifically, policy-based reinforcement learning is utilized to train an agent. The agent generates a series of binary masks to exclude interfering negative categories during action execution and hence help improve the recognition accuracy. The proposed method is evaluated on three benchmark recognition datasets, and it enhances the recognition accuracy consistently over all different observation ratios on the three datasets, where the accuracy improvements on the early stages are especially significant.
In summary, this thesis demonstrates the superior performance of the proposed four methods, including the ST-NBNN, ST-NBMIM, Deformable Pose Traversal Convolution, and the Category Exclusion Agent, for the tasks of action recognition and action prediction of skeleton-based sequences. These four models are extensively evaluated on well-known benchmark datasets and the experimental results show the effectiveness of these models on their corresponding tasks.
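The category-exclusion step described above, a binary mask applied to the classifier's probability output, can be illustrated with a short sketch. The abstract does not say how the masked probabilities are post-processed, so the renormalization and the all-zero fallback below are assumptions. The function name `apply_exclusion_mask` is hypothetical, and the reinforcement-learning agent that would produce the mask is not modeled.

```python
def apply_exclusion_mask(probabilities, mask):
    """Zero out excluded categories and renormalize the rest.

    probabilities: class probabilities from a pre-trained classifier.
    mask: binary list, 1 = keep the category, 0 = exclude it.
    Assumption: if every category is excluded, the input is returned unchanged.
    """
    kept = [p * m for p, m in zip(probabilities, mask)]
    total = sum(kept)
    if total == 0:
        return list(probabilities)
    return [k / total for k in kept]


def predict(probabilities, mask):
    """Return the index of the most probable category after exclusion."""
    masked = apply_exclusion_mask(probabilities, mask)
    return max(range(len(masked)), key=masked.__getitem__)
```

For example, with probabilities `[0.5, 0.3, 0.2]` and a mask that excludes the first category, the prediction moves from class 0 to class 1, which is the intended effect when class 0 is an interfering negative.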
- Research Article
41
- 10.1007/s11042-018-5642-0
- Feb 6, 2018
- Multimedia Tools and Applications
In this paper, we present an image classification approach to action recognition with 3D skeleton videos. First, we propose a video domain translation-scale invariant image mapping, which transforms the 3D skeleton videos to color images, namely skeleton images. Second, a multi-scale dilated convolutional neural network (CNN) is designed for the classification of the skeleton images. Our multi-scale dilated CNN model could effectively improve the frequency adaptiveness and exploit the discriminative temporal-spatial cues for the skeleton images. Even though the skeleton images are very different from natural images, we show that the fine-tuning strategy still works well. Furthermore, we propose different kinds of data augmentation strategies to improve the generalization and robustness of our method. Experimental results on popular benchmark datasets such as NTU RGB + D, UTD-MHAD, MSRC-12 and G3D demonstrate the superiority of our approach, which outperforms the state-of-the-art methods by a large margin.
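A translation- and scale-invariant mapping from a skeleton sequence to a color image can be sketched as below. This is one way to obtain the stated invariances (subtract the per-axis minimum, divide by a single shared range, map x, y, z to R, G, B) and is an assumption, not necessarily the mapping used in the paper. The function name `skeleton_to_image` is hypothetical.

```python
def skeleton_to_image(sequence):
    """Map a skeleton sequence to a frames-by-joints grid of RGB pixels.

    sequence: list of frames, each frame a list of (x, y, z) joint tuples.
    Subtracting the per-axis minimum removes translation; dividing by the
    largest axis range removes scale while preserving aspect ratio.
    """
    mins = [min(joint[a] for frame in sequence for joint in frame) for a in range(3)]
    maxs = [max(joint[a] for frame in sequence for joint in frame) for a in range(3)]
    # Fall back to 1.0 for a degenerate sequence with no spatial extent.
    scale = max(hi - lo for lo, hi in zip(mins, maxs)) or 1.0
    return [
        [
            tuple(round(255 * (joint[a] - mins[a]) / scale) for a in range(3))
            for joint in frame
        ]
        for frame in sequence
    ]
```

Shifting every joint by a constant or multiplying all coordinates by a positive factor leaves the resulting image unchanged, up to floating-point rounding, which is the property a CNN classifier benefits from here.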
- Research Article
3
- 10.1109/tcsvt.2024.3399126
- Oct 1, 2024
- IEEE Transactions on Circuits and Systems for Video Technology
Skeleton-based action recognition has broad prospects owing to the fact that skeleton data is more robust to scene noise and camera view changes. Recently, researchers have mainly explored deep-learning feature engineering with competitive recognition accuracy for skeleton actions. However, a high-performance recognition network is usually built by stacking complex feature extraction modules, which introduces massive computational costs. In this work, we designed a powerful and universal action knowledge distillation paradigm based on decoupled knowledge distillation for transferring action knowledge from heavy teachers to lightweight students more robustly. We constructed a network architecture space consisting of shrunken versions of the outdated 2s-AGCN and searched for several robust students. On this basis, this paradigm is further developed into a powerful decoupled knowledge embedded graph convolutional network (DKE-GCN), which outperforms the teacher significantly on three public datasets and achieves state-of-the-art results. In addition, a light-DKE-GCN is designed to achieve performance comparable to the teacher with 16× fewer parameters, 26× fewer FLOPs, and 8× the FPS.
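Decoupled knowledge distillation, as generally formulated, splits the classical distillation loss into a target-class term (TCKD) and a non-target-class term (NCKD) and weights them independently. The sketch below follows that general formulation, not the specifics of DKE-GCN, which the abstract does not give. The weights `alpha` and `beta`, the temperature default, and all function names are assumptions, and the usual temperature-squared scaling is omitted for brevity.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q):
    """KL divergence KL(p || q) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def dkd_terms(teacher_logits, student_logits, target, temperature=1.0):
    """Return (TCKD, NCKD, teacher probability of the target class)."""
    pt = softmax(teacher_logits, temperature)
    ps = softmax(student_logits, temperature)
    # Binary distributions: target class versus all other classes.
    bt = [pt[target], 1.0 - pt[target]]
    bs = [ps[target], 1.0 - ps[target]]
    tckd = kl(bt, bs)
    # Distributions restricted to the non-target classes.
    nt = [p / bt[1] for i, p in enumerate(pt) if i != target]
    ns = [p / bs[1] for i, p in enumerate(ps) if i != target]
    nckd = kl(nt, ns)
    return tckd, nckd, pt[target]


def dkd_loss(teacher_logits, student_logits, target, alpha=1.0, beta=1.0, temperature=1.0):
    """Weighted sum of the two decoupled terms (alpha and beta are hypothetical)."""
    tckd, nckd, _ = dkd_terms(teacher_logits, student_logits, target, temperature)
    return alpha * tckd + beta * nckd
```

The classical distillation loss equals TCKD plus NCKD scaled by one minus the teacher's target probability, so a confident teacher suppresses the non-target term. Decoupling removes that coupling, which is the motivation usually given for the technique.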
- Research Article
5
- 10.1155/2021/2290304
- Nov 1, 2021
- Security and Communication Networks
In skeleton-based human action recognition methods, human behaviours can be analysed through temporal and spatial changes in the human skeleton. Skeletons are not limited by clothing changes, lighting conditions, or complex backgrounds. This recognition method is robust and has aroused great interest; however, many existing studies used deep-layer networks with large numbers of required parameters to improve the model performance and thus lost the low-computation advantage of skeleton data. It is difficult to deploy previously established models to real-life applications based on low-cost embedded devices. To obtain a model with fewer parameters and a higher accuracy, this study designed a lightweight frame-level joints adaptive graph convolutional network (FLAGCN) model to solve skeleton-based action recognition tasks. Compared with the classical 2s-AGCN model, the new model obtained a higher precision with 1/8 of the parameters and 1/9 of the floating-point operations (FLOPs). Our proposed network features three main improvements. First, a previous feature-fusion method replaces the multistream network and reduces the number of required parameters. Second, at the spatial level, two kinds of graph convolution methods capture different aspects of human action information. A frame-level graph convolution constructs a human topological structure for each data frame, whereas an adjacency graph convolution captures the characteristics of the adjacent joints. Third, the model proposed in this study hierarchically extracts different levels of action sequence features, making the model clear and easy to understand; further, it reduces the depth of the model and the number of parameters. A large number of experiments on the NTU RGB + D 60 and 120 data sets show that this method has the advantages of few required parameters, low computational costs, and fast speeds.
It also has a simple structure and training process that make it easy to deploy in real-time recognition systems based on low-cost embedded devices.
- Research Article
2
- 10.3390/s22239249
- Nov 28, 2022
- Sensors (Basel, Switzerland)
Skeleton-based action recognition can achieve relatively high performance by transforming the human skeleton structure in an image into a graph and recognizing actions based on structural changes in the body. Among the many graph convolutional network (GCN) approaches used in skeleton-based action recognition, semantic-guided neural networks (SGNs) are fast action recognition algorithms that hierarchically learn spatial and temporal features by applying a GCN. However, because the structure of an SGN leads it to focus on global rather than local feature learning, it is limited in recognizing actions in which the dependency between neighbouring nodes is important. To solve these problems and simultaneously achieve real-time action recognition on low-end devices, in this study, a single head attention (SHA) that can overcome the limitations of an SGN is proposed, and a new SGN-SHA model that combines SHA with an SGN is presented. In experiments on various action recognition benchmark datasets, the proposed SGN-SHA model significantly reduced the computational complexity while exhibiting a performance similar to that of an existing SGN and other state-of-the-art methods.
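As a rough illustration of what a single attention head computes over skeleton joints, the sketch below applies scaled dot-product self-attention to per-joint feature vectors. It uses the features directly as queries, keys, and values with no learned projections, which is a simplifying assumption. The actual SHA module in SGN-SHA is not described in the abstract and may differ, and the function name `single_head_attention` is hypothetical.

```python
import math


def single_head_attention(features):
    """Scaled dot-product self-attention with one head.

    features: list of per-joint feature vectors of equal length.
    Assumption: queries, keys, and values are the raw features.
    Each output row is a softmax-weighted average of all joint features.
    """
    dim = len(features[0])
    outputs = []
    for query in features:
        scores = [
            sum(a * b for a, b in zip(query, key)) / math.sqrt(dim)
            for key in features
        ]
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        outputs.append(
            [sum(w * value[c] for w, value in zip(weights, features)) for c in range(dim)]
        )
    return outputs
```

Because each joint attends to every other joint with data-dependent weights, the dependency between neighbouring nodes can be weighted more heavily than a fixed global aggregation would allow.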