ASR-GCN: Adaptive spatial information reconstruction GCN for skeleton-based action recognition.
- Research Article
2
- 10.1016/j.eswa.2025.128043
- Aug 1, 2025
- Expert Systems with Applications
• Novel study to classify simulated Movements of Interest (MOIs) by action recognition.
• Acquired a novel 7-class simulated seizure MOI dataset acted by 8 epileptologists.
• Image-based vs. skeleton-based action recognition are compared for MOI classification.
• Highlights benefits of skeleton-based action recognition with transfer learning.
• Future work should integrate skeleton-based methods with hand gesture recognition.
Epileptic seizure classification based on seizure semiology requires automated, quantitative approaches to support the diagnosis of epilepsy, which affects 1% of the world’s population. Current approaches address the problem on a seizure level, neglecting the detailed evaluation of the classification of the underlying action features, also known as Movements of Interest (MOIs), which are critical for epileptologists in determining their classifications. Moreover, this hinders objective comparison of these approaches and attribution of performance differences to datasets, intra-dataset MOI distribution, or architecture variations. Objective evaluation of action recognition techniques is crucial, with MOIs serving as foundational elements of semiology for clinical in-bed applications to facilitate epileptic seizure classification. However, until now, there were neither MOI datasets available nor benchmarks comparing different action recognition approaches for this clinical problem. Therefore, as a pilot, we introduced a novel, simulated seizure semiology dataset carried out by 8 experienced epileptologists in an EMU bed, consisting of 7 MOI classes. We compare several computer vision methods for MOI classification: two image-based (I3D and UniFormerV2) and two skeleton-based (ST-GCN++ and PoseC3D) action recognition approaches.
This study emphasizes the advantages of a 2-stage skeleton-based action recognition approach in a transfer learning setting (4 classes) and the multi-scale challenge of MOI classification (7 classes), advocating for the integration of skeleton-based methods with hand gesture recognition technologies in the future. The study’s controlled MOI simulation dataset provides us with the opportunity to advance the development of automated epileptic seizure classification systems, paving the way for enhancing their performance and having the potential to contribute to improved patient care.
- Research Article
2
- 10.3390/s22239249
- Nov 28, 2022
- Sensors (Basel, Switzerland)
Skeleton-based action recognition can achieve a relatively high performance by transforming the human skeleton structure in an image into a graph and applying action recognition based on structural changes in the body. Among the many graph convolutional network (GCN) approaches used in skeleton-based action recognition, semantic-guided neural networks (SGNs) are fast action recognition algorithms that hierarchically learn spatial and temporal features by applying a GCN. However, because an SGN focuses on global feature learning rather than local feature learning owing to the structural characteristics, there is a limit to an action recognition in which the dependency between neighbouring nodes is important. To solve these problems and simultaneously achieve a real-time action recognition in low-end devices, in this study, a single head attention (SHA) that can overcome the limitations of an SGN is proposed, and a new SGN-SHA model that combines SHA with an SGN is presented. In experiments on various action recognition benchmark datasets, the proposed SGN-SHA model significantly reduced the computational complexity while exhibiting a performance similar to that of an existing SGN and other state-of-the-art methods.
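The single head attention idea mentioned in the abstract above can be illustrated with a generic scaled dot-product self-attention over joint features. This is a minimal sketch, not the SGN-SHA design: it assumes queries, keys and values are all the raw joint features (no learned projections), and the function names and toy values are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def single_head_attention(x):
    """Generic single-head self-attention with Q = K = V = x (assumption).

    x is a list of per-joint feature vectors. Returns the attended features
    and the joint-to-joint attention weights.
    """
    d = len(x[0])
    scale = 1.0 / math.sqrt(d)
    weights = []
    for q in x:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in x]
        weights.append(softmax(scores))
    out = [
        [sum(w[j] * x[j][c] for j in range(len(x))) for c in range(d)]
        for w in weights
    ]
    return out, weights


# Three joints with 2-D features (made-up values).
joints = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended, attn = single_head_attention(joints)
```

Each row of `attn` sums to 1, so every output joint feature is a weighted average of all joint features, which is how attention lets a joint draw on joints it is not physically connected to.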
- Research Article
7
- 10.1016/j.neucom.2024.127495
- Mar 6, 2024
- Neurocomputing
Representation modeling learning with multi-domain decoupling for unsupervised skeleton-based action recognition
- Research Article
1
- 10.1002/cav.2193
- Jul 1, 2023
- Computer Animation and Virtual Worlds
Skeleton-based action recognition has been continuously and intensively studied. However, dynamic 3D skeleton data are difficult to adopt widely in practical applications because of restrictive data acquisition conditions. Although action recognition based on 2D pose information extracted from RGB video can effectively avoid the influence of complex backgrounds, it is susceptible to factors such as video jitter and joint overlap. To reduce the interference of these factors, we use two-dimensional skeletal joint coordinates to represent the changes in human body posture. First, we use a target detector and a pose estimation algorithm to obtain the joint coordinates of each frame from RGB video. A feature extraction network then performs multi-level feature learning to establish the correspondence between actions and their multi-level features. Finally, a hierarchical attention mechanism is introduced, yielding a model named CHAN, which computes the association between elements to redistribute the weights used for action classification. Extensive experiments on three datasets demonstrate the effectiveness of our proposed method.
- Dissertation
- 10.32657/10356/138384
- Jan 1, 2020
Understanding human behavior from videos is a very important task in the computer vision community and a significant sub-branch of video analysis. Human behavior analysis is widely applied in many application scenarios, such as human-computer interaction, video surveillance, video retrieval and autonomous driving. Thanks to the development of commodity depth cameras, skeleton-based human behavior analysis has recently drawn considerable attention in the computer vision community. Skeleton action sequences are extracted from depth cameras using pose estimation algorithms or detected directly by motion capture devices. Compared with RGB-based action sequences, skeleton-based action instances are simpler and more semantic. However, the limitation is that skeletal data provide little appearance or scene information. How to design suitable and stable models to understand human behavior, both body action and hand gesture, using skeletal data is an interesting and challenging topic. To understand human behavior through action sequences, two tasks are very important: action recognition and action prediction. In this thesis, four different models are proposed to handle these distinct tasks.
Due to the success of deep learning models in image recognition, most state-of-the-art methods use deep learning for skeleton-based action recognition. However, compared with images and videos, which are composed of millions or billions of pixels, a skeleton is composed of only tens of joints and is thus much less complex. For such lightweight data, non-parametric models like Naive-Bayes Nearest Neighbor (NBNN) may be more suitable than high-complexity deep learning models. In the first two works of this thesis, two robust NBNN-based models, ST-NBNN and ST-NBMIM, are proposed to characterize skeleton sequences. In addition, to better understand skeleton-based actions, bilinear classifiers are adopted to identify both the key temporal stages and the spatial joints for action classification. Although they use only a linear classifier, experiments on five benchmark datasets show that, by combining the strengths of non-parametric and parametric models, ST-NBNN and ST-NBMIM achieve performance competitive with state-of-the-art results from sophisticated models such as deep networks. Moreover, by identifying key skeleton joints and temporal stages for each action class, the two NBNN-based models capture the essential spatio-temporal patterns that play key roles in recognizing actions, which is not always achievable with end-to-end models.
On large-scale skeleton data, the non-parametric model reaches its limits, and deep-learning-based models demonstrate superior performance. Meanwhile, human body movements exhibit spatial patterns among pose joints. It is thus of great importance to identify those motion patterns and avoid non-informative joints by identifying the key combinations of joints that matter for recognition. Although the discovery of key spatio-temporal patterns has been explored previously for skeleton-based action recognition, modeling the temporal dynamics of key joint combinations is not well researched in the community. In the third work of this thesis, a CNN model is proposed to adaptively search for key pose joints for each action sequence. The work trains a deformable CNN model to discover sample-related key spatio-temporal patterns for action recognition. This deformable convolution makes better use of contextual joints for action and gesture recognition and is more robust to noisy joints. The proposed model is evaluated on three benchmark datasets, and the experimental results show the effectiveness of introducing temporal dynamics modeling of key joint combinations into skeleton-based action recognition.
The goal of early action recognition is to predict the action label when the sequence is only partially observed. Existing methods treat the early action recognition task as a sequential classification problem over different observation ratios of an action sequence. Since these models are trained by differentiating positive categories from all negative classes, the diverse information of different negative categories is ignored, which we believe can be collected to help improve recognition performance. In the last work of this thesis, a new direction, introducing category exclusion to early action recognition, is explored. Category exclusion is modeled as a mask operation on the classification probability output of a pre-trained early action recognition classifier. Specifically, policy-based reinforcement learning is used to train an agent. The agent generates a series of binary masks to exclude interfering negative categories during action execution and hence helps improve recognition accuracy. The proposed method is evaluated on three benchmark recognition datasets, and it improves recognition accuracy consistently across all observation ratios on the three datasets, with especially significant improvements at the early stages.
In summary, this thesis demonstrates the superior performance of the four proposed methods, ST-NBNN, ST-NBMIM, Deformable Pose Traversal Convolution, and the Category Exclusion Agent, for action recognition and action prediction on skeleton-based sequences. These four models are extensively evaluated on well-known benchmark datasets, and the experimental results show their effectiveness on their corresponding tasks.
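The category exclusion step described in this abstract, a binary mask applied to a classifier's probability output, can be sketched as follows. This is only an illustration under assumptions: the mask here is written by hand rather than produced by a reinforcement learning agent, and the renormalization step, the fallback for an all-zero mask, and all names are hypothetical choices not stated in the abstract.

```python
def apply_exclusion_mask(probs, mask):
    """Zero out excluded classes (mask value 0) and renormalize the rest.

    Renormalizing and falling back to the unmasked output when every class
    is excluded are assumptions made for this sketch.
    """
    if len(probs) != len(mask):
        raise ValueError("probs and mask must have the same length")
    kept = [p * m for p, m in zip(probs, mask)]
    total = sum(kept)
    if total == 0.0:
        return list(probs)
    return [k / total for k in kept]


def predict(probs):
    """Return the index of the most probable class."""
    return max(range(len(probs)), key=lambda i: probs[i])


# Classifier output on a partially observed sequence (made-up values).
probs = [0.40, 0.35, 0.25]
# Hand-written mask that excludes class 0 as an interfering negative category.
mask = [0, 1, 1]
masked = apply_exclusion_mask(probs, mask)
```

With these toy values the unmasked prediction is class 0, while the masked prediction moves to class 1 once the interfering category is excluded.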
- Research Article
- 10.54254/2755-2721/50/20241581
- Mar 25, 2024
- Applied and Computational Engineering
Human activity recognition is currently an active research focus. Numerous literature reviews have therefore examined the multifaceted nature of the data, the process of selecting feature vectors, and the advantages and disadvantages of classification networks. Graph Convolutional Networks (GCNs) have demonstrated significant efficacy in the domain of human action recognition. In recent years, with the rapid development of 3D skeleton data collection, a plethora of studies on action recognition based on skeleton data have emerged. Skeleton data consists of the three-dimensional coordinates of multiple skeletal joints over space and time, making it an effective representation of kinematics. It can be easily acquired through low-cost depth sensors or extracted directly from two-dimensional images using video-based pose estimation algorithms, and it has attracted widespread attention. As relational networks continue to evolve, GCNs have been applied to various fields, including human action recognition, where they have shown significant advantages in feature extraction from skeleton data. However, using GCNs alone may have various limitations, so many enhancements to GCNs have emerged in recent years. This review summarizes recent research on improvements to Graph Convolutional Networks for human action recognition. It is intended to help future researchers organize their research ideas quickly and to facilitate new results.
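To make the link between skeleton data and graph convolution concrete, the following minimal sketch applies one widely used graph convolution form, ReLU(D^{-1/2}(A + I)D^{-1/2} X W), to a single frame of joint coordinates. The five-joint skeleton, the coordinates and the weight matrix are made-up illustrative values, and this generic layer is not taken from any specific method covered by the review.

```python
import math

# Hypothetical 5-joint toy skeleton: hip (0), spine (1), head (2),
# with two arms (3, 4) attached to the spine.
EDGES = [(0, 1), (1, 2), (1, 3), (1, 4)]
NUM_JOINTS = 5


def normalized_adjacency(edges, n):
    """Return D^{-1/2} (A + I) D^{-1/2} as a nested list."""
    a = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i, j in edges:
        a[i][j] = a[j][i] = 1.0
    deg = [sum(row) for row in a]
    return [
        [a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
        for i in range(n)
    ]


def matmul(x, y):
    """Plain nested-list matrix product."""
    return [
        [sum(x[i][k] * y[k][j] for k in range(len(y))) for j in range(len(y[0]))]
        for i in range(len(x))
    ]


def graph_conv(a_hat, features, weight):
    """One graph convolution layer: ReLU(A_hat @ X @ W)."""
    out = matmul(matmul(a_hat, features), weight)
    return [[max(0.0, v) for v in row] for row in out]


# One frame: a 3D coordinate per joint (made-up values).
frame = [
    [0.0, 0.0, 0.0],
    [0.0, 0.5, 0.0],
    [0.0, 1.0, 0.0],
    [-0.4, 0.5, 0.1],
    [0.4, 0.5, 0.1],
]
# Maps 3 input channels to 2 output channels (made-up values).
weight = [[1.0, 0.0], [0.0, 1.0], [0.5, -0.5]]

a_hat = normalized_adjacency(EDGES, NUM_JOINTS)
out = graph_conv(a_hat, frame, weight)
```

Each output row mixes a joint's own coordinates with those of its physically connected neighbours, which is the local aggregation that the improvements surveyed in such reviews typically try to extend.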
- Research Article
5
- 10.1155/2021/2290304
- Nov 1, 2021
- Security and Communication Networks
In skeleton-based human action recognition methods, human behaviours can be analysed through temporal and spatial changes in the human skeleton. Skeletons are not limited by clothing changes, lighting conditions, or complex backgrounds. This recognition method is robust and has aroused great interest; however, many existing studies used deep networks with large numbers of parameters to improve model performance and thus lost the low-computation advantage of skeleton data. It is difficult to deploy previously established models in real-life applications based on low-cost embedded devices. To obtain a model with fewer parameters and a higher accuracy, this study designed a lightweight frame-level joints adaptive graph convolutional network (FLAGCN) model to solve skeleton-based action recognition tasks. Compared with the classical 2s-AGCN model, the new model obtained a higher precision with 1/8 of the parameters and 1/9 of the floating-point operations (FLOPs). Our proposed network features three main improvements. First, a previous feature-fusion method replaces the multistream network and reduces the number of required parameters. Second, at the spatial level, two kinds of graph convolution methods capture different aspects of human action information. A frame-level graph convolution constructs a human topological structure for each data frame, whereas an adjacency graph convolution captures the characteristics of the adjacent joints. Third, the proposed model hierarchically extracts different levels of action sequence features, making the model clear and easy to understand; further, it reduces the depth of the model and the number of parameters. A large number of experiments on the NTU RGB + D 60 and 120 data sets show that this method has the advantages of few required parameters, low computational costs, and fast speeds.
It also has a simple structure and training process that make it easy to deploy in real-time recognition systems based on low-cost embedded devices.
- Research Article
179
- 10.1007/s10462-017-9545-7
- Feb 28, 2017
- Artificial Intelligence Review
Suspicious human activity recognition from surveillance video is an active research area of image processing and computer vision. Through visual surveillance, human activities can be monitored in sensitive and public areas such as bus stations, railway stations, airports, banks, shopping malls, schools and colleges, parking lots, roads, etc. to prevent terrorism, theft, accidents and illegal parking, vandalism, fighting, chain snatching, crime and other suspicious activities. It is very difficult to watch public places continuously; therefore, intelligent video surveillance is required that can monitor human activities in real time, categorize them as usual or unusual activities, and generate an alert. The recent decade has witnessed a good number of publications in the field of visual surveillance on recognizing abnormal activities. Furthermore, a few surveys on the recognition of individual abnormal activities can be found in the literature, but none of them has addressed different abnormal activities in a single review. In this paper, we present the state of the art, which demonstrates the overall progress of suspicious activity recognition from surveillance videos in the last decade. We include a brief introduction to suspicious human activity recognition with its issues and challenges. This paper covers six abnormal activities: abandoned object detection, theft detection, fall detection, accidents and illegal parking detection on roads, violence activity detection, and fire detection. In general, we discuss all the steps that have been followed in the literature to recognize human activity from surveillance videos, such as foreground object extraction, object detection based on tracking or non-tracking methods, feature extraction, classification, and activity analysis and recognition.
The objective of this paper is to provide the literature review of six different suspicious activity recognition systems with its general framework to the researchers of this field.
- Research Article
39
- 10.1038/s41598-025-87752-8
- Feb 10, 2025
- Scientific Reports
For the purpose of achieving accurate skeleton-based action recognition, the majority of prior approaches have adopted a serial strategy that combines Graph Convolutional Networks (GCNs) with attention-based methods. However, this approach frequently treats the human skeleton as an isolated and complete structure, neglecting the significance of highly correlated yet indirectly connected skeletal parts, finally hindering recognition accuracy. This study proposes a novel architecture addressing this limitation by implementing a parallel configuration of GCNs and the Transformer model (SA-TDGFormer). This parallel structure integrates the advantages of both the GCN model and the Transformer model, facilitating the extraction of both local and global spatio-temporal features, leading to more accurate motion information encoding and improved recognition performance. The proposed model distinguishes itself through its dual-stream structure: a spatiotemporal GCN stream and a spatiotemporal Transformer stream. The former focuses on capturing the topological structure and motion representations of human skeletons. In contrast, the latter seeks to capture motion representations that consist of global inter-joint relationships. Recognizing the unique feature representations generated by these streams and their limited mutual understanding, the model also incorporates a late fusion strategy to merge the results from the two streams. This fusion allows the spatiotemporal GCN and Transformer streams to complement each other, enriching action features and maximizing information exchange between the two representation types. Empirical validation on three established benchmark datasets, NTU RGB + D 60, NTU RGB + D 120, and Kinetics-Skeleton, substantiates the model’s effectiveness. 
The experimental results indicate that, compared to existing classification frameworks, the method proposed in this paper improves the accuracy of human action recognition by 1–5% (NTU RGB + D 60 dataset). This improvement demonstrates the superior performance of the model in action recognition.
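The late fusion strategy mentioned in this abstract, merging the outputs of a GCN stream and a Transformer stream, can be illustrated with a weighted average of per-stream class probabilities. This is a minimal sketch under assumptions: the abstract does not state the fusion rule, so the weighted-sum form, the `alpha` parameter and the toy logits are all hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion(gcn_logits, transformer_logits, alpha=0.5):
    """Weighted average of the two streams' class probabilities.

    alpha weights the GCN stream and (1 - alpha) the Transformer stream;
    this rule is an assumption, not the paper's stated method.
    """
    p_gcn = softmax(gcn_logits)
    p_tr = softmax(transformer_logits)
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(p_gcn, p_tr)]


# Per-class scores from each stream for one sequence (made-up values).
gcn_logits = [2.0, 1.0, 0.1]
transformer_logits = [0.5, 2.5, 0.3]

fused = late_fusion(gcn_logits, transformer_logits)
predicted = max(range(len(fused)), key=lambda i: fused[i])
```

Because each stream is trained and scored independently before merging, a fusion of this kind lets a confident stream outvote an uncertain one without any change to either network.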
- Research Article
17
- 10.1109/tpami.2024.3466212
- Jan 1, 2025
- IEEE Transactions on Pattern Analysis and Machine Intelligence
Skeleton-based action recognition has made significant advancements recently, with models like InfoGCN showcasing remarkable accuracy. However, these models exhibit a key limitation: they necessitate complete action observation prior to classification, which constrains their applicability in real-time situations such as surveillance and robotic systems. To overcome this barrier, we introduce InfoGCN++, an innovative extension of InfoGCN, explicitly developed for online skeleton-based action recognition. InfoGCN++ augments the abilities of the original InfoGCN model by allowing real-time categorization of action types, independent of the observation sequence's length. It transcends conventional approaches by learning from current and anticipated future movements, thereby creating a more thorough representation of the entire sequence. Our approach to prediction is managed as an extrapolation issue, grounded on observed actions. To enable this, InfoGCN++ incorporates Neural Ordinary Differential Equations, a concept that lets it effectively model the continuous evolution of hidden states. Following rigorous evaluations on three skeleton-based action recognition benchmarks, InfoGCN++ demonstrates exceptional performance in online action recognition. It consistently equals or exceeds existing techniques, highlighting its significant potential to reshape the landscape of real-time action recognition applications. Consequently, this work represents a major leap forward from InfoGCN, pushing the limits of what's possible in online, skeleton-based action recognition.
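The notion of a hidden state evolving continuously under an ordinary differential equation, as in the Neural ODE component mentioned above, can be sketched with fixed-step forward Euler integration. This is only an illustration under assumptions: the dynamics function here is a toy decay rule standing in for a learned network, and the solver choice and all names are hypothetical rather than taken from InfoGCN++.

```python
import math


def euler_integrate(f, h0, t_end, steps):
    """Integrate dh/dt = f(h) from t = 0 to t_end with forward Euler."""
    dt = t_end / steps
    h = list(h0)
    for _ in range(steps):
        dh = f(h)
        h = [hi + dt * di for hi, di in zip(h, dh)]
    return h


def decay(h):
    """Toy dynamics standing in for a learned network: dh/dt = -h."""
    return [-v for v in h]


# Evolve a 2-D hidden state for one time unit (made-up initial values).
h_final = euler_integrate(decay, [1.0, 2.0], t_end=1.0, steps=1000)
```

For this toy rule the exact solution is the initial state scaled by exp(-t), so the result can be checked against `math.exp(-1.0)`; in an online setting the same integration would be run forward from the last observed state to extrapolate future hidden states.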
- Research Article
16
- 10.1016/j.neucom.2022.07.046
- Jul 16, 2022
- Neurocomputing
Adaptive spatiotemporal graph convolutional network with intermediate aggregation of multi-stream skeleton features for action recognition
- Research Article
96
- 10.1016/j.neucom.2023.03.001
- Mar 30, 2023
- Neurocomputing
Transformer for Skeleton-based action recognition: A review of recent advances
- Research Article
80
- 10.1016/j.neucom.2022.09.071
- Sep 20, 2022
- Neurocomputing
Action recognition based on RGB and skeleton data sets: A survey
- Book Chapter
- 10.3233/faia230871
- Nov 30, 2023
- Frontiers in Artificial Intelligence and Applications
Skeleton-based action recognition algorithms have made extensive use of graph topologies to model the connections between human skeletal joints. Constructing graph topology models with greater representational power is key to obtaining powerful feature extractors. However, existing methods are not effective at strongly modelling the associations between physically connected joints in the human skeleton. To address this, we put forward an approach built on a distinct graph topology: a graph convolutional neural network (SGC-GCN) that combines a dynamic, refined graph topology with a static, shared graph topology. The dynamic partial feature extractor (CTR-GC) and the static partial feature extractor (SGC-GC) are used in conjunction, so that static features reinforce weaker dynamic features, yielding a stronger feature aggregation capability and strengthening the correlation modelling between interconnected joints within the human skeletal structure. The combined use of the two also introduces only a small number of additional parameters, ensuring that the refined features are not affected by the noise of statically shared features. Combining SGC-GC with the temporal modelling module results in SGC-GCN, a graph convolutional network with even greater feature aggregation capability. Our network surpasses existing advanced methods on the NTU RGB+D dataset, yielding substantial improvements in action recognition accuracy.
- Research Article
44
- 10.1016/j.patcog.2023.109540
- Mar 21, 2023
- Pattern Recognition
Global spatio-temporal synergistic topology learning for skeleton-based action recognition