TE-STGCN: Topology enhanced spatio-temporal graph convolutional network for skeleton-based action recognition
- Conference Article
11
- 10.1109/icpr48806.2021.9413189
- Jan 10, 2021
Spatio-temporal Graph Convolutional Networks (ST-GCNs) have shown great performance in the context of skeleton-based action recognition. Nevertheless, ST-GCNs use raw skeleton data as vertex features. Such features have low dimensionality and might not be optimal for action discrimination. Moreover, a single layer of temporal convolution is used to model short-term temporal dependencies, but it can be insufficient for capturing long-term ones. In this paper, we extend the Spatio-Temporal Graph Convolutional Network for skeleton-based action recognition by introducing two novel modules, namely, the Graph Vertex Feature Encoder (GVFE) and the Dilated Hierarchical Temporal Convolutional Network (DH-TCN). On the one hand, the GVFE module learns appropriate vertex features for action recognition by encoding raw skeleton data into a new feature space. On the other hand, the DH-TCN module is capable of capturing both short-term and long-term temporal dependencies using a hierarchical dilated convolutional network. Experiments have been conducted on the challenging NTU RGB-D 60, NTU RGB-D 120 and Kinetics datasets. The obtained results show that our method competes with state-of-the-art approaches while using a smaller number of layers and parameters, thus reducing the required training time and memory.
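As a rough illustration of why stacking dilated temporal convolutions reaches long-term dependencies with few layers, the sketch below implements a plain 1D dilated convolution and computes the receptive field of a stack. The kernel size of 3 and the dilation rates 1, 2, 4 are assumptions chosen for illustration, not the configuration reported for DH-TCN, and the function names are hypothetical.

```python
def dilated_conv1d(signal, kernel, dilation):
    """'Valid' 1D dilated cross-correlation over a list of numbers.

    Each output frame t combines signal[t], signal[t + dilation],
    signal[t + 2 * dilation], ... weighted by the kernel taps.
    """
    span = (len(kernel) - 1) * dilation
    return [
        sum(kernel[k] * signal[t + k * dilation] for k in range(len(kernel)))
        for t in range(len(signal) - span)
    ]


def receptive_field(kernel_size, dilations):
    """Number of input frames that influence one output frame of the stack."""
    field = 1
    for d in dilations:
        field += (kernel_size - 1) * d
    return field


# Illustrative settings only (assumed, not taken from the paper).
plain_stack = receptive_field(3, [1, 1, 1])
dilated_stack = receptive_field(3, [1, 2, 4])
```

With the same three layers and the same number of weights per layer, the dilated stack sees more than twice as many frames as the undilated one, which is the general motivation for a hierarchical dilated design.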
- Video Transcripts
- 10.48448/47yg-0867
- Dec 29, 2020
- Underline Science Inc.
- Research Article
2
- 10.1016/j.asoc.2024.111963
- Jul 9, 2024
- Applied Soft Computing
Human movement science-informed multi-task spatio temporal graph convolutional networks for fitness action recognition and evaluation
- Research Article
51
- 10.1109/access.2022.3164711
- Jan 1, 2022
- IEEE Access
Skeleton-based Graph Convolutional Networks (GCNs) for human action and interaction recognition have received considerable attention from researchers due to the compact and view-invariant nature of skeleton data. However, the static skeleton graph topology in conventional GCNs does not reflect the implicit relationships of non-adjacent joints, which contain vital latent information for a skeleton pose in an action sequence. Moreover, the traditional tri-categorical node partitioning strategy discards many of the motion dependencies along the temporal dimension for non-physically connected edges. We propose an extended skeleton graph topology along with an extended partitioning strategy to extract more of the non-adjacent joint relational information for robust discriminative features. The extended skeleton graph represents joints as vertices, and its weighted edges represent intrinsic and extrinsic relationships between physically connected and non-physically connected joints, respectively. Furthermore, the extended partitioning strategy divides the input graph for the GCN into a five-categorical fixed-length tensor to encompass maximal motion dependencies. Finally, the extended skeleton graph and partitioning strategy are realized by adopting the Spatio-Temporal Graph Convolutional Network (ST-GCN). Experiments carried out on three large-scale datasets, NTU-RGB+D, NTU-RGB+D 120 and Kinetics-Skeleton, show improved performance over conventional state-of-the-art ST-GCNs.
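The idea of a skeleton graph with weighted edges for both physically connected and non-physically connected joints can be sketched as a weighted adjacency matrix. In the minimal example below, the weight 0.5 for non-adjacent joint pairs, the toy three-joint skeleton, and the symmetric degree normalization (a common GCN convention) are all assumptions; the abstract does not state how the weights or normalization are chosen.

```python
import math


def build_adjacency(num_joints, bone_edges, extra_edges, extra_weight=0.5):
    """Weighted adjacency with self-loops.

    bone_edges:  physically connected joint pairs (weight 1.0).
    extra_edges: non-physically connected pairs (assumed weight extra_weight).
    """
    A = [[0.0] * num_joints for _ in range(num_joints)]
    for i in range(num_joints):
        A[i][i] = 1.0
    for i, j in bone_edges:
        A[i][j] = A[j][i] = 1.0
    for i, j in extra_edges:
        A[i][j] = A[j][i] = extra_weight
    return A


def normalize(A):
    """Symmetric normalization D^(-1/2) A D^(-1/2), with D the weighted degree."""
    deg = [sum(row) for row in A]
    n = len(A)
    return [[A[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


# Toy chain 0-1-2 with one extra link between the two end joints.
A = build_adjacency(3, bone_edges=[(0, 1), (1, 2)], extra_edges=[(0, 2)])
A_hat = normalize(A)
```

The normalized matrix `A_hat` is what a graph convolution layer would multiply the joint features by; the extra edge lets information pass directly between joints 0 and 2 instead of only through joint 1.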
- Research Article
39
- 10.1038/s41598-025-87752-8
- Feb 10, 2025
- Scientific Reports
For the purpose of achieving accurate skeleton-based action recognition, the majority of prior approaches have adopted a serial strategy that combines Graph Convolutional Networks (GCNs) with attention-based methods. However, this approach frequently treats the human skeleton as an isolated and complete structure, neglecting the significance of highly correlated yet indirectly connected skeletal parts, ultimately hindering recognition accuracy. This study proposes a novel architecture addressing this limitation by implementing a parallel configuration of GCNs and the Transformer model (SA-TDGFormer). This parallel structure integrates the advantages of both the GCN model and the Transformer model, facilitating the extraction of both local and global spatio-temporal features, leading to more accurate motion information encoding and improved recognition performance. The proposed model distinguishes itself through its dual-stream structure: a spatiotemporal GCN stream and a spatiotemporal Transformer stream. The former focuses on capturing the topological structure and motion representations of human skeletons. In contrast, the latter seeks to capture motion representations that consist of global inter-joint relationships. Recognizing the unique feature representations generated by these streams and their limited mutual understanding, the model also incorporates a late fusion strategy to merge the results from the two streams. This fusion allows the spatiotemporal GCN and Transformer streams to complement each other, enriching action features and maximizing information exchange between the two representation types. Empirical validation on three established benchmark datasets, NTU RGB + D 60, NTU RGB + D 120, and Kinetics-Skeleton, substantiates the model’s effectiveness.
The experimental results indicate that, compared to existing classification frameworks, the method proposed in this paper improves the accuracy of human action recognition by 1–5% (NTU RGB + D 60 dataset). This improvement demonstrates the superior performance of the model in action recognition.
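A late fusion step of the kind described above can be sketched as a weighted average of per-class probabilities from the two streams. This is only one common form of late fusion; the equal stream weights, the softmax step, and the example logits are assumptions, since the abstract does not specify how the two outputs are merged.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion(gcn_logits, transformer_logits, gcn_weight=0.5):
    """Blend the two streams' class probabilities and pick the top class."""
    p_gcn = softmax(gcn_logits)
    p_tr = softmax(transformer_logits)
    fused = [gcn_weight * a + (1.0 - gcn_weight) * b for a, b in zip(p_gcn, p_tr)]
    predicted = max(range(len(fused)), key=fused.__getitem__)
    return fused, predicted


# Made-up scores for three classes: the streams disagree, fusion arbitrates.
fused, predicted = late_fusion([2.0, 0.0, 0.0], [0.0, 0.0, 3.0])
```

Because each stream is trained and scored on its own, this style of fusion needs no shared layers; the trade-off is that the streams only exchange information at the decision level.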
- Conference Article
977
- 10.1109/cvpr.2019.00132
- Jun 1, 2019
Skeleton-based action recognition is an important task that requires an adequate understanding of the movement characteristics of a human action from the given skeleton sequence. Recent studies have shown that exploring spatial and temporal features of the skeleton sequence is vital for this task. Nevertheless, how to effectively extract discriminative spatial and temporal features is still a challenging problem. In this paper, we propose a novel Attention Enhanced Graph Convolutional LSTM Network (AGC-LSTM) for human action recognition from skeleton data. The proposed AGC-LSTM can not only capture discriminative features in spatial configuration and temporal dynamics but also explore the co-occurrence relationship between spatial and temporal domains. We also present a temporal hierarchical architecture to increase the temporal receptive fields of the top AGC-LSTM layer, which boosts the ability to learn the high-level semantic representation and significantly reduces the computation cost. Furthermore, to select discriminative spatial information, the attention mechanism is employed to enhance information of key joints in each AGC-LSTM layer. Experimental results on two datasets are provided: NTU RGB+D dataset and Northwestern-UCLA dataset. The comparison results demonstrate the effectiveness of our approach and show that our approach outperforms the state-of-the-art methods on both datasets.
- Research Article
7
- 10.1007/s00138-023-01386-2
- Apr 5, 2023
- Machine Vision and Applications
Thanks to the development of depth sensors and pose estimation algorithms, skeleton-based action recognition has become prevalent in the computer vision community. Most of the existing works are based on spatio-temporal graph convolutional network frameworks, which learn and treat all spatial or temporal features equally, ignoring the interaction with channel dimension to explore different contributions of different spatio-temporal patterns along the channel direction and thus losing the ability to distinguish confusing actions with subtle differences. In this paper, an interactional channel excitation (ICE) module is proposed to explore discriminative spatio-temporal features of actions by adaptively recalibrating channel-wise pattern maps. More specifically, a channel-wise spatial excitation (CSE) is incorporated to capture the crucial body global structure patterns to excite the spatial-sensitive channels. A channel-wise temporal excitation (CTE) is designed to learn temporal inter-frame dynamics information to excite the temporal-sensitive channels. ICE enhances different backbones as a plug-and-play module. Furthermore, we systematically investigate the strategies of graph topology and argue that complementary information is necessary for sophisticated action description. Finally, together equipped with ICE, an interactional channel excited graph convolutional network with complementary topology (ICE-GCN) is proposed and evaluated on three large-scale datasets, NTU RGB+D 60, NTU RGB+D 120, and Kinetics-Skeleton. Extensive experimental results and ablation studies demonstrate that our method outperforms other SOTAs and proves the effectiveness of individual sub-modules. The code will be published at https://github.com/shuxiwang/ICE-GCN.
- Research Article
10
- 10.1007/s00530-024-01566-8
- Nov 27, 2024
- Multimedia Systems
Due to the emergence of graph convolutional networks (GCNs), skeleton-based action recognition has achieved remarkable results. However, the current models for skeleton-based action analysis treat skeleton sequences as a series of graphs, aggregating features of the entire sequence by alternately extracting spatial and temporal features, i.e., using a 2D (spatial features) plus 1D (temporal features) approach for feature extraction. This overlooks the complex spatiotemporal fusion relationships between joints during motion, making it challenging for models to capture the connections between different temporal frames and joints. In this paper, we propose a Multimodal Graph Self-Attention Network (MGSAN), which combines GCNs with self-attention to model the spatiotemporal relationships between skeleton sequences. Firstly, we design graph self-attention (GSA) blocks to capture the intrinsic topology and long-term temporal dependencies between joints. Secondly, we propose a multi-scale spatio-temporal convolutional network for channel-wise topology modeling (CW-TCN) to model short-term smooth temporal information of joint movements. Finally, we propose a multimodal fusion strategy to fuse joint, joint movement, and bone flow, providing the model with a richer set of multimodal features to make better predictions. The proposed MGSAN achieves state-of-the-art performance on three large-scale skeleton-based action recognition datasets, with accuracy of 93.1% on the NTU RGB+D 60 cross-subject benchmark, 90.3% on the NTU RGB+D 120 cross-subject benchmark, and 97.0% on the NW-UCLA dataset. Code is available at https://github.com/lizaowo/MGSAN.
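The joint, joint movement, and bone inputs mentioned above are typically derived from the same raw coordinates. The sketch below follows a convention that is common in multi-stream skeleton models (bone as child joint minus parent joint, movement as the difference between consecutive frames); whether MGSAN computes its modalities exactly this way is an assumption, and the three-joint skeleton and function names are hypothetical.

```python
def bone_vectors(frame, parents):
    """Bone vector per joint: joint position minus its parent's position.

    frame:   list of (x, y, z) joint coordinates for one frame.
    parents: parents[i] is the parent index of joint i (the root is its own parent).
    """
    return [
        tuple(c - p for c, p in zip(frame[i], frame[parents[i]]))
        for i in range(len(frame))
    ]


def joint_motion(sequence):
    """Per-joint displacement between consecutive frames (one frame shorter)."""
    return [
        [tuple(b - a for a, b in zip(j0, j1)) for j0, j1 in zip(f0, f1)]
        for f0, f1 in zip(sequence, sequence[1:])
    ]


# Toy skeleton: joint 0 is the root, 1 hangs off 0, 2 hangs off 1.
parents = [0, 0, 1]
sequence = [
    [(0, 0, 0), (1, 0, 0), (1, 2, 0)],
    [(0, 0, 0), (1, 1, 0), (1, 3, 1)],
]
bones = [bone_vectors(frame, parents) for frame in sequence]
motion = joint_motion(sequence)
bone_motion = joint_motion(bones)  # frame-to-frame change of the bone vectors
```

Each derived sequence has the same per-joint layout as the raw one, so the same network can consume any of them before the outputs are fused.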
- Conference Article
- 10.1145/3430199.3430213
- Jun 26, 2020
Effectively extracting discriminative spatial and temporal features is important for skeleton-based action recognition. However, current research on skeleton-based action recognition mainly focuses on the natural connections of the skeleton and the original temporal sequences of the skeleton frames, ignoring the relations between non-adjacent but inter-related joints and the varying velocities of action instances. To overcome these limitations and thereby enhance spatial and temporal feature extraction for action recognition, we propose a novel Spatial Attention-Enhanced Multi-Timescale Graph Convolutional Network (SA-MTGCN) for skeleton-based action recognition. Specifically, as the relation of non-adjacent but inter-related joints is beneficial for action recognition, we propose an Attention-Enhanced Spatial Graph Convolutional Network (A-SGCN) to use both the natural connections and the inter-related relations of joints. Furthermore, a Multi-Timescale (MT) structure is proposed to enhance temporal feature extraction by gathering different network layers to model different velocities of action instances. Experimental results on the two widely used NTU and Kinetics datasets demonstrate the effectiveness of our approach.
- Book Chapter
1
- 10.1007/978-981-99-2287-1_100
- Jan 1, 2023
With the rise of health awareness, people's demand for fitness has gradually increased. However, improper exercise may easily cause damage to the body. It would be possible to avoid wrong actions if automatic action recognition could detect and judge human motion during exercise. Therefore, we aim to grasp the user's fitness status through human action recognition. However, most human action recognition methods use CNN-based models to process images, which may introduce unnecessary background noise unrelated to the human body. To address this problem, we use the Spatio-Temporal Graph Convolutional Network (ST-GCN) as the backbone and take skeleton data as input to learn skeleton relationships. To further improve the accuracy, we propose a novel partition strategy based on Five Primary Kinetic Chains (5PKC) to explore the skeleton partition status and then enrich the skeleton relationships. Finally, the proposed method, with 9 ST-GCN blocks integrating the proposed partition strategy, achieved 99.5% accuracy, outperforming the baseline model with 9 ST-GCN blocks, which reached 84.5%.
- Research Article
7
- 10.1016/j.cviu.2024.103936
- Jan 11, 2024
- Computer Vision and Image Understanding
Fourier analysis on robustness of graph convolutional neural networks for skeleton-based action recognition
- Research Article
- 10.1038/s41598-025-26405-2
- Nov 27, 2025
- Scientific Reports
Traditional police combat training relies heavily on subjective evaluation by human instructors, which lacks consistency and comprehensive coverage of complex movement patterns in real-world scenarios. This paper presents an enhanced deep spatio-temporal graph convolutional network (ST-GCN) framework specifically designed for automated police combat action recognition and quality assessment. The proposed method incorporates adaptive graph topology learning mechanisms that dynamically adjust spatial connectivity patterns based on action-specific joint relationships, multi-modal fusion strategies combining skeletal and RGB video data for robust recognition under diverse environmental conditions, and comprehensive quality assessment algorithms providing objective evaluation of technique execution. The enhanced ST-GCN architecture features attention-guided feature extraction, curriculum learning-based training strategies, and real-time processing capabilities suitable for practical deployment in training facilities. Experimental validation on a comprehensive police combat dataset demonstrates superior performance with 96.7% recognition accuracy across twelve action categories and real-time processing at 42.8 frames per second. The multi-dimensional evaluation framework successfully assesses action completion, standardization compliance, and movement fluency, providing immediate feedback for skill development. The proposed system offers significant improvements over conventional approaches, enabling standardized evaluation criteria, data-driven curriculum development, and enhanced training effectiveness for law enforcement personnel.
- Research Article
10
- 10.1109/access.2021.3049808
- Jan 1, 2021
- IEEE Access
In recent years, graph convolutional networks have achieved remarkable performance in skeleton-based action recognition. In these existing works, the features of all nodes in the neighbor set are aggregated into the updated features of the root node, while these features are located in the same feature channel determined by the same 1 × 1 convolution filter. This may not be optimal for capturing the features of spatial dimensions among adjacent vertices effectively. Besides, the effect of feature channels that are independent of the current action on the performance of the model is rarely investigated in existing methods. In this paper, we propose cross-channel graph convolutional networks for skeleton-based action recognition. The feature fusion mechanism in our network is cross-channel, i.e., the updated feature of the root node is derived from different feature channels. Because different feature channels come from different 1 × 1 convolution filters, the cross-channel fusion mechanism significantly improves the ability of the model to capture local features among adjacent vertices. Moreover, by introducing a channel attention mechanism to our model, we suppress the influence of feature channels unrelated to action recognition on model performance, which improves the robustness of the model against the feature channels independent of the current action. Extensive experiments on two large-scale datasets, NTU-RGB+D and Kinetics-Skeleton, demonstrate that the performance of our model exceeds the current mainstream methods.
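To make the notion of suppressing irrelevant feature channels concrete, the sketch below gates each channel with a sigmoid of its average activation, in the spirit of squeeze-and-excitation attention. The abstract does not describe the form of the channel attention used in the paper, so the gating function, the per-channel scalar weights and biases (a simplification of the usual small fully connected layers), and the toy values are all assumptions.

```python
import math


def channel_attention(features, weights, biases):
    """Rescale each channel by a gate in (0, 1) computed from its mean.

    features: list of channels, each a list of per-joint values.
    weights, biases: one scalar per channel (stand-ins for learned parameters).
    Returns (rescaled_features, gates).
    """
    gates = []
    for channel, w, b in zip(features, weights, biases):
        squeezed = sum(channel) / len(channel)      # global average over joints
        gates.append(1.0 / (1.0 + math.exp(-(w * squeezed + b))))
    rescaled = [[g * v for v in channel] for g, channel in zip(gates, features)]
    return rescaled, gates


# Two channels over two joints; the second channel is driven towards zero.
features = [[1.0, 3.0], [2.0, 2.0]]
rescaled, gates = channel_attention(features, weights=[0.0, 0.0], biases=[0.0, -20.0])
```

A strongly negative bias stands in here for a channel the model has learned to ignore: its gate is close to zero, so its features contribute almost nothing downstream.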
- Research Article
142
- 10.1016/j.compag.2019.105087
- Nov 12, 2019
- Computers and Electronics in Agriculture
Recent developments have shown that Deep Learning approaches are well suited for Human Action Recognition. On the other hand, the application of deep learning for action or behaviour recognition in other domains such as animals or livestock is comparatively limited. Action recognition in fish is a particularly challenging task due to specific research challenges such as the lack of distinct poses in fish behavior and the capture of spatio-temporal changes. Action recognition of salmon is valuable for managing and optimizing many aquaculture operations today, such as feeding, one of the most costly operations in aquaculture. Inspired by these application domains and research challenges, we introduce a deep video classification network for action recognition of salmon from underwater videos. We propose a Dual-Stream Recurrent Network (DSRN) to automatically capture the spatio-temporal behavior of salmon during swimming. The DSRN combines the spatial and motion-temporal information through the use of a spatial network, a 3D-convolutional motion network and an LSTM recurrent classification network. The DSRN reaches a prediction accuracy of 80%, which is suitable for industrial use, validated on the task of predicting Feeding and NonFeeding behavior in salmon at a real fish farm during production. Our results show that the DSRN architecture has high potential in feeding action recognition for salmon in aquaculture and for application domains lacking distinct poses and with dynamic spatio-temporal changes.
- Research Article
- 10.1038/s41598-025-34288-6
- Dec 30, 2025
- Scientific Reports
While learnable adjacency matrices have been explored to enhance the flexibility of Spatio-Temporal Graph Convolutional Networks (ST-GCNs) for action recognition, their application in wearable sensor-based systems often overlooks a critical constraint: the need to maintain biomechanically plausible connections while adapting to non-standard sensor placements. To address this, we propose a dynamic topology-adaptive ST-GCN framework that strategically initializes the learnable adjacency matrix with a human skeleton prior. This ensures that the initial graph structure is physiologically meaningful. Subsequently, the model refines this topology through end-to-end training, incorporating L2 regularization and periodic Top-K sparsification to prevent overfitting and maintain a sparse, interpretable structure. This approach allows the model to dynamically correct for structural deviations caused by variations in sensor positioning across individuals, without deviating from realistic body kinematics. Evaluated on data from eight IMU sensors, our method achieves an average recognition accuracy of 94.1 ± 0.6% in cross-user scenarios and 91.5 ± 0.5% in cross-device tests, demonstrating superior robustness for sports posture recognition under non-standardized deployment conditions.
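The periodic Top-K sparsification mentioned above can be illustrated with a small routine that keeps only the strongest connections of a learned adjacency matrix. The sketch assumes the selection is done per row and by absolute value, and uses a made-up 3 × 3 matrix; the abstract does not say whether the paper prunes per row or globally, nor which K it uses.

```python
def top_k_sparsify(adjacency, k):
    """Keep the k largest-magnitude entries in each row and zero the rest."""
    sparse = []
    for row in adjacency:
        ranked = sorted(range(len(row)), key=lambda j: abs(row[j]), reverse=True)
        keep = set(ranked[:k])
        sparse.append([v if j in keep else 0.0 for j, v in enumerate(row)])
    return sparse


# Hypothetical learned adjacency for three sensors (values are illustrative).
learned = [
    [0.9, 0.1, 0.5],
    [0.2, 0.8, 0.05],
    [0.3, 0.0, 0.7],
]
pruned = top_k_sparsify(learned, k=2)
```

In a training loop, such a step would be applied every few epochs to the learnable matrix, which was initialized from the skeleton prior, so that weak connections are removed while the remaining weights keep being refined.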