Improving skeleton-based action recognition with interactive object information


Similar Papers
  • Research Article
  • Cite Count Icon 2
  • 10.1016/j.eswa.2025.128043
Exploring image and skeleton-based action recognition approaches for clinical in-bed classification of simulated epileptic seizure movements
  • Aug 1, 2025
  • Expert Systems with Applications
  • Tamás Karácsony + 9 more

  • Novel study to classify simulated Movements of Interest (MOIs) by action recognition.
  • Acquired a novel 7-class simulated seizure MOI dataset acted by 8 epileptologists.
  • Image-based vs. skeleton-based action recognition are compared for MOI classification.
  • Highlights benefits of skeleton-based action recognition with transfer learning.
  • Future work should integrate skeleton-based methods with hand gesture recognition.

Epileptic seizure classification based on seizure semiology requires automated, quantitative approaches to support the diagnosis of epilepsy, which affects 1% of the world’s population. Current approaches address the problem on a seizure level, neglecting the detailed evaluation of the classification of the underlying action features, also known as Movements of Interest (MOIs), which are critical for epileptologists in determining their classifications. Moreover, it hinders objective comparison of these approaches and attribution of performance differences due to datasets, intra-dataset MOI distribution, or architecture variations. Objective evaluation of action recognition techniques is crucial, with MOIs serving as foundational elements of semiology for clinical in-bed applications to facilitate epileptic seizure classification. However, until now, there were no MOI datasets available nor benchmarks comparing different action recognition approaches for this clinical problem. Therefore, as a pilot, we introduced a novel, simulated seizure semiology dataset carried out by 8 experienced epileptologists in an EMU bed, consisting of 7 MOI classes. We compare several computer vision methods for MOI classification, two image-based (I3D and Uniformerv2), and two skeleton-based (ST-GCN++ and PoseC3D) action recognition approaches. This study emphasizes the advantages of a 2-stage skeleton-based action recognition approach in a transfer learning setting (4 classes) and the multi-scale challenge of MOI classification (7 classes), advocating for the integration of skeleton-based methods with hand gesture recognition technologies in the future. The study’s controlled MOI simulation dataset provides us with the opportunity to advance the development of automated epileptic seizure classification systems, paving the way for enhancing their performance and having the potential to contribute to improved patient care.

  • Research Article
  • Cite Count Icon 3
  • 10.1016/j.patrec.2023.11.010
Participants-based Synchronous Optimization Network for skeleton-based action recognition
  • Nov 8, 2023
  • Pattern Recognition Letters
  • Danfeng Zhuang + 2 more


  • Research Article
  • Cite Count Icon 1
  • 10.54939/1859-1043.j.mst.csce6.2022.77-91
Hand action recognition in rehabilitation exercise method using R(2+1)D deep learning network and interactive object information
  • Dec 30, 2022
  • Journal of Military Science and Technology
  • Nguyen Sinh Huy + 6 more

Hand action recognition in rehabilitation exercises aims to automatically recognize which exercises the patient has done. This is an important step in an AI system that assists doctors in handling, monitoring, and assessing the patient's rehabilitation. The expected system uses videos from the patient's body-worn camera to recognize hand actions automatically. In this paper, we propose a model to recognize the patient's hand actions in rehabilitation exercises. It combines the results of a deep learning network that recognizes actions in RGB video, R(2+1)D, with an algorithm that detects the main interactive object in each exercise. The proposed model is implemented, trained, and tested on a dataset of rehabilitation exercises collected from patients' wearable cameras. The experimental results show that the exercise recognition accuracy is practicable, averaging 88.43% on test data independent of the training data. The proposed method outperforms a single R(2+1)D network. Furthermore, the results show a reduced rate of confusion between exercises with similar hand gestures. They also show that combining interactive object information with action recognition improves accuracy significantly.
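The abstract above does not say how the action-network output and the object detection result are combined. One possible reading is a simple late fusion, sketched below. The function name `fuse_scores`, the exercise and object names, and the blending weight are all hypothetical and serve only as an illustration, not as the authors' method.

```python
def fuse_scores(action_probs, object_scores, exercise_objects, weight=0.5):
    """Blend each exercise's action probability with the detector
    confidence for the object that exercise is assumed to involve."""
    fused = {}
    for exercise, p_action in action_probs.items():
        obj = exercise_objects[exercise]          # hypothetical exercise-to-object map
        p_object = object_scores.get(obj, 0.0)    # 0.0 if the object was not detected
        fused[exercise] = (1.0 - weight) * p_action + weight * p_object
    total = sum(fused.values())
    if total == 0.0:
        # Nothing scored above zero: fall back to a uniform distribution.
        return {k: 1.0 / len(fused) for k in fused}
    return {k: v / total for k, v in fused.items()}


# Two exercises with similar hand motion but different objects (made-up values).
action_probs = {"squeeze_ball": 0.45, "turn_bottle_cap": 0.55}
object_scores = {"ball": 0.9, "bottle": 0.1}
exercise_objects = {"squeeze_ball": "ball", "turn_bottle_cap": "bottle"}

fused = fuse_scores(action_probs, object_scores, exercise_objects)
best = max(fused, key=fused.get)
```

In this toy case the action network alone slightly prefers the wrong exercise, and the object evidence flips the decision, which is the kind of confusion reduction the abstract reports.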

  • Dissertation
  • 10.32657/10356/138384
Recognizing and predicting human actions with depth camera
  • Jan 1, 2020
  • Junwu Weng

Understanding human behavior from videos is a very important task in the computer vision community. It is a significant sub-branch of video analysis. Human behavior analysis is widely applied in many application scenarios like human-computer interaction, video surveillance, video retrieval and autonomous driving. Thanks to the development of commodity depth cameras, skeleton-based human behavior analysis has drawn considerable attention in the computer vision community recently. Skeleton action sequences are extracted from depth cameras based on pose estimation algorithms or directly detected from motion capture devices. Compared with RGB-based action sequences, skeleton-based action instances are more simplified and semantic. However, the limitation is that skeletal data provide less appearance information and little scene information. How to design suitable and stable models to understand human behavior, both body action and hand gesture, using skeletal data is an interesting and challenging topic. To understand human behavior well through action sequences, two tasks are very important, namely action recognition and action prediction. In this thesis, four different models are proposed to handle these distinctive tasks.
Due to the success of deep learning models in image recognition, most state-of-the-art methods choose deep learning as the tool for skeleton-based action recognition. However, compared with images and videos, which are composed of millions or billions of pixels, the skeleton is composed of only tens of joints and is thus of much lower complexity. For such lightweight data, non-parametric models like Naive-Bayes Nearest Neighbor (NBNN) may be more suitable than deep learning models with high complexity. In the first two works of this thesis, two robust NBNN-based models, ST-NBNN and ST-NBMIM, are proposed to characterize skeleton sequences. Besides, to better understand skeleton-based actions, bilinear classifiers are adopted to identify both key temporal stages and spatial joints for action classification. Although only using a linear classifier, experiments on five benchmark datasets show that by combining the strength of both non-parametric and parametric models, ST-NBNN and ST-NBMIM can achieve competitive performance compared with state-of-the-art results using sophisticated models such as deep learning. Moreover, by identifying key skeleton joints and temporal stages for each action class, the two NBNN-based models can capture the essential spatio-temporal patterns that play key roles in recognizing actions, which is not always achievable with end-to-end models.
When facing large-scale skeleton data, the non-parametric model reaches its limit, and deep-learning-based models demonstrate their superior performance on large datasets. Meanwhile, human body movements exhibit spatial patterns among pose joints. It is thus of great importance to identify those motion patterns and avoid the non-informative joints by identifying the key combinations of joints that matter for recognition. Although key spatio-temporal pattern discovery has been explored previously for skeleton-based action recognition, the temporal dynamics modeling of key joint combinations is not well researched in the community. In the third work of this thesis, a CNN model is proposed to adaptively search key pose joints for each action sequence. The work utilizes the deep-learning technique to train a deformable CNN model to discover sample-related key spatio-temporal patterns for action recognition. This deformable convolution better utilizes the contextual joints for action and gesture recognition and is more robust to noisy joints. The proposed model is evaluated on three benchmark datasets and the experimental results show the effectiveness of introducing temporal dynamics modeling of key joint combinations into skeleton-based action recognition.
The goal of early action recognition is to predict the action label when the sequence is only partially observed. Existing methods treat the early action recognition task as sequential classification problems on different observation ratios of an action sequence. Since these models are trained by differentiating positive categories from all negative classes, the diverse information of different negative categories is ignored, which we believe can be collected to help improve the recognition performance. In the last work of this thesis, a new direction, introducing category exclusion to early action recognition, is explored. The category exclusion is modeled as a mask operation on the classification probability output of a pre-trained early action recognition classifier. Specifically, policy-based reinforcement learning is utilized to train an agent. The agent generates a series of binary masks to exclude interfering negative categories during action execution and hence helps improve the recognition accuracy. The proposed method is evaluated on three benchmark recognition datasets, and it enhances the recognition accuracy consistently over all different observation ratios on the three datasets, where the accuracy improvements in the early stages are especially significant.
In summary, this thesis demonstrates the superior performance of the proposed four methods, including the ST-NBNN, ST-NBMIM, Deformable Pose Traversal Convolution, and the Category Exclusion Agent, for the tasks of action recognition and action prediction of skeleton-based sequences. These four models are extensively evaluated on well-known benchmark datasets and the experimental results show the effectiveness of these models on their corresponding tasks.
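The category exclusion step described in the thesis abstract above is a mask applied to a classifier's probability output. A minimal sketch of that single operation follows. The function name `apply_exclusion_mask` is hypothetical, the mask is supplied by hand rather than produced by a trained reinforcement learning agent, and renormalizing the surviving probabilities is an assumption that the abstract does not state.

```python
def apply_exclusion_mask(probs, mask):
    """Zero out excluded categories (mask value 0) and renormalize the rest.

    Renormalization is an assumption of this sketch, not something the
    abstract specifies.
    """
    if len(probs) != len(mask):
        raise ValueError("probs and mask must have the same length")
    kept = [p * m for p, m in zip(probs, mask)]
    total = sum(kept)
    if total == 0.0:
        # Every category was excluded: fall back to the unmasked output.
        return list(probs)
    return [k / total for k in kept]


# Classifier output at an early observation ratio (made-up values).
probs = [0.40, 0.35, 0.25]
# A binary mask that excludes category 0 as an interfering negative.
mask = [0, 1, 1]

masked = apply_exclusion_mask(probs, mask)
predicted = masked.index(max(masked))
```

Here the unmasked prediction would be category 0, and excluding it moves the prediction to category 1.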

  • Research Article
  • Cite Count Icon 5
  • 10.1155/2021/2290304
A Lightweight Hierarchical Model with Frame-Level Joints Adaptive Graph Convolution for Skeleton-Based Action Recognition
  • Nov 1, 2021
  • Security and Communication Networks
  • Yujian Jiang + 3 more

In skeleton-based human action recognition methods, human behaviours can be analysed through temporal and spatial changes in the human skeleton. Skeletons are not limited by clothing changes, lighting conditions, or complex backgrounds. This recognition method is robust and has aroused great interest; however, many existing studies used deep-layer networks with large numbers of required parameters to improve the model performance and thus lost the low-computation advantage of skeleton data. It is difficult to deploy previously established models to real-life applications based on low-cost embedded devices. To obtain a model with fewer parameters and a higher accuracy, this study designed a lightweight frame-level joints adaptive graph convolutional network (FLAGCN) model to solve skeleton-based action recognition tasks. Compared with the classical 2s-AGCN model, the new model obtained a higher precision with 1/8 of the parameters and 1/9 of the floating-point operations (FLOPs). Our proposed network features three main improvements. First, a previous feature-fusion method replaces the multistream network and reduces the number of required parameters. Second, at the spatial level, two kinds of graph convolution methods capture different aspects of human action information. A frame-level graph convolution constructs a human topological structure for each data frame, whereas an adjacency graph convolution captures the characteristics of the adjacent joints. Third, the model proposed in this study hierarchically extracts different levels of action sequence features, making the model clear and easy to understand; further, it reduces the depth of the model and the number of parameters. A large number of experiments on the NTU RGB + D 60 and 120 data sets show that this method has the advantages of few required parameters, low computational costs, and fast speeds. It also has a simple structure and training process that make it easy to deploy in real-time recognition systems based on low-cost embedded devices.
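To make the idea of a graph convolution over adjacent joints concrete, the sketch below applies one generic layer to a toy three-joint skeleton. It uses the common symmetric normalization of the adjacency matrix with self-loops and is not the FLAGCN layer itself. The function names, the bone list, and the identity weight matrix are assumptions chosen for illustration.

```python
import math


def normalized_adjacency(num_joints, bones):
    """Build D^-1/2 (A + I) D^-1/2 for an undirected skeleton graph."""
    a = [[1.0 if i == j else 0.0 for j in range(num_joints)]
         for i in range(num_joints)]              # self-loops
    for i, j in bones:
        a[i][j] = 1.0
        a[j][i] = 1.0
    deg = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(num_joints)]
            for i in range(num_joints)]


def matmul(x, y):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]


def graph_conv(features, adj_norm, weight):
    """One layer: aggregate neighbouring joints, project, then apply ReLU."""
    mixed = matmul(adj_norm, features)
    projected = matmul(mixed, weight)
    return [[max(0.0, v) for v in row] for row in projected]


# Toy chain of three joints: 0 - 1 - 2, with 2-D features per joint.
bones = [(0, 1), (1, 2)]
adj = normalized_adjacency(3, bones)
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weight = [[1.0, 0.0], [0.0, 1.0]]   # identity projection, for readability
out = graph_conv(features, adj, weight)
```

Joints 0 and 2 are not connected, so `adj[0][2]` is zero and joint 0 only mixes in features from itself and joint 1.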

  • Research Article
  • Cite Count Icon 2
  • 10.3390/s22239249
Lightweight Semantic-Guided Neural Networks Based on Single Head Attention for Action Recognition
  • Nov 28, 2022
  • Sensors (Basel, Switzerland)
  • Seon-Bin Kim + 3 more

Skeleton-based action recognition can achieve a relatively high performance by transforming the human skeleton structure in an image into a graph and applying action recognition based on structural changes in the body. Among the many graph convolutional network (GCN) approaches used in skeleton-based action recognition, semantic-guided neural networks (SGNs) are fast action recognition algorithms that hierarchically learn spatial and temporal features by applying a GCN. However, because an SGN focuses on global feature learning rather than local feature learning owing to its structural characteristics, it is limited for actions in which the dependency between neighbouring nodes is important. To solve this problem and simultaneously achieve real-time action recognition on low-end devices, this study proposes a single head attention (SHA) that can overcome the limitations of an SGN, and presents a new SGN-SHA model that combines SHA with an SGN. In experiments on various action recognition benchmark datasets, the proposed SGN-SHA model significantly reduced the computational complexity while exhibiting a performance similar to that of an existing SGN and other state-of-the-art methods.
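The abstract does not describe the internals of the SHA module. As a point of reference, the sketch below shows generic single-head scaled dot-product attention over joint features, where every joint attends to every other joint. The function names are hypothetical, and the joint features are used directly as queries, keys, and values, without the learned projections a real model would have.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]


def single_head_attention(q, k, v):
    """Scaled dot-product attention with a single head.

    q, k, v are lists of per-joint feature vectors. Returns the attended
    features and the joint-to-joint attention weights.
    """
    d = len(q[0])
    scores = [[sum(qi[c] * kj[c] for c in range(d)) / math.sqrt(d) for kj in k]
              for qi in q]
    weights = [softmax(r) for r in scores]
    out = [[sum(w[j] * v[j][c] for j in range(len(v))) for c in range(len(v[0]))]
           for w in weights]
    return out, weights


# Three joints with 2-D features (made-up values).
joints = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended, weights = single_head_attention(joints, joints, joints)
```

Each row of `weights` sums to one and shows how strongly a joint draws on each of the others.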

  • Research Article
  • 10.3390/math10213923
Interactive Learning of a Dual Convolution Neural Network for Multi-Modal Action Recognition
  • Oct 22, 2022
  • Mathematics
  • Qingxia Li + 4 more

RGB and depth modalities contain more abundant and interactive information, and convolutional neural networks (ConvNets) based on multi-modal data have achieved successful progress in action recognition. Due to the limitation of a single stream, it is difficult to improve recognition performance by learning multi-modal interactive features. Inspired by the multi-stream learning mechanism and spatial-temporal information representation methods, we construct dynamic images by using the rank pooling method and design an interactive learning dual-ConvNet (ILD-ConvNet) with a multiplexer module to improve action recognition performance. Built on the rank pooling method, the constructed visual dynamic images can capture the spatial-temporal information from entire RGB videos. We extend this method to depth sequences to obtain more abundant multi-modal spatial-temporal information as the inputs of the ConvNets. In addition, we design a dual ILD-ConvNet with multiplexer modules to jointly learn the interactive features of the two streams from RGB and depth modalities. The proposed recognition framework has been tested on two benchmark multi-modal datasets: NTU RGB + D 120 and PKU-MMD. The proposed ILD-ConvNet with a temporal segmentation mechanism achieves an accuracy of 86.9% and 89.4% for Cross-Subject (C-Sub) and Cross-Setup (C-Set) on NTU RGB + D 120, and 92.0% and 93.1% for Cross-Subject (C-Sub) and Cross-View (C-View) on PKU-MMD, which are comparable with the state of the art. The experimental results show that our proposed ILD-ConvNet with a multiplexer module can extract interactive features from different modalities to enhance action recognition performance.
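A dynamic image summarizes a whole clip as one image whose pixels encode how the frames change over time. The sketch below uses the closed-form weighting often called approximate rank pooling, in which frame t of T receives the coefficient 2t - T - 1. Whether the paper uses this shortcut or a learned ranking function is not stated in the abstract, so treat the choice as an assumption. Frames are flat lists of pixel values to keep the example dependency-free.

```python
def dynamic_image(frames):
    """Collapse a frame sequence into one image via approximate rank pooling.

    Frame t (1-based) of T is weighted by 2t - T - 1, so late frames count
    positively, early frames negatively, and static pixels cancel out.
    """
    num_frames = len(frames)
    coeffs = [2 * t - num_frames - 1 for t in range(1, num_frames + 1)]
    num_pixels = len(frames[0])
    return [sum(c * frame[p] for c, frame in zip(coeffs, frames))
            for p in range(num_pixels)]


# Three frames of two pixels: pixel 0 brightens over time, pixel 1 is static.
frames = [[0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
image = dynamic_image(frames)
```

The same function could be applied to a depth sequence, which matches the abstract's extension of the method from RGB to depth.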

  • Research Article
  • Cite Count Icon 17
  • 10.1109/tpami.2024.3466212
InfoGCN++: Learning Representation by Predicting the Future for Online Skeleton-Based Action Recognition.
  • Jan 1, 2025
  • IEEE Transactions on Pattern Analysis and Machine Intelligence
  • Seunggeun Chi + 3 more

Skeleton-based action recognition has made significant advancements recently, with models like InfoGCN showcasing remarkable accuracy. However, these models exhibit a key limitation: they necessitate complete action observation prior to classification, which constrains their applicability in real-time situations such as surveillance and robotic systems. To overcome this barrier, we introduce InfoGCN++, an innovative extension of InfoGCN, explicitly developed for online skeleton-based action recognition. InfoGCN++ augments the abilities of the original InfoGCN model by allowing real-time categorization of action types, independent of the observation sequence's length. It transcends conventional approaches by learning from current and anticipated future movements, thereby creating a more thorough representation of the entire sequence. Our approach to prediction is managed as an extrapolation issue, grounded on observed actions. To enable this, InfoGCN++ incorporates Neural Ordinary Differential Equations, a concept that lets it effectively model the continuous evolution of hidden states. Following rigorous evaluations on three skeleton-based action recognition benchmarks, InfoGCN++ demonstrates exceptional performance in online action recognition. It consistently equals or exceeds existing techniques, highlighting its significant potential to reshape the landscape of real-time action recognition applications. Consequently, this work represents a major leap forward from InfoGCN, pushing the limits of what's possible in online, skeleton-based action recognition.
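A Neural Ordinary Differential Equation treats the hidden state as a quantity that evolves continuously under a learned dynamics function, which is what lets a model extrapolate beyond the frames observed so far. The sketch below shows only the integration step, using a fixed-step Euler solver and a hand-written decay function in place of a trained network. The names `euler_integrate` and `decay` are hypothetical, and real implementations typically use adaptive solvers.

```python
import math


def euler_integrate(dynamics, h0, t0, t1, steps):
    """Advance hidden state h from time t0 to t1 under dh/dt = dynamics(h, t)."""
    h = list(h0)
    dt = (t1 - t0) / steps
    t = t0
    for _ in range(steps):
        dh = dynamics(h, t)
        h = [hi + dt * di for hi, di in zip(h, dh)]
        t += dt
    return h


def decay(h, t):
    """Stand-in for a learned dynamics network: simple exponential decay."""
    return [-hi for hi in h]


# Extrapolate a one-dimensional hidden state one time unit into the future.
h_future = euler_integrate(decay, [1.0], 0.0, 1.0, 1000)
exact = math.exp(-1.0)
```

With 1000 steps the Euler result stays within about 2e-4 of the exact solution for this toy dynamics function.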

  • Research Article
  • Cite Count Icon 94
  • 10.1016/j.neucom.2023.03.001
Transformer for Skeleton-based action recognition: A review of recent advances
  • Mar 30, 2023
  • Neurocomputing
  • Wentian Xin + 5 more


  • Research Article
  • Cite Count Icon 16
  • 10.1016/j.neucom.2022.07.046
Adaptive spatiotemporal graph convolutional network with intermediate aggregation of multi-stream skeleton features for action recognition
  • Jul 16, 2022
  • Neurocomputing
  • Yukai Zhao + 4 more


  • Research Article
  • Cite Count Icon 80
  • 10.1016/j.neucom.2022.09.071
Action recognition based on RGB and skeleton data sets: A survey
  • Sep 20, 2022
  • Neurocomputing
  • Rujing Yue + 2 more


  • Conference Article
  • 10.1117/12.2554374
Study on holographic special-purpose computer for wavefront printing technology
  • Apr 6, 2020
  • Yasuyuki Ichihashi + 6 more

We can record digitally designed information of three-dimensional (3D) objects or optical elements on a holographic photosensitive material by using wavefront printing technology. However, the hologram data generated from the digitally designed information are very large, and unnecessary bidirectional communications often occur. To solve this problem, we studied a special-purpose computer for wavefront printing technology. This technique consists of generating the light-ray information from digitally designed information of 3D objects, converting the light-ray information to the wavefront information and generating the hologram data locally from the wavefront information in interaction. In this paper, we designed the emulator of the special-purpose computer for wavefront printing technology and obtained the amount of information (the number of bits) required for the circuit by comparing the 3D images reconstructed from the holograms generated by the emulator. As a result, the amount of wavefront information converted from the light-ray information most affected the quality of the 3D images reconstructed from the holograms generated by the emulator, and we can design an emulator that reduces the noise component in those 3D images. In the future, we will design the special-purpose computer for wavefront printing technology by using a hardware description language and implement it on a programmable logic device such as a field programmable gate array.

  • Conference Article
  • Cite Count Icon 11
  • 10.1109/siprocess.2019.8868548
Skeleton-based Action Recognition with Lie Group and Deep Neural Networks
  • Jul 1, 2019
  • Yanshan Li + 3 more

Skeleton-based action recognition has always been an important research topic of computer vision since the skeleton data is more robust to illumination and rotation. Traditional action recognition methods mainly rely on manual features. Among those methods, the skeleton feature representation modeled on Lie group can effectively describe the three-dimensional geometric relationship between joints. In recent years, deep learning methods such as Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN) and Long Short-Term Memory Networks (LSTM) have also achieved good performance in action recognition. In order to obtain more spatio-temporal information, we combine manual features and deep learning methods to propose a deep neural network LS-LieNet with CNN and Bi-direction LSTM (Bi-LSTM) based on the LieNet [1] network. First, the LS-LieNet network inputs the extracted Lie group representation of skeleton into a special CNN network which is designed for Lie group. Second, the transformed Lie algebra features are fed into the Bi-LSTM network before the fully connected layer of CNN. Then, the predicted labels and scores of the two network softmax layers are merged to effectively recognize the action. The experiment results on the standard 3D human action dataset show that the proposed LS-LieNet can efficiently improve the accuracy of action recognition.

  • Research Article
  • Cite Count Icon 5
  • 10.1109/access.2021.3111633
Skeleton-Based Action Recognition With Low-Level Features of Adaptive Graph Convolutional Networks
  • Jan 1, 2021
  • IEEE Access
  • Jialin Gang + 3 more

Skeleton-based action recognition is a typical classification problem which plays a significant role in human-computer interaction and video understanding. Since a human skeleton has natural graph features, methods based on graph convolutional networks (GCN) are widely applied in skeleton-based action recognition. Previous studies mainly focus on structural links in GCN to generate high-level features of the human skeleton. However, low-level features are also important in many applications. For instance, low-level edge gradient and color information are important for image classification. This paper introduces a multi-branch structure to capture different low-level features of the human skeleton. We combine both high-level and low-level features to recognize human action. We validate our method in action recognition with two skeleton datasets, NTU-RGB+D and Kinetics. Experimental results indicate that the proposed method achieves considerable improvement over some state-of-the-art methods.

  • Research Article
  • Cite Count Icon 4
  • 10.1145/3700878
Joint Mixing Data Augmentation for Skeleton-Based Action Recognition
  • Mar 10, 2025
  • ACM Transactions on Multimedia Computing, Communications, and Applications
  • Linhua Xiang + 1 more

Skeleton-based action recognition is beneficial for understanding human behavior in videos, and thus has received much attention in recent years as an important research area in action recognition. Current research focuses on designing more advanced algorithms to better extract spatio-temporal information from skeleton data. However, because existing skeleton datasets are small and effective data augmentation methods are lacking, models overfit easily during training. To address this challenge, we propose a mix-based data augmentation method, Joint Mixing Data Augmentation (JMDA), which can generally improve the effectiveness and robustness of various skeleton-based action recognition algorithms. In terms of spatial information, we introduce SpatialMix (SM), a method that projects the original 3D skeleton discrete information into a 2D space. Then, SM mixes the projected spatial information between two random samples during the training process to achieve the spatial-based mixing data augmentation. Concerning temporal information, we propose TemporalMix (TM). Leveraging the temporal continuity in skeleton data, we perform a temporal resize operation on the original skeleton data, and then merge two random samples during training to achieve the temporal-based mixed data augmentation. Additionally, we analyze the Feature Mismatch (FM) problem caused by introducing mix-based data augmentation into skeleton data. Then we propose a new data preprocessing method called Feature Alignment (FA) to effectively address this problem and improve model performance. Moreover, we propose a novel training pipeline, Joint Training Strategy (JTS), which combines multiple mix-based data augmentation methods for further improvement of model performance. Specifically, our proposed JMDA is plug-and-play and widely applicable to skeleton-based action recognition models. At the same time, the application of JMDA does not increase the model parameters and there is almost no additional training cost. We conduct extensive experiments on NTU RGB+D 60 and NTU RGB+D 120 datasets to demonstrate the effectiveness and robustness of the proposed JMDA on several mainstream skeleton-based action recognition algorithms.
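The TemporalMix description above mentions a temporal resize followed by merging two samples, without giving the exact procedure. The sketch below is one plausible reading: each sequence is resampled to a share of the output length by nearest-frame indexing, the two pieces are concatenated, and the label weight follows the share of frames. The function names, the concatenation order, and the label weighting are assumptions, and frames are represented by string tags instead of joint coordinates.

```python
def temporal_resize(seq, new_len):
    """Resample a frame sequence to new_len frames by nearest-index sampling."""
    old_len = len(seq)
    return [seq[min(old_len - 1, (i * old_len) // new_len)] for i in range(new_len)]


def temporal_mix(seq_a, seq_b, lam, length):
    """Merge two resized sequences into one clip of the given length.

    lam is the share of frames taken from seq_a. Returns the mixed clip and
    the label weight for seq_a's class (1 - weight goes to seq_b's class).
    """
    if length < 2:
        raise ValueError("length must be at least 2 to hold both samples")
    len_a = max(1, min(length - 1, round(lam * length)))
    len_b = length - len_a
    mixed = temporal_resize(seq_a, len_a) + temporal_resize(seq_b, len_b)
    return mixed, len_a / length


# Two four-frame clips; each string stands in for one skeleton frame.
clip_a = ["a0", "a1", "a2", "a3"]
clip_b = ["b0", "b1", "b2", "b3"]
mixed, weight_a = temporal_mix(clip_a, clip_b, lam=0.5, length=4)
```

A training loop would then weight the loss for the two source labels by `weight_a` and `1 - weight_a`, as is usual for mix-based augmentation.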
