Articles published on activity-recognition
14514 Search results
- Research Article
- 10.3389/fmars.2026.1749059
- Apr 28, 2026
- Frontiers in Marine Science
- H Seckin Demir + 4 more
Automated recognition of animal behaviors is an important computer vision task that improves ecological monitoring and behavioral analysis. Compared to generic human action recognition, these applications often suffer from severe constraints such as low onsite computational power, limited data availability for training learning-based models, and suboptimal image quality due to environmental conditions. For sea turtles, behavior in relation to fishing gear is particularly important for understanding and reducing bycatch (incidental take) and the associated mortality. Monitoring such behavior underwater is challenging because viewing angles vary over time, and pose and motion trajectories are highly dependent on the camera angle. In this study, we address this problem with a compact pose-to-action pipeline that detects a small set of turtle morphological keypoints in each frame, and then classifies short sequences of keypoints into U-turn, reversal, or other routine behaviors. We employ the YOLOv8n pose model for keypoint detection and a shallow fully connected network for classifying the behavior types. Our two-stage training strategy allows us to train our pose estimation network with real data while optimizing the behavior recognition network with both real annotated clips and a large set of simulated trajectories including various camera geometries and motion parameters. We further reduce the computational requirements by finding a balance between the input frame rate and recognition accuracy. Our experimental results show that we can achieve 93.2% recognition accuracy with a minimum frame rate requirement of 2.86 fps.
- Research Article
- 10.1007/s10044-026-01674-3
- Apr 28, 2026
- Pattern Analysis and Applications
- Yuting Wang + 3 more
Tgunet: hierarchical temporal modeling with transformer-GRU for CSI-based fine-grained human activity recognition
- Research Article
- 10.36548/jtcsst.2026.2.007
- Apr 27, 2026
- Journal of Trends in Computer Science and Smart Technology
- Swati Gautam + 1 more
In this paper, an adaptive gated multimodal fusion framework for generalized and robust multimodal Human Activity Recognition (HAR) systems with heterogeneous sensing modalities like skeletal pose estimation and inertial measurement units (IMUs) is proposed. The proposed framework employs deterministic data harmonization through anterior harmonization and a reliability-based gated multimodal fusion mechanism to improve the robustness and generalization capability of multimodal HAR systems. The proposed gated multimodal fusion mechanism has been mathematically derived to approximate the inverse-variance weighting mechanism to obtain stability in the presence of modality-dependent noise and avoid posterior domain adaptation techniques. To improve temporal alignment between multimodal data streams, frequency domain analysis has been used to justify resampling at a unified 30 Hz rate to meet the Nyquist criterion. The proposed framework has been evaluated using the NTU RGB+D 120, UTD-MHAD, and PAMAP2 datasets with statistically significant results over static baselines (p<0.05, d=2.1), and low computational costs to meet edge-constrained IoT sensing requirements.
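The abstract above refers to inverse-variance weighting and to the Nyquist criterion without spelling either out. The following is a minimal sketch of the textbook forms of both, not the authors' gated fusion mechanism; the function names and the example numbers are hypothetical and chosen only for illustration.

```python
def inverse_variance_fuse(estimates, variances):
    """Fuse per-modality estimates with weights proportional to 1 / variance."""
    if not estimates or len(estimates) != len(variances):
        raise ValueError("need one variance per estimate")
    if any(v <= 0 for v in variances):
        raise ValueError("variances must be positive")
    precisions = [1.0 / v for v in variances]
    total = sum(precisions)
    weights = [p / total for p in precisions]
    fused = sum(w * e for w, e in zip(weights, estimates))
    fused_variance = 1.0 / total  # never larger than the smallest input variance
    return fused, weights, fused_variance


def satisfies_nyquist(sample_rate_hz, max_signal_hz):
    """True if the sampling rate exceeds twice the highest signal frequency."""
    return sample_rate_hz > 2.0 * max_signal_hz


# Hypothetical scores: a noisier skeleton stream and a cleaner IMU stream.
fused, weights, fused_variance = inverse_variance_fuse([0.80, 0.60], [0.04, 0.01])
# A 30 Hz rate covers content below 15 Hz (the 12 Hz and 20 Hz values are assumptions).
ok_12hz = satisfies_nyquist(30.0, 12.0)
ok_20hz = satisfies_nyquist(30.0, 20.0)
```

In this example the less noisy modality receives the larger weight, which is the stabilizing behavior the abstract attributes to its reliability-based gate.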
- Research Article
- 10.1142/s021951942650048x
- Apr 24, 2026
- Journal of Mechanics in Medicine and Biology
- Chenxi Lu + 1 more
To address the problem of low recognition rate caused by the difficulty in capturing high-speed and subtle movements in table tennis, this work proposes a motion recognition method based on multimodal data and an optimized Spatial-Temporal Graph Convolutional Network (ST-GCN). The model introduces a Multi-Level Graph Convolutional Network (ML-GCN) architecture and constructs cross-level feature extraction channels, which effectively capture the spatiotemporal correlations between local subtle movements and global trajectories. The built-in hybrid attention mechanism realizes precise focusing on key skeletal nodes and core motion frames through adaptive weight assignment. Combined with the multimodal fusion strategy of visual signals and inertial sensor data, it significantly enhances the robustness of the model in scenarios with line-of-sight occlusion and motion blur. Test results based on a self-built multimodal table tennis dataset show that this method achieves an accuracy of 88.2%, a recall rate of 89.5% and an F1-score of 88.3%. This performance is significantly superior to the original ST-GCN and existing mainstream motion recognition algorithms, which confirms the core role of each optimization module in improving feature representation capability and computational efficiency. The study provides an efficient technical solution for the intelligent analysis of complex sports movements.
- Research Article
- 10.1142/s0219519426500454
- Apr 23, 2026
- Journal of Mechanics in Medicine and Biology
- Chuihu Yin + 2 more
In college physical education teaching, the assessment of college students’ athletic ability still faces problems of strong subjectivity in scoring and difficulty in quantifying complex movement characteristics. Traditional methods struggle to capture multi-granularity information of skeletons in continuous movements, and the cost of acquiring annotated data is high. This study aims to construct a quantitative assessment model for athletic ability oriented to college physical education classrooms. It realizes the joint optimization of action recognition and athletic ability scoring through the multi-granularity spatial-temporal graph convolutional network (MG-STGC). The MG-STGC model uses an encoder to extract joint-level, limb-level, and body-level features, and combines labeled and unlabeled data via semi-supervised learning strategies. The athletic ability assessment module can generate continuous scores across four dimensions: strength, stability, standardization, and coordination. These scores are obtained through spatiotemporal statistical mapping of historical action segments and skeleton features, providing data support for individualized training. On the NTU RGB+D dataset, MG-STGC achieves Top-1 accuracies of 95.6% and 89.7% on the X-view and X-sub benchmarks, respectively. On the FineGym dataset, it reaches 80.5% Top-1 accuracy on the Gym99 subset and 75.4% on the Gym288 subset, with category average accuracies of 69.8% and 62.6%, respectively, outperforming baseline models. Ablation experiments show that the granularity information fusion module and parameters have an impact on model performance. The results show that MG-STGC can efficiently capture multi-granularity information of action skeletons and provide an objective and quantitative method for athletic ability assessment in college physical education classrooms.
MG-STGC also lays a feasible theoretical and practical foundation for intelligent physical education teaching and personalized training.
- Research Article
- 10.64751/ajmimc.2026.v5.n2(1).295
- Apr 23, 2026
- American Journal of Management and IOT Medical Computing
- C Vijaya Raj + 3 more
The widespread use of wearable sensor technologies has led to an exponential increase in continuously generated motion data, creating new opportunities for automated human activity recognition in areas such as healthcare monitoring, fitness tracking, assisted living, and smart environments. Conventional approaches that rely on manual observation or rule-based classification with predefined thresholds are often inadequate for handling complex activity patterns, sensor noise, and high-dimensional data streams, resulting in reduced accuracy and limited generalization capabilities. A major challenge in this domain is the reliable interpretation of continuous, multidimensional sensor data under dynamic conditions, including variations in user behavior, device orientation, and sensor placement. To address these limitations, this study proposes a robust and scalable machine learning-based framework for activity classification that leverages multiple algorithms, including Greedy Tree (GT), K-Nearest Neighbors (KNN), Logistic Regression (LR), Naïve Bayes (NB), and Adaptive Boosting (AB). The system is designed as an end-to-end pipeline incorporating essential stages such as data cleaning, normalization, feature scaling, model training, performance evaluation, and prediction. By systematically comparing different models, the results demonstrate that the Greedy Tree classifier significantly outperforms other techniques, achieving an accuracy of 99.00% on the target activity variable, while KNN, NB, LR, and AB achieve comparatively lower accuracies of 77.85%, 57.40%, 51.10%, and 51.02%, respectively. This indicates the superior capability of tree-based models in capturing complex patterns and decision boundaries within sensor data. 
Overall, the proposed framework enhances classification accuracy, improves robustness against noisy and variable data, and ensures scalability for real-time as well as batch processing, making it highly suitable for deployment in modern intelligent monitoring systems.
- Research Article
- 10.1038/s41598-026-49915-z
- Apr 22, 2026
- Scientific Reports
- Chen Junzhang + 1 more
A hybrid framework combining adaptive graph learning and global temporal attention for skeleton-based action recognition.
- Research Article
- 10.1038/s41598-026-48833-4
- Apr 22, 2026
- Scientific Reports
- Sai Zhang + 3 more
Multivariate time series classification (MTSC) is a critical task in fields such as human activity recognition, medical diagnosis, and industrial process monitoring. Its core problem lies in effectively capturing the complex nonlinear dynamics within and between multidimensional variables. Reservoir computing (RC), as an efficient feature extraction model, offers advantages including low computational resource requirements and fast training speeds. However, its performance remains highly dependent on hyperparameter tuning, and standard models may fail to fully leverage the global statistical properties of sequences. In order to address these problems, we propose HERA (Hybrid Euler Reservoir Architecture), a novel classification framework specifically designed for MTSC. At the core of HERA lies an innovative hybrid feature design that synergistically integrates two complementary types of information: (1) Dynamic representations of inter-variable interactions captured by the Euler State Network (EuSN); (2) Static statistical features summarizing the global characteristics of each independent variable. Additionally, the architecture embeds a self-optimization module driven by the Covariance Matrix Adaptation Evolution Strategy (CMA-ES) to efficiently determine optimal model configurations. Extensive experiments across multiple public MTSC benchmark datasets demonstrate that HERA achieves highly competitive classification accuracy, performing on par with or exceeding various strong baseline models.
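The hybrid feature idea in the abstract above, a dynamic reservoir state combined with static per-variable statistics, can be sketched in a few lines. This is an illustrative sketch only: the state update follows the commonly described Euler State Network form (an antisymmetric recurrent matrix with a small diffusion term), and the choice of statistics, the hyperparameter values, and all function names are assumptions rather than details taken from HERA. The CMA-ES tuning step is not shown.

```python
import math
import random
import statistics


def make_reservoir(n_units, n_inputs, seed=0):
    """Random, untrained recurrent and input weights (fixed after creation)."""
    rng = random.Random(seed)
    w = [[rng.uniform(-1, 1) for _ in range(n_units)] for _ in range(n_units)]
    v = [[rng.uniform(-1, 1) for _ in range(n_inputs)] for _ in range(n_units)]
    return w, v


def eusn_final_state(series, w, v, eps=0.1, gamma=0.01):
    """Run h <- h + eps * tanh((W - W^T - gamma*I) h + V x) over the series."""
    n = len(w)
    h = [0.0] * n
    for x in series:
        pre = []
        for i in range(n):
            recurrent = sum((w[i][j] - w[j][i]) * h[j] for j in range(n))
            recurrent -= gamma * h[i]
            driven = sum(v[i][k] * x[k] for k in range(len(x)))
            pre.append(recurrent + driven)
        h = [h[i] + eps * math.tanh(pre[i]) for i in range(n)]
    return h


def static_stats(series):
    """Mean, population std, min and max of each variable (assumed feature set)."""
    feats = []
    for column in zip(*series):
        feats += [statistics.fmean(column), statistics.pstdev(column),
                  min(column), max(column)]
    return feats


def hybrid_features(series, w, v):
    """Concatenate the dynamic reservoir state with the static statistics."""
    return eusn_final_state(series, w, v) + static_stats(series)


# Toy series: 20 time steps, 3 variables; 8 reservoir units.
w, v = make_reservoir(n_units=8, n_inputs=3, seed=0)
series = [[math.sin(0.3 * t), math.cos(0.2 * t), 0.1 * t] for t in range(20)]
features = hybrid_features(series, w, v)  # 8 dynamic + 3 * 4 static values
```

The resulting vector would then be passed to a simple trained readout such as a linear classifier, which is the usual arrangement in reservoir computing.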
- Research Article
- 10.1145/3811542
- Apr 21, 2026
- ACM Transactions on Internet of Things
- Moid Sandhu + 6 more
Wearable devices are becoming more prevalent in people’s daily lives, particularly in applications such as activity recognition, health monitoring and fitness management. However, the majority of existing wearable devices remain heavily dependent on battery-based power sources, which introduces several practical and sustainability challenges. Frequent battery replacement or recharging imposes inconvenience on users, increases long-term operational costs, and contributes to environmental concerns associated with battery disposal and resource consumption. To address these issues, we present KineticWear, the first battery-free wearable system that utilises kinetic energy harvested from human activities both as the sole energy source and as a sensing signal for on-device human activity recognition (HAR). Based on a careful end-to-end design of all hardware and software components, KineticWear achieves real-time HAR on an ultra-low-power microcontroller unit (MCU), including on-board classification and transmission of the inferred activity over a wireless link. Using empirical data, we find that decision tree (DT) and convolutional neural network (CNN) models offer activity recognition accuracies of 87% and 99.5%, respectively. Systematic real-world experiments demonstrate that KineticWear harvests sufficient energy to operate the wearable device up to 95.2% of the time, and that the device can infer and report an ongoing activity within 8 seconds using the DT classification algorithm, whose classification time is three orders of magnitude shorter than that of the CNN. Thus, KineticWear offers significantly enhanced performance compared to state-of-the-art off-device activity recognition systems powered by kinetic energy harvesting.
- Research Article
- 10.55041/ijsrem60740
- Apr 21, 2026
- INTERNATIONAL JOURNAL OF SCIENTIFIC RESEARCH IN ENGINEERING AND MANAGEMENT
- Dinesh R Thorat + 4 more
In recent times, maintaining safety and academic integrity within educational institutions has become a critical challenge. While CCTV cameras are extensively deployed across campuses and examination halls, they rely heavily on manual monitoring, which is inefficient, prone to human error, and causes delayed responses to critical events. This paper proposes a novel, multi-scenario automated surveillance framework based on advanced deep learning and computer vision techniques. The proposed system is designed to operate in two distinct modes: a Campus Safety mode and an Academic Integrity mode. For violence detection, the architecture integrates YOLOv7 for rapid human detection with a MobileNet-BiLSTM classifier to accurately recognize violent actions. For examination monitoring, the system utilizes Deep Keyframe Detection combined with a Multilayer Perceptron (MLP) enhanced YOLOv8 algorithm (SE-YOLOv8) and a ResNet-based 3D CNN to identify subtle cheating behaviors like passing notes or unauthorized looking. By unifying these dual pipelines into a single deployable framework, this study outlines a scalable solution that minimizes manual invigilation, enhances real-time threat detection, and ensures a secure, fair educational environment. Key Words: Multi-Scenario Surveillance, Deep Learning, Violence Detection, Smart Proctoring, YOLO Framework, Action Recognition, Deep Keyframe Detection, Computer Vision.
- Research Article
- 10.55041/ijsrem59705
- Apr 21, 2026
- INTERNATIONAL JOURNAL OF SCIENTIFIC RESEARCH IN ENGINEERING AND MANAGEMENT
- Suraj Kumar Chaubey + 4 more
Sign language serves as the primary and natural medium of communication for the Deaf and Hard of Hearing (DHH) community worldwide. Despite its linguistic richness and social significance, the automated recognition of sign language by computational systems remains a formidable research challenge. This paper presents a real-time Sign Language Detection system grounded in action recognition principles and powered by Long Short-Term Memory (LSTM) deep learning networks. The system leverages MediaPipe Holistic for accurate multi-landmark extraction across hands, face, and body, and employs a deep stacked LSTM architecture to model the temporal dynamics and sequential dependencies inherent in sign language gestures from continuous live video. A comprehensive training pipeline encompassing video acquisition, morphological preprocessing, key-point feature extraction, sequence formation, and hyperparameter-optimized LSTM training is proposed and evaluated. The system is validated on a multi-class gesture dataset under varied conditions. Experimental outcomes demonstrate a training accuracy of 96.2%, a validation accuracy of 91.5%, and a test accuracy of 87.3%, surpassing traditional static frame-based CNN methods by approximately 13 percentage points. Performance is assessed across precision, recall, F1-score, and real-time inference latency (~44 ms/frame), confirming robustness and practical usability. This research contributes a scalable, cost-effective, and deployable solution that bridges the communication gap for DHH individuals, facilitating their inclusion in educational, healthcare, and everyday social contexts. Keywords: Sign Language Recognition (SLR), LSTM, Deep Learning, Action Recognition, MediaPipe Holistic, Gesture Recognition, Computer Vision, Temporal Modeling, DHH Communication, OpenCV, TensorFlow
- Research Article
- 10.1145/3811026
- Apr 21, 2026
- ACM Transactions on Human-Robot Interaction
- Yongqiang Jiang + 2 more
To investigate the use of different pointing forms in service scenarios, we collected the ShopPoint dataset, a skeleton-based dataset of pointing gestures from customer-shopkeeper interactions in a camera shop scenario. Thirteen participants took part in the data collection, including 3 shopkeepers with real-world customer service experience and 10 customers. We recorded 61 one-to-one role-played interactions. Coders annotated pointing gestures from videos of these interactions, emphasizing pointing arm forms (straight-arm, bent-arm, and hand-only pointing) and hand forms (index-finger and open-hand pointing). This annotation process resulted in 2959 pointing gestures. We conducted statistical analysis on the annotated data. The analysis revealed that bent-arm pointing was used more frequently than other arm forms. Straight-arm pointing was used more for far targets than for close targets, and hand-only pointing was used more for close targets. Shopkeepers used bent-arm pointing more frequently than customers when referring to far targets. To evaluate the recognition of these pointing gestures, we tested several existing Skeleton-based Action Recognition (SAR) methods on the dataset. The highest accuracy, 72.51%, was achieved using transfer learning (i.e., pre-training and fine-tuning). This evaluation indicates that although transfer learning aids performance, recognizing pointing with diverse forms remains challenging.
- Research Article
- 10.7717/peerj-cs.3691
- Apr 21, 2026
- PeerJ Computer Science
- Tianjing Zhang + 1 more
Recognizing subtle and complex dance movements requires models that can capture detailed spatial cues in individual frames while also tracking long-range temporal dynamics across entire performances. We propose a hybrid deep learning framework that first uses a modern convolutional neural network, pre-trained on large-scale image data, to extract multi-scale spatial features from each video frame and then applies a bidirectional recurrent neural network with a temporal attention mechanism to emphasize the most informative motion segments when aggregating features over time. The model is trained and evaluated on a video dataset of traditional Balinese dance movements, where it achieves higher classification accuracy and greater robustness to viewpoint changes and partial occlusion than several existing deep learning and classical machine learning baselines. Ablation experiments show that both the strong spatial feature extractor and the temporal attention module contribute significantly to performance gains. The resulting framework is compact, data-efficient, and easily adaptable to other cultural dance archives and more general tasks involving spatiotemporal human action recognition.
- Research Article
- 10.1371/journal.pone.0347245
- Apr 20, 2026
- PLoS ONE
- Yupaporn Wanna + 2 more
Human action recognition has become increasingly important for applications in security surveillance, healthcare monitoring, and smart environments. However, existing deep learning models typically require substantial computational resources, making deployment on resource-constrained edge devices challenging. To address this limitation, we propose TinyAct, a lightweight framework for real-time human action recognition that combines edge computing with cloud-based processing through knowledge distillation. TinyAct employs a 3D video autoencoder to extract compact spatiotemporal features from video sequences, coupled with classical machine learning classifiers for action prediction. The framework utilizes an AIoT (Artificial Intelligence of Things) architecture where feature extraction occurs on edge devices while classification is performed in the cloud, enabling real-time processing with reduced bandwidth requirements. To enhance performance, we implement knowledge distillation using the ILA-ViT-B/16 transformer as a teacher model to transfer temporal knowledge to our compact student architecture. Our experiments on the Kinetics-400 dataset demonstrate that TinyAct achieves competitive performance while maintaining computational efficiency. Using 16-frame video clips with 1024-dimensional latent features, Random Forest achieved the highest baseline accuracy of 57.00%, followed by SVM (55.00%) and XGBoost (54.00%). The autoencoder-based feature extraction significantly reduces computational overhead compared to end-to-end deep learning approaches while preserving essential spatiotemporal information for accurate action recognition. The knowledge distillation experiments reveal that training configuration critically affects performance, with non-pretrained student models achieving better results (15.11% with SVM) than pretrained ones under teacher supervision. 
This suggests that joint optimization of the encoder and classifier is essential for effective knowledge transfer in resource-constrained settings. TinyAct's modular architecture enables flexible deployment across diverse hardware configurations, supporting both lightweight edge inference and cloud-based training pipelines. The framework demonstrates that effective human action recognition can be achieved without computationally intensive deep networks, making it suitable for smart surveillance systems, IoT applications, and embedded devices where computational resources are limited.
- Research Article
- 10.3390/s26082532
- Apr 20, 2026
- Sensors (Basel, Switzerland)
- Daniël Benjamin Keyter + 1 more
This paper presents a multimodal sensing approach for fine-grained soccer action recognition using synchronized mm-wave FMCW radar and multiview RGB cameras. A TI IWR1443BOOST FMCW radar and three Sony IMX296 global-shutter cameras were used to record seven soccer-related actions in different movement directions in an outdoor environment. Range-Doppler radar processing is applied to extract global mel features and CFAR-localized block representations of mel and radar spectrogram features to capture both coarse and fine micro-Doppler characteristics. Camera features are derived from bounding box, HOG, optical flow, and pose estimations. Classification is performed using logistic regression as the classical model and various deep models. Performance is evaluated using cross-validation. Radar alone achieves moderate performance (0.897 macro-F1 using TCN), successfully identifying coarse motion but showing limited separability for dribbling-based actions. Camera-only models achieve near-perfect scores (≥0.997 macro-F1 using 1D-CNN), with nearly diagonal confusion matrices. The best performance is obtained from a cross-modal transformer with multiple cameras (0.998 macro-F1). These results demonstrate that a camera by itself performs strongly for the action recognition task but also that radar-camera fusion can improve robustness and enhance the discrimination of finer soccer player movements for outdoor analytics and player monitoring applications.
- Research Article
- 10.53623/amms.v2i2.999
- Apr 20, 2026
- Advanced Mechanical and Mechatronic Systems
- Ade Kurniawan + 6 more
Human Activity Recognition (HAR) using smartphone inertial measurement unit (IMU) sensors has emerged as a transformative technology for health monitoring, fitness tracking, and context-aware computing. However, existing HAR research is constrained by limited data availability, short recording durations, and single-limb sensing configurations. This study addresses these limitations through three principal contributions: (1) introduction of a novel open-access multi-limb HAR dataset featuring synchronized 60-second recordings from hand and ankle positions using tri-axial accelerometer, gyroscope, and magnetometer sensors, publicly available via Mendeley Data repository; (2) systematic benchmarking of classical machine learning classifiers including Random Forest, XGBoost, and Linear Support Vector Classifier under realistic multi-sensor fusion conditions; and (3) comprehensive analysis of model robustness across varying windowing configurations. The dataset comprises recordings from six participants performing six daily activities (walking, stair ascent, stair descent, standing, sitting, lying), totaling over 72 minutes of synchronized multi-sensor data. Experimental evaluation demonstrates that Random Forest achieves superior classification accuracy of 96.13%, significantly outperforming XGBoost (85.22%) and LinearSVC (58.54%). The publicly released dataset and benchmarking results provide a valuable resource for the HAR research community, enabling reproducible experimentation and facilitating advancement in multi-limb activity recognition systems.
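The windowing configurations mentioned in the abstract above can be made concrete with a short sketch of fixed-length sliding windows over one recording. The 50 Hz sampling rate, the 2-second window, and the 50% overlap are hypothetical values chosen for illustration; the abstract does not state them. The final lines check the reported duration under the assumption that each participant, activity, and sensor position contributes exactly one 60-second recording.

```python
def sliding_windows(samples, window_size, step):
    """Split a sequence into fixed-length windows, dropping any partial tail."""
    if window_size <= 0 or step <= 0:
        raise ValueError("window_size and step must be positive")
    last_start = len(samples) - window_size
    return [samples[s:s + window_size] for s in range(0, last_start + 1, step)]


# One 60-second recording at an assumed 50 Hz sampling rate.
sample_rate_hz = 50
recording = list(range(60 * sample_rate_hz))  # placeholder sample indices

# 2-second windows with 50% overlap (both values are assumptions).
windows = sliding_windows(recording, window_size=2 * sample_rate_hz,
                          step=sample_rate_hz)
n_windows = len(windows)

# Duration check: 6 participants x 6 activities x 2 positions x 60 s each.
total_minutes = 6 * 6 * 2 * 60 / 60
```

Under that assumption the total comes to 72 minutes, which is consistent with the "over 72 minutes" figure in the abstract. Shorter windows or smaller steps yield more training examples per recording at the cost of less context per window.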
- Research Article
- 10.3389/frvir.2026.1794720
- Apr 20, 2026
- Frontiers in Virtual Reality
- Ke Li + 6 more
Intelligent Virtual Agents (IVAs), which embody an artificial intelligence (AI) in a humanoid representation, have enormous potential for immersive extended reality (XR) environments to enable natural and engaging human-AI interactions. With recent advances in large language models (LLMs) in simulating human-like text responses, interest in anthropomorphic embodied IVAs has grown across XR research and application domains. However, toolkits for authoring and interacting with IVAs in research remain sparse. Therefore, we present Anthropomorphic AI, a flexible and scalable open-source research toolkit for authoring and interacting with embodied IVAs with rich multimodal capabilities, including speech, gaze, gestures, facial expressions, and vision. Our system enables developers to create various embodied anthropomorphic IVAs by customizing behavior through expressive nonverbal cues, selecting and combining different foundation models, speech-to-text (STT) and text-to-speech (TTS) methods, and adapting the system prompt to guide interaction. We also integrate various features such as proximity detection, trajectory-based action recognition, and vision-based multimodal prompting for supporting natural human-IVA interaction in immersive XR. We evaluate the toolkit through four use case demonstrations, a pilot developer evaluation, and a pilot end-user evaluation in immersive VR, showing its capability in generating anthropomorphic IVAs for immersive XR applications.
- Research Article
- 10.64388/irev9i10-1716485
- Apr 20, 2026
- Iconic Research and Engineering Journals
Machine Learning for Sensor-Based Human Activity Recognition
- Research Article
- 10.3390/make8040107
- Apr 18, 2026
- Machine Learning and Knowledge Extraction
- Sakorn Mekruksavanich + 1 more
The rapid growth of the elderly population worldwide demands reliable activity recognition technologies to support independent living and continuous health supervision. However, conventional wearable sensor-based human activity recognition (HAR) techniques often fail to capture the complex temporal behaviour and subtle motion patterns characteristic of the elderly. To address these limitations, this study introduces a hybrid deep residual architecture—CNN-CBAM-BiGRU—that integrates convolutional neural networks (CNNs), the convolutional block attention module (CBAM), and bidirectional gated recurrent units (BiGRUs) to improve activity recognition using inertial measurement unit (IMU) data. In the proposed CNN-CBAM-BiGRU framework, CNN layers automatically derive representative features from raw sensor signals, CBAM applies adaptive channel and spatial attention to highlight informative patterns, and BiGRU captures long-range temporal relationships within activity sequences. The approach was evaluated on three benchmark datasets designed for elderly populations—HAR70+, HARTH, and SisFall—covering daily activities and fall events. The proposed model consistently outperforms existing methods across all datasets, achieving accuracies exceeding 96%, F1-scores above 93%, and a fall detection recall of 93.74%, confirming its robustness and suitability for safety-critical monitoring applications. Class-level evaluation indicates excellent recognition of static postures and consistent performance for dynamic actions. Convergence analysis further confirms efficient learning with limited overfitting across datasets. The proposed framework thus provides a robust and accurate solution for wearable-based elderly activity recognition, with strong potential for deployment in fall detection, health monitoring, and ambient assisted living systems.
- Research Article
- 10.3390/bdcc10040125
- Apr 18, 2026
- Big Data and Cognitive Computing
- Khadija Lasri + 4 more
While Graph Convolutional Networks (GCNs) have revolutionized skeleton-based action recognition, existing methods face a critical efficiency–accuracy dilemma: state-of-the-art approaches achieve high performance through computationally expensive multi-stream fusion (joint, bone, joint motion, and bone motion) and deep architectures, limiting real-world deployment on resource-constrained devices. We propose LST-AGCN (Lightweight Spatial–Temporal Attention Graph Convolutional Network), introducing three technical contributions that address this challenge: (1) Unified Attention Module (UAM), a framework that integrates channel, spatial, and temporal attention through a single compact operation, significantly reducing attention parameters compared to separate attention mechanisms; (2) Depthwise Separable Attention Mechanism (DSAM), a factorization using depthwise separable convolutions that reduces the complexity of attention operations from O(C²) to O(C); and (3) Efficient Topology-Aware Fusion (ETAF), an adaptive Joint-wise Attention strategy that captures fine-grained spatial relationships without quadratic complexity growth. Extensive experiments on NTU RGB+D 60 and NTU RGB+D 120 datasets demonstrate that LST-AGCN achieves strong performance using only joint modality (86.14%/94.0% and 79.5%/82.0% Top-1 accuracy with 99.0% Top-5 on cross-view) while requiring 14.11 M parameters and 19.02 GFLOPs, delivering efficient inference suitable for edge deployment.
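The channel-count scaling behind depthwise separable convolutions, which the abstract above builds on, can be illustrated by counting weights. This sketch covers only the generic building blocks for a 1D convolution with equal input and output channels and no bias; it does not reproduce the DSAM design, and the channel count and kernel size are hypothetical.

```python
def standard_conv1d_params(channels, kernel_size):
    """Every output channel mixes every input channel: k * C * C weights."""
    return kernel_size * channels * channels


def depthwise_conv1d_params(channels, kernel_size):
    """Each channel is filtered independently: k * C weights."""
    return kernel_size * channels


def pointwise_conv_params(channels):
    """A 1x1 convolution that mixes channels: C * C weights."""
    return channels * channels


channels, kernel_size = 64, 9  # hypothetical sizes for illustration
standard = standard_conv1d_params(channels, kernel_size)
depthwise = depthwise_conv1d_params(channels, kernel_size)
separable = depthwise + pointwise_conv_params(channels)
```

The depthwise step alone grows linearly with the number of channels, while a full depthwise separable block still carries a pointwise term that grows quadratically. How the abstract's O(C) figure for attention operations is reached is not stated there, so this count should be read as background rather than as a derivation of that claim.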