Human action recognition, a crucial task for understanding complex human behavior from sensor data, faces significant challenges. This study addresses two of them: the complex interplay between joint positions and their velocities in action data, and the need to improve model generalizability when labeled data are limited. To this end, we present the dual-stream cross-fusion and class-aware memory bank (DSCF-CAMB) framework, a novel semi-supervised approach. Its dual-stream cross-fusion (DSCF) mechanism independently encodes the joint and velocity data streams and, through cross-fusion, enriches the action representation by exploiting correlations between joint positions and their velocities. To address the challenges posed by unlabeled data, we further develop a contrastive-loss-based class-aware memory bank (CAMB) that selects high-confidence positive and negative sample embeddings and filters out samples likely to contribute noise. This strategy significantly improves the robustness and accuracy of the framework in action recognition. Experimental results show that our method outperforms state-of-the-art methods by an average of 2.7% and 1.5% in top-1 accuracy on the widely used NTU RGB+D and NW-UCLA datasets, respectively, demonstrating the superiority of our method in human action recognition.
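The dual-stream idea described above can be sketched as follows. This is a minimal, hypothetical illustration, not the paper's implementation: velocities are taken as frame-to-frame joint differences, the learned encoders are replaced by a `tanh` stand-in, and cross-fusion is approximated by a single-head cross-attention in each direction.

```python
import numpy as np

def velocity_stream(joints):
    """Frame-wise velocities v_t = x_t - x_{t-1} (zero for the first frame)."""
    vel = np.zeros_like(joints)
    vel[1:] = joints[1:] - joints[:-1]
    return vel

def cross_fusion(q_feats, kv_feats):
    """Single-head cross-attention: one stream's features query the other's."""
    d = q_feats.shape[-1]
    scores = q_feats @ kv_feats.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)   # numerically stable softmax
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ kv_feats

# Toy skeleton sequence: 8 frames, 25 joints x 3D coordinates, flattened.
rng = np.random.default_rng(0)
joints = rng.normal(size=(8, 25 * 3))
vel = velocity_stream(joints)

joint_enc = np.tanh(joints)   # stand-in for the learned joint encoder
vel_enc = np.tanh(vel)        # stand-in for the learned velocity encoder

# Fuse both directions: joints attend to velocities and vice versa.
fused = np.concatenate([cross_fusion(joint_enc, vel_enc),
                        cross_fusion(vel_enc, joint_enc)], axis=-1)
```

The bidirectional fusion lets each stream borrow context from the other, which is one plausible way to model the joint-velocity correlations the abstract mentions.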
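The class-aware memory bank can likewise be sketched in a few lines. The structure below is an assumption based only on the abstract: per-class queues that admit an embedding only when its pseudo-label confidence clears a threshold, so that contrastive positives and negatives are drawn from high-confidence samples while likely-noisy ones are filtered out. The class names, queue size, and threshold are illustrative choices, not values from the paper.

```python
import numpy as np
from collections import deque

class ClassAwareMemoryBank:
    """Hypothetical sketch: per-class queues of high-confidence embeddings."""

    def __init__(self, num_classes, size=128, conf_threshold=0.9):
        self.banks = {c: deque(maxlen=size) for c in range(num_classes)}
        self.conf_threshold = conf_threshold

    def update(self, embedding, pseudo_label, confidence):
        """Admit an embedding only if its pseudo-label confidence passes the gate."""
        if confidence >= self.conf_threshold:
            self.banks[pseudo_label].append(np.asarray(embedding))
            return True
        return False  # rejected as a potential noise contributor

    def positives(self, label):
        """Stored embeddings sharing the query's (pseudo-)label."""
        return list(self.banks[label])

    def negatives(self, label):
        """Stored embeddings from all other classes."""
        return [e for c, bank in self.banks.items() if c != label for e in bank]

# Usage: confident samples enter the bank, low-confidence ones are dropped.
bank = ClassAwareMemoryBank(num_classes=3, conf_threshold=0.9)
accepted = bank.update(np.ones(4), pseudo_label=0, confidence=0.95)
rejected = bank.update(np.ones(4), pseudo_label=1, confidence=0.50)
bank.update(np.full(4, 2.0), pseudo_label=1, confidence=0.92)
```

A contrastive loss would then pull a query embedding toward `positives(label)` and push it away from `negatives(label)`; gating the bank on confidence is what keeps noisy pseudo-labels from corrupting those sets.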