Skeleton-based action recognition relies on capturing connections between joints to extract action-specific features. Current approaches that use temporal convolution for inter-frame relationships capture connections only within the same joint across frames, neglecting potential connections between different joints in different frames. However, modeling all joints across all frames encodes a plethora of redundant features that impede action recognition. In this study, a novel and effective skeleton window attention mechanism (SWAM) is proposed, consisting of a window attention module (WAM) and spatial–temporal attention regularization (STAR). WAM directly captures short-range features within isolated windows: it consolidates all joints from adjacent frames into a window and performs full attention over them. STAR strengthens the model's focus on structural action features, compensating for the vanilla attention mechanism's oversight of them. Furthermore, this study introduces the cross-window mechanism (CWM), comprising three components: cross-window fusion (CWF), cross-window shift (CWS), and cross-window aggregation (CWA). Together these components break the limitations imposed by window boundaries, enabling efficient interactions between windows and the capture of long-range dependencies. Integrating SWAM and CWM, we introduce the skeleton window attention network (SWA-Net), which embodies a novel incomplete decoupling modeling approach: it extracts action features from different joints in different frames without capturing excessive redundant features. Compared with state-of-the-art methods, it achieves better results on the NTU-RGB+D, NTU-RGB+D 120, and Kinetics-Skeleton 400 datasets.
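The core window attention idea can be sketched roughly as follows; this is a toy NumPy illustration, not the paper's implementation. The tensor layout, window size, and the absence of learned query/key/value projections, multiple heads, and the STAR regularizer are all assumptions made for brevity:

```python
import numpy as np

def window_attention(x, window_size=2):
    """Toy sketch of window attention over skeleton joints.

    x: array of shape (T, V, C) -- T frames, V joints, C channels.
    All joints from `window_size` adjacent frames are flattened into
    one token set, and full (unprojected) self-attention is computed
    among them, so joints in *different* frames of the same window
    can attend to each other.
    """
    T, V, C = x.shape
    assert T % window_size == 0, "T must be divisible by the window size"
    out = np.empty_like(x)
    for start in range(0, T, window_size):
        # Consolidate all joints of the window into (window_size * V) tokens.
        win = x[start:start + window_size].reshape(window_size * V, C)
        scores = win @ win.T / np.sqrt(C)              # pairwise similarities
        scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
        attn = np.exp(scores)
        attn /= attn.sum(axis=-1, keepdims=True)       # row-wise softmax
        out[start:start + window_size] = (attn @ win).reshape(window_size, V, C)
    return out
```

Because attention here is confined to each window, a cross-window step (as CWM provides in the paper) would be needed on top of this to propagate information between windows.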