Human action recognition remains challenging due to the inherent complexity arising from the combination of diverse granularity of semantics, ranging from the local motion of body joints to high-level relationships across multiple people. To learn this multi-level characteristic of human action in an unsupervised manner, we propose a novel pretraining strategy along with a transformer-based model architecture named GL-Transformer++. Prior methods in unsupervised action recognition or unsupervised group activity recognition (GAR) have shown limitations, often focusing solely on capturing a partial scope of the action, such as the local movements of each individual or the broader context of the overall motion. To tackle this problem, we introduce a novel pretraining strategy named multi-interval pose displacement prediction (MPDP) that enables the model to learn the diverse extents of the action. In the architectural aspect, we incorporate the global and local attention (GLA) mechanism within the transformer blocks to learn local dynamics between joints, global context of each individual, as well as high-level interpersonal relationships in both spatial and temporal manner. In fact, the proposed method is a unified approach that demonstrates efficacy in both action recognition and GAR. Particularly, our method presents a new and strong baseline, surpassing the current SOTA GAR method by significant margins: 29.6% in Volleyball and 60.3% and 59.9% on the xsub and xset settings of the Mutual NTU dataset, respectively.
Read full abstract