Human 3D skeleton-based action recognition has received increasing interest in recent years. Inspired by the strong capabilities of multi-modal models, pioneering attempts employ diverse modalities, i.e., skeleton and language, to construct skeleton-language models and have shown compelling results. Yet, these attempts represent the data as deterministic point estimates, ignoring a key issue: descriptions of similar motions are uncertain and ambiguous, which restricts the comprehension of complex concept hierarchies and weakens cross-modal alignment reliability. To tackle this challenge, this paper proposes a new Uncertain Skeleton-Language Learning Framework (USLLF) that, for the first time, captures the semantic ambiguity across modalities in a probabilistic manner. USLLF models both inter- and intra-modal uncertainties. Specifically, we first integrate language (text) generated by ChatGPT with a generic skeleton-based network to build a deterministic multi-modal baseline, which can be easily implemented with any off-the-shelf skeleton and text encoders. Then, on top of this baseline, we explicitly model the intra-modal (skeleton/language) uncertainties as Gaussian distributions using new uncertainty networks that learn distributional embeddings for each modality. These embeddings are then aligned, and the inter-modal (skeleton-language) uncertainty is formulated using both contrastive and negative log-likelihood objectives to alleviate cross-modal alignment error. Experimental results on the NTU RGB+D, NTU RGB+D 120, and NW-UCLA datasets show that our approach outperforms the baseline and achieves performance comparable to state-of-the-art methods with high inference efficiency. In addition, we provide insightful analyses of how the learned uncertainty reduces the impact of uncertain and ambiguous data on model performance.
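To make the probabilistic alignment described above concrete, the sketch below illustrates one plausible reading of the idea in PyTorch: each modality's deterministic feature is mapped to a Gaussian (mean and log-variance), distributional embeddings are sampled via the reparameterization trick, and a contrastive term is combined with a Gaussian negative log-likelihood term. The encoder features, dimensions, temperature, and loss weighting here are illustrative assumptions, not the paper's exact architecture or hyperparameters.

```python
# Minimal sketch of probabilistic skeleton-language alignment.
# All concrete choices (dimensions, temperature, 0.1 loss weight) are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class UncertaintyHead(nn.Module):
    """Maps a deterministic feature to a Gaussian embedding (mean, log-variance)."""
    def __init__(self, in_dim, emb_dim):
        super().__init__()
        self.mu = nn.Linear(in_dim, emb_dim)
        self.logvar = nn.Linear(in_dim, emb_dim)

    def forward(self, feat):
        return self.mu(feat), self.logvar(feat)

def contrastive_loss(z_s, z_t, temperature=0.07):
    """Symmetric InfoNCE between paired skeleton and text embeddings."""
    z_s = F.normalize(z_s, dim=-1)
    z_t = F.normalize(z_t, dim=-1)
    logits = z_s @ z_t.t() / temperature
    labels = torch.arange(z_s.size(0), device=z_s.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

def gaussian_nll(mu_s, logvar_s, mu_t):
    """Negative log-likelihood of the text mean under the skeleton Gaussian."""
    var_s = logvar_s.exp()
    return 0.5 * ((mu_t - mu_s) ** 2 / var_s + logvar_s).sum(-1).mean()

# Toy usage: random tensors stand in for off-the-shelf encoder outputs.
B, D_in, D_emb = 8, 256, 128
skel_feat, text_feat = torch.randn(B, D_in), torch.randn(B, D_in)
skel_head, text_head = UncertaintyHead(D_in, D_emb), UncertaintyHead(D_in, D_emb)

mu_s, logvar_s = skel_head(skel_feat)
mu_t, logvar_t = text_head(text_feat)

# Reparameterization: sample distributional embeddings for the contrastive term.
z_s = mu_s + torch.randn_like(mu_s) * (0.5 * logvar_s).exp()
z_t = mu_t + torch.randn_like(mu_t) * (0.5 * logvar_t).exp()

loss = contrastive_loss(z_s, z_t) + 0.1 * gaussian_nll(mu_s, logvar_s, mu_t)
loss.backward()
```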