Abstract
Human action recognition can benefit from multimodal information when addressing classification under complex conditions. However, existing works either use score fusion or simple feature-integration schemes to combine multiple heterogeneous modalities, which fails to effectively exploit complementary multimodal information. In this paper, we propose a Cross-Scale Cascade Multimodal Fusion Transformer (CSCMFT) that performs interaction and fusion among modalities across multi-scale features, thereby obtaining a multimodal complementary representation for RGB-D-based human action recognition. The Cross-Modal Cross-Scale Mixer (CCM) is the basic component of CSCMFT; it captures cross-modal relations and propagates the fused information across scales. Furthermore, CSCMFT still achieves significant improvements when applied to different multimodal combinations, indicating its generality and scalability. Experimental results show that CSCMFT fully exploits the complementary semantic information between RGB frames and depth maps and outperforms state-of-the-art RGB-D-based methods on the NTU RGB+D 60 & 120 and PKU-MMD datasets.
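To make the fusion idea concrete, the following is a minimal sketch, not the authors' released implementation, of a cross-modal mixer built from bidirectional cross-attention, with a residual path for cascading fused tokens to the next scale. The class and parameter names (CrossModalMixer, d_model, n_heads, fused_prev) are illustrative assumptions, since the abstract does not specify architectural details.

```python
# A hedged sketch of cross-modal cross-scale mixing; all names and
# dimensions are assumptions, not the paper's specification.
import torch
import torch.nn as nn

class CrossModalMixer(nn.Module):
    """Fuse RGB and depth tokens with bidirectional cross-attention,
    then pass the fused tokens on as a cascade input to the next scale."""
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.rgb_to_depth = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.depth_to_rgb = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.proj = nn.Linear(2 * d_model, d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, rgb, depth, fused_prev=None):
        # rgb, depth: (batch, tokens, d_model) features at one scale.
        if fused_prev is not None:
            # Cascade: inject the fused tokens propagated from the previous scale.
            rgb = rgb + fused_prev
            depth = depth + fused_prev
        # Each modality queries the other (keys/values from the opposite stream).
        rgb_out, _ = self.depth_to_rgb(rgb, depth, depth)
        depth_out, _ = self.rgb_to_depth(depth, rgb, rgb)
        fused = self.proj(torch.cat([rgb_out, depth_out], dim=-1))
        return self.norm(fused)

# Usage: mix features at one scale and feed the result to the next scale's mixer.
mixer = CrossModalMixer()
rgb = torch.randn(2, 49, 256)    # e.g. 7x7 spatial tokens per frame
depth = torch.randn(2, 49, 256)
fused = mixer(rgb, depth)        # (2, 49, 256), cascaded to the next scale
```

In this reading, stacking one such mixer per feature scale and threading fused_prev through the stack yields the cross-scale cascade the abstract describes.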