Abstract

Skeleton-based methods have made remarkable strides in human action recognition (HAR). However, the performance of existing unimodal approaches remains limited by the lack of diverse visual features in skeleton data. Concretely, because skeleton data lack interaction information between individuals and objects, skeleton-based methods tend to confuse similar actions. Moreover, the view invariance of unimodal models is also limited. In this work, we propose a skeleton-guided multimodal data fusion methodology that transforms the depth, RGB, and optical flow modalities into human-centric images (HCI) based on keypoint sequences. Building on this foundation, we introduce a human-centric multimodal fusion network (HCMFN), which comprehensively extracts the action patterns of the different modalities. Our model significantly enhances the performance of skeleton-based techniques while maintaining fast inference. Extensive experiments on two large-scale multimodal datasets, NTU RGB+D and NTU RGB+D 120, validate the capacity of HCMFN to bolster the robustness of skeleton-based methods on two challenging HAR tasks: (1) discriminating between actions with subtle inter-class differences, and (2) recognizing actions from varying viewpoints. Compared with state-of-the-art multimodal methods, HCMFN achieves competitive results.
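
To make the idea of skeleton-guided human-centric images concrete, the sketch below shows one plausible way to crop a frame of any modality (RGB, depth, or optical flow) around the bounding box of the 2D keypoints. This is a minimal illustration under our own assumptions (function name, margin parameter, and 25-joint layout are hypothetical), not the authors' exact pipeline.

```python
# Illustrative sketch (not the paper's exact method): crop a human-centric
# image (HCI) from one modality frame using the 2D skeleton keypoints.
import numpy as np

def crop_human_centric(frame: np.ndarray,
                       keypoints: np.ndarray,
                       margin: float = 0.1) -> np.ndarray:
    """frame: (H, W, C) array of one modality; keypoints: (J, 2) array of
    (x, y) joint coordinates in pixels; margin: fractional padding."""
    h, w = frame.shape[:2]
    x_min, y_min = keypoints.min(axis=0)
    x_max, y_max = keypoints.max(axis=0)
    # Pad the skeleton bounding box so nearby context (e.g. a held object)
    # is retained, then clamp to the frame borders.
    pad_x = margin * (x_max - x_min)
    pad_y = margin * (y_max - y_min)
    x0 = int(max(0, x_min - pad_x))
    y0 = int(max(0, y_min - pad_y))
    x1 = int(min(w, x_max + pad_x))
    y1 = int(min(h, y_max + pad_y))
    return frame[y0:y1, x0:x1]

# Example: a dummy 1080p RGB frame and 25 synthetic joints (NTU-style skeleton).
rgb_frame = np.zeros((1080, 1920, 3), dtype=np.uint8)
joints = np.random.uniform([800, 300], [1100, 900], size=(25, 2))
hci = crop_human_centric(rgb_frame, joints)
print(hci.shape)
```

The same cropping could be applied per frame to each modality before feeding the resulting human-centric images into a multi-stream fusion network; the actual HCI construction and fusion details are those described in the paper.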
