Abstract

Action recognition is an important and active research direction in computer vision, where temporal modeling is critical for action representation. Unimodal methods that use only the RGB or the skeleton modality for human action recognition have inherent limitations: RGB video suffers from information redundancy and environmental noise, while skeleton data lacks spatial interaction information. In this paper, we present a novel multimodal learning approach based on the RGB and skeleton modalities for action recognition in RGB-D videos. Specifically, we (1) transfer skeleton knowledge to RGB video for effective video compression, producing an informative action image from the raw RGB video, (2) introduce a temporal cues enhancement module to adequately learn the spatiotemporal representation for action classification, and (3) propose a multi-level multimodal co-learning framework for human action recognition in RGB-D videos. Experimental results on the NTU RGB+D, PKU-MMD, and N-UCLA datasets demonstrate the effectiveness of the proposed multimodal learning method.
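To make the high-level pipeline concrete, the following is a minimal PyTorch sketch of a two-stream co-learning setup: one branch encodes a compressed RGB "action image" and another encodes the skeleton joint sequence, with late fusion for classification. All module names, dimensions, and the fusion scheme here are illustrative assumptions; the paper's actual components (skeleton-guided video compression, the temporal cues enhancement module, and multi-level co-learning) are not reproduced.

import torch
import torch.nn as nn

class RGBBranch(nn.Module):
    """Encodes a compressed 'action image' derived from an RGB clip.
    (Hypothetical encoder; not the paper's architecture.)"""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, feat_dim)

    def forward(self, x):                 # x: (B, 3, H, W)
        return self.fc(self.conv(x).flatten(1))

class SkeletonBranch(nn.Module):
    """Encodes a joint-coordinate sequence with a GRU over time.
    (Illustrative temporal model standing in for the paper's module.)"""
    def __init__(self, num_joints=25, feat_dim=256):
        super().__init__()
        self.gru = nn.GRU(num_joints * 3, feat_dim, batch_first=True)

    def forward(self, x):                 # x: (B, T, J, 3)
        b, t, j, c = x.shape
        out, _ = self.gru(x.reshape(b, t, j * c))
        return out[:, -1]                 # last hidden state as clip feature

class CoLearningModel(nn.Module):
    """Late fusion of the two modality features for classification."""
    def __init__(self, num_classes=60, feat_dim=256):
        super().__init__()
        self.rgb = RGBBranch(feat_dim)
        self.skel = SkeletonBranch(feat_dim=feat_dim)
        self.head = nn.Linear(2 * feat_dim, num_classes)

    def forward(self, action_image, skeleton_seq):
        f = torch.cat([self.rgb(action_image), self.skel(skeleton_seq)], dim=1)
        return self.head(f)

# Smoke test with random tensors (batch of 2, 16 frames, 25 joints;
# 60 classes matches NTU RGB+D's cross-subject label set).
model = CoLearningModel()
logits = model(torch.randn(2, 3, 112, 112), torch.randn(2, 16, 25, 3))
print(logits.shape)  # torch.Size([2, 60])

Late fusion is used here purely because it is the simplest way to combine the two streams; a multi-level co-learning framework, as described in the abstract, would instead exchange information between the branches at several intermediate layers.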
