Abstract

Action recognition is an important and active research direction in computer vision, where temporal modeling is critical for action representation. Unimodal methods that use only the RGB or the skeleton modality for human action recognition have inherent limitations: RGB video suffers from information redundancy and environmental noise, while skeleton data lacks spatial interaction information. In this paper, we present a novel multimodal learning approach based on the RGB and skeleton modalities for action recognition in RGB-D videos. Specifically, we (1) transfer skeleton knowledge to RGB video for effective video compression, producing an informative action image from the raw RGB video, (2) introduce a temporal cues enhancement module to adequately learn the spatiotemporal representation for action classification, and (3) propose a multi-level multimodal co-learning framework for human action recognition in RGB-D videos. Experimental results on the NTU RGB+D, PKU-MMD, and N-UCLA datasets demonstrate the effectiveness of the proposed multimodal learning method.
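To make the high-level pipeline concrete, the following is a minimal PyTorch sketch of a two-stream co-learning setup: one branch encodes a compressed RGB "action image" and another encodes the skeleton joint sequence, with late fusion for classification. All module names, dimensions, and the fusion scheme here are illustrative assumptions; the paper's actual components (skeleton-guided video compression, the temporal cues enhancement module, and multi-level co-learning) are not reproduced.

import torch
import torch.nn as nn

class RGBBranch(nn.Module):
    """Encodes a compressed 'action image' derived from an RGB clip.
    (Hypothetical encoder; not the paper's architecture.)"""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, feat_dim)

    def forward(self, x):                 # x: (B, 3, H, W)
        return self.fc(self.conv(x).flatten(1))

class SkeletonBranch(nn.Module):
    """Encodes a joint-coordinate sequence with a GRU over time.
    (Illustrative temporal model standing in for the paper's module.)"""
    def __init__(self, num_joints=25, feat_dim=256):
        super().__init__()
        self.gru = nn.GRU(num_joints * 3, feat_dim, batch_first=True)

    def forward(self, x):                 # x: (B, T, J, 3)
        b, t, j, c = x.shape
        out, _ = self.gru(x.reshape(b, t, j * c))
        return out[:, -1]                 # last hidden state as clip feature

class CoLearningModel(nn.Module):
    """Late fusion of the two modality features for classification."""
    def __init__(self, num_classes=60, feat_dim=256):
        super().__init__()
        self.rgb = RGBBranch(feat_dim)
        self.skel = SkeletonBranch(feat_dim=feat_dim)
        self.head = nn.Linear(2 * feat_dim, num_classes)

    def forward(self, action_image, skeleton_seq):
        f = torch.cat([self.rgb(action_image), self.skel(skeleton_seq)], dim=1)
        return self.head(f)

# Smoke test with random tensors (batch of 2, 16 frames, 25 joints;
# 60 classes matches NTU RGB+D's cross-subject label set).
model = CoLearningModel()
logits = model(torch.randn(2, 3, 112, 112), torch.randn(2, 16, 25, 3))
print(logits.shape)  # torch.Size([2, 60])

Late fusion is used here purely because it is the simplest way to combine the two streams; a multi-level co-learning framework, as described in the abstract, would instead exchange information between the branches at several intermediate layers.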
