The human skeleton, as a compact representation of action, has attracted considerable research attention in recent years. However, skeletal data are too sparse to fully characterize fine-grained human motions, especially hand and finger motions with subtle local movements. Moreover, because the skeleton carries no information about interacted objects, it is difficult to accurately identify human–object interaction actions from skeletal data alone. As a result, many action recognition approaches that rely purely on skeletal data have hit a bottleneck on such actions. In this paper, we propose an Informed Patch Enhanced HyperGraph Convolutional Network that jointly employs the human pose skeleton and informed visual patches for multi-modal feature learning. Specifically, we extract five informed visual patches around the head, left hand, right hand, left foot, and right foot joints and use them as complementary visual graph vertices. These patches often carry rich action-related semantics, such as facial expressions, hand gestures, and objects manipulated with the hands or feet, which compensate for the deficiencies of skeletal data. This hybrid scheme boosts performance while keeping the computation and memory overhead low, since only five extra vertices are appended to the original graph. Evaluation on two widely used large-scale datasets for skeleton-based action recognition demonstrates the effectiveness of the proposed method against state-of-the-art methods, with significant accuracy improvements under the X-Sub protocol of the NTU RGB+D 120 dataset.
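The core mechanism described above, cropping five visual patches around selected joints, encoding them, and appending them to the skeleton graph as extra vertices, can be illustrated with a minimal sketch. The PyTorch code below is a hypothetical illustration, not the authors' implementation: the joint indices in `ANCHORS`, the patch size, the tiny CNN encoder, and the single normalized-adjacency mixing step standing in for the hypergraph convolution are all illustrative assumptions.

```python
# Minimal sketch (illustrative, not the authors' code): append five encoded
# visual patches as extra vertices of the skeleton graph.
import torch
import torch.nn as nn

PATCH = 32                       # assumed crop size in pixels
ANCHORS = [3, 7, 11, 15, 19]     # assumed joint indices: head, hands, feet

def crop_patches(frame, joints_2d):
    """frame: (3, H, W); joints_2d: (V, 2) pixel coords -> (5, 3, PATCH, PATCH)."""
    _, H, W = frame.shape
    patches = []
    for j in ANCHORS:
        x, y = joints_2d[j].long()
        x0 = int(x.clamp(0, W - PATCH))   # keep the crop inside the frame
        y0 = int(y.clamp(0, H - PATCH))
        patches.append(frame[:, y0:y0 + PATCH, x0:x0 + PATCH])
    return torch.stack(patches)

class PatchEnhancedGraph(nn.Module):
    """Encode patches and append them as five extra vertices to the joint graph."""
    def __init__(self, n_joints=25, feat_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(     # tiny CNN patch encoder (assumed)
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.joint_proj = nn.Linear(3, feat_dim)  # lift 3D joints to feat_dim
        self.n_joints = n_joints

    def forward(self, joints_3d, frame, joints_2d, adj):
        # joints_3d: (V, 3); adj: (V, V) skeleton adjacency
        patch_feat = self.encoder(crop_patches(frame, joints_2d))  # (5, D)
        joint_feat = self.joint_proj(joints_3d)                    # (V, D)
        x = torch.cat([joint_feat, patch_feat], dim=0)             # (V + 5, D)
        # Extend the adjacency: link each patch vertex to its anchor joint.
        Vp = self.n_joints + len(ANCHORS)
        A = torch.zeros(Vp, Vp)
        A[:self.n_joints, :self.n_joints] = adj
        for k, j in enumerate(ANCHORS):
            A[self.n_joints + k, j] = A[j, self.n_joints + k] = 1.0
        A = A + torch.eye(Vp)                        # self-loops
        A_hat = A / A.sum(1, keepdim=True).clamp(min=1)  # row normalization
        return A_hat @ x            # one graph-conv-style feature mixing step
```

Because the visual evidence enters the model as only five additional vertices rather than a full image stream, the graph grows from V to V + 5 vertices, which is consistent with the paper's claim of low extra computation and memory cost.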