Human action recognition aims to classify a given video according to which type of action it contains. Disturbance brought by clutter background and unrelated motions makes the task challenging for video frame-based methods. To solve this problem, this paper takes advantage of pose estimation to enhance the performances of video frame features. First, we present a pose feature called dynamic pose image (DPI), which describes human action as the aggregation of a sequence of joint estimation maps. Different from traditional pose features using sole joints, DPI suffers less from disturbance and provides richer information about human body shape and movements. Second, we present attention-based dynamic texture images (att-DTIs) as pose-guided video frame feature. Specifically, a video is treated as a space-time volume, and DTIs are obtained by observing the volume from different views. To alleviate the effect of disturbance on DTIs, we accumulate joint estimation maps as attention map, and extend DTIs to attention-based DTIs (att-DTIs). Finally, we fuse DPI and att-DTIs with multi-stream deep neural networks and late fusion scheme for action recognition. Experiments on NTU RGB+D, UTD-MHAD, and Penn-Action datasets show the effectiveness of DPI and att-DTIs, as well as the complementary property between them.
Read full abstract