Abstract

In gesture video, inter-frame differences are often too subtle to be captured by low-level features, and the frames that carry semantic gesture information occupy only a small portion of the whole video. This paper introduces a fast and robust key frame extraction method for gesture video, built on high-level feature representation, which extracts gesture key frames precisely without losing semantic information. First, a gesture video segmentation model based on SSD is designed, which classifies gesture video into semantic scenes and static scenes. Next, a 2D-DWT-based perceptual hash algorithm is studied to extract candidate static key frames. Then, the multi-channel gradient magnitude frequency histogram (HGMF-MC), based on an improved VGG16, is developed as a new image descriptor. Finally, a key frame extraction mechanism based on HGMF-MC is proposed to generate gesture video summaries for the two scenes, respectively. Experiments on the Chinese sign language, Cambridge, ChaLearn, and CVRR-Hands gesture datasets consistently show the superiority of the proposed method: it improves the video compression ratio and outperforms state-of-the-art methods.
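The 2D-DWT-based perceptual hash step mentioned above can be illustrated with a minimal sketch (not the authors' implementation): each frame is reduced to its low-frequency Haar-wavelet approximation, a binary hash is formed by thresholding those coefficients at their median, and a frame becomes a candidate key frame when its Hamming distance to the last kept frame exceeds a threshold. The 64x64 frame size, three decomposition levels, and distance threshold here are illustrative assumptions.

```python
import numpy as np

def haar_approx(frame, levels=3):
    """Low-frequency (LL) Haar-wavelet approximation:
    each level averages non-overlapping 2x2 pixel blocks."""
    a = frame.astype(float)
    for _ in range(levels):
        a = (a[0::2, 0::2] + a[1::2, 0::2]
             + a[0::2, 1::2] + a[1::2, 1::2]) / 4.0
    return a

def dwt_phash(frame, levels=3):
    """64-bit perceptual hash: threshold the 8x8 LL band at its median
    (assumes a 64x64 grayscale frame and levels=3)."""
    ll = haar_approx(frame, levels)
    return (ll > np.median(ll)).ravel()

def hamming(h1, h2):
    """Number of differing hash bits."""
    return int(np.count_nonzero(h1 != h2))

def candidate_key_frames(frames, threshold=10):
    """Keep a frame's index when its hash differs enough
    from the last kept frame (hypothetical selection rule)."""
    keys, last = [], None
    for i, f in enumerate(frames):
        h = dwt_phash(f)
        if last is None or hamming(h, last) > threshold:
            keys.append(i)
            last = h
    return keys
```

For example, two identical frames hash to a Hamming distance of 0 and the duplicate is skipped, while a frame with a clearly different layout exceeds the threshold and is kept as a candidate.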
