Skeleton-based human action recognition has become one of the most important applications in multimedia IoT systems. However, training a deep model with a large number of parameters requires extensive computational resources, including high-performance computing units and large memory, which seriously limits its effectiveness and efficiency in edge-intelligence multimedia IoT applications. In this paper, we propose a knowledge-distillation-based lightweight deep model for skeleton-based human action recognition that targets edge multimedia IoT applications. It achieves competitive recognition accuracy for the combination of an AI model and edge surveillance equipment. On the one hand, to achieve desirable accuracy, we propose a deep pose-transition image representation method based on a two-stream spatial–temporal architecture, which mines the hidden features of color texture images in the spatial and temporal domains and fuses them for comprehensive discrimination before final classification. On the other hand, to improve knowledge transfer to the student model on the edge device, we use Tucker decomposition to weaken the teacher model during the knowledge transfer learning process. Finally, we conducted extensive experiments to validate the effectiveness of the proposed approach. The experimental results demonstrate that our proposal achieves deep model miniaturization that meets the requirements of edge multimedia IoT systems while maintaining competitive performance.
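
As a rough illustration of the teacher-to-student knowledge transfer mentioned above, the following minimal sketch computes a distillation loss for a single sample. The abstract does not state the loss that is actually used, so this assumes the commonly used temperature-softened formulation (KL divergence on softened outputs plus cross-entropy on the true label). The function names `softmax` and `distillation_loss` and the parameters `temperature` and `alpha` are hypothetical and chosen only for this example.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=4.0, alpha=0.5):
    """Assumed loss: alpha * soft-target term + (1 - alpha) * hard-label term.

    This is an illustrative formulation, not necessarily the paper's.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # KL(teacher || student) on the softened distributions, scaled by T^2
    # so its magnitude stays comparable to the hard-label term.
    kl = sum(pt * math.log(pt / ps)
             for pt, ps in zip(p_teacher, p_student) if pt > 0)
    soft = (temperature ** 2) * kl
    # Standard cross-entropy of the student against the ground-truth class.
    hard = -math.log(softmax(student_logits)[label])
    return alpha * soft + (1.0 - alpha) * hard


# Made-up logits for a three-class action recognition toy case.
teacher = [4.0, 1.0, 0.0]
close_student = [3.5, 1.0, 0.2]
far_student = [0.0, 1.0, 4.0]
print(distillation_loss(close_student, teacher, label=0))
print(distillation_loss(far_student, teacher, label=0))
```

In this sketch, a student whose outputs resemble the teacher's receives a smaller loss than one whose outputs disagree. How weakening the teacher with Tucker decomposition changes the teacher outputs is outside the scope of this example.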