Designing Compact Convolutional Filters for Lightweight Human Pose Estimation

Shili Niu, Wenchuan Zhang, Fei Long, Wu Zeng, Shihua Feng, Weihua Ou, Jianping Gou, Chi-Hua Chen

doi:10.1155/2021/1333250

Abstract

Existing methods for human pose estimation usually rely on large intermediate tensors, leading to a high computational load that is prohibitive for resource-limited devices. To solve this problem, we propose a low-computational-cost pose estimation network, MobilePoseNet, which consists of an encoder, a decoder, and a parallel non-maximum suppression (NMS) operation. Specifically, we design a lightweight upsampling block to replace transposed convolution in the decoder and use a lightweight network as the downsampling part. Then, we choose the high-resolution features as the input for upsampling to reduce the number of model parameters. Finally, we propose a parallel OKS-NMS, which significantly outperforms the conventional NMS in terms of accuracy and speed. Experimental results on the benchmark datasets show that MobilePoseNet obtains results almost comparable to state-of-the-art methods with a low computational load. Compared to SimpleBaseline, MobilePoseNet has only 4% of the parameters while reaching 98% of the estimation accuracy.
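To make the OKS-NMS step concrete, the following is a minimal sketch of conventional (sequential, greedy) NMS based on object keypoint similarity (OKS), not the parallel variant proposed in the paper, whose details are not given here. The function names, the averaging of the two pose areas, the threshold, and the per-keypoint constants are assumptions for illustration; keypoint visibility flags are ignored for brevity.

```python
import math

def oks(pose_a, pose_b, area, sigmas):
    """COCO-style object keypoint similarity between two poses.

    pose_a, pose_b: lists of (x, y) keypoints in the same order.
    area: scale term (object area in pixels).
    sigmas: per-keypoint falloff constants (illustrative values only).
    """
    total = 0.0
    for (xa, ya), (xb, yb), sigma in zip(pose_a, pose_b, sigmas):
        d2 = (xa - xb) ** 2 + (ya - yb) ** 2   # squared keypoint distance
        k2 = (2.0 * sigma) ** 2                # squared per-keypoint constant
        total += math.exp(-d2 / (2.0 * area * k2))
    return total / len(sigmas)

def oks_nms(poses, scores, areas, sigmas, threshold=0.9):
    """Greedy NMS: keep the highest-scoring pose, drop near-duplicates."""
    order = sorted(range(len(poses)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Assumption: use the mean area of the two poses as the scale term.
        duplicate = any(
            oks(poses[i], poses[j], (areas[i] + areas[j]) / 2.0, sigmas) > threshold
            for j in keep
        )
        if not duplicate:
            keep.append(i)
    return keep

# Hypothetical two-keypoint poses: the second duplicates the first.
poses = [[(10, 10), (20, 20)], [(10, 10), (20, 20)], [(200, 200), (220, 220)]]
scores = [0.9, 0.8, 0.7]
areas = [400.0, 400.0, 400.0]
sigmas = [0.05, 0.05]
kept = oks_nms(poses, scores, areas, sigmas)
```

Each candidate is compared against every pose already kept, so the cost of this sequential form grows quadratically with the number of candidates.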

Highlights

Human pose estimation is also called human key point detection
In this paper, we propose a lightweight human pose estimation network for mobile and resource-constrained environments by designing compact convolutional filters
The contributions of the proposed method are summarized as follows: (i) We design a lightweight upsampling block that integrates separable transposed convolution and channel-based attention. This is achieved by extensively examining the upsampling modules in existing state-of-the-art deep convolutional networks. (ii) We reduce the number of upsampling operations and use lightweight upsampling blocks to achieve a lightweight pose estimation network
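The parameter saving from a separable transposed convolution can be illustrated with simple arithmetic. The sketch below compares a standard transposed convolution with a depthwise transposed convolution followed by a 1x1 pointwise convolution. The channel counts and kernel size are illustrative assumptions, not the paper's configuration, and the channel-attention part of the block is not counted.

```python
def transposed_conv_params(c_in, c_out, k, bias=False):
    """Weights of a standard k x k transposed convolution."""
    return c_in * c_out * k * k + (c_out if bias else 0)

def separable_transposed_conv_params(c_in, c_out, k, bias=False):
    """Depthwise k x k transposed conv followed by a 1x1 pointwise conv."""
    depthwise = c_in * k * k + (c_in if bias else 0)   # one k x k filter per channel
    pointwise = c_in * c_out + (c_out if bias else 0)  # 1x1 channel mixing
    return depthwise + pointwise

# Hypothetical layer: 256 input channels, 256 output channels, 4x4 kernel.
standard = transposed_conv_params(256, 256, 4)
separable = separable_transposed_conv_params(256, 256, 4)
ratio = separable / standard
print(standard, separable, round(ratio, 4))
```

For these assumed sizes the separable form needs 69,632 weights against 1,048,576 for the standard layer, roughly 6.6% of the original.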

Summary

Introduction

Human pose estimation is also called human key point detection. Its main task is to detect the key points of the human body (eyes, nose, shoulders, elbows, etc.) in a given RGB image. With the rapid development of neural networks, human pose estimation based on deep neural networks [4,5,6,7,8,9] has achieved high accuracy. However, these works have focused only on improving accuracy through complex and computationally expensive models, while largely ignoring the cost of model inference. Information security is a growing concern, and deploying applications directly on edge devices helps protect personal information; this in turn places strict requirements on the computational cost and complexity of human pose estimation models.

Methods

Results

Conclusion