MEMe: A Mutually Enhanced Modeling Method for Efficient and Effective Human Pose Estimation.

Jie Li, Jianlin Zhang, Zhixing Wang, Hu Yang, Bo Qi

doi:10.3390/s22020632

Abstract

This paper presents a mutually enhanced modeling method (MEMe) for human pose estimation that improves the performance of lightweight models while keeping their complexity low. Traditional models reach higher accuracy by greatly expanding model scale, which makes them hard to deploy. Lightweight models, by contrast, lag well behind them in performance, and a way to close this gap is urgently needed. We therefore propose MEMe to reconstruct a lightweight baseline model, EffBase, intuitively transferred from EfficientDet, into the efficient and effective pose (EEffPose) net, which contains three mutually enhanced modules: the Enhanced EffNet (EEffNet) backbone, the total fusion neck (TFNeck), and the final attention head (FAHead). Extensive experiments on the COCO and MPII benchmarks show that our MEMe-based models reach state-of-the-art performance with limited parameters. Specifically, under the same conditions, our EEffPose-P0 with a 256 × 192 input uses only 8.98 M parameters to achieve 75.4 AP on the COCO val set, outperforming HRNet-W48 with only 14% of its parameters.

Highlights

Since 2016, deep learning-based methods [1,2] have become a prime focus of research in human pose estimation. Traditionally, to overcome the challenges of scale variance and keypoint occlusion, various classic large models have been proposed, such as stacked hourglass [6], CPN [7], SimpleBaseline [8], and HRNet [9]
Mean average precision based on object keypoint similarity (OKS) is used as the evaluation metric on COCO, where OKS uses the Euclidean distance between the predicted keypoints and the ground truth to evaluate the similarity of keypoint pairs. The head-normalized probability of correct keypoint (PCKh) is the evaluation metric on MPII, which checks whether each predicted keypoint lies within a neighborhood of its ground-truth location
We use our EEffPose-P0 for the ablation study, which is conducted on the COCO val set with an input size of 256 × 192
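
The two evaluation metrics named above can be illustrated with a minimal sketch. It assumes the commonly used COCO form of OKS, exp(-d² / (2 s² k²)) averaged over labeled keypoints, and a PCKh threshold of 0.5 times the head size. The function names, the `visible` flags, the constants in `k`, and the toy coordinates are assumptions made for illustration and are not taken from the paper. The averaging of precision over OKS thresholds that yields AP is not shown.

```python
import math


def oks(pred, gt, visible, area, k):
    """Object keypoint similarity for one predicted / ground-truth pose pair.

    pred, gt : lists of (x, y) keypoints in the same order
    visible  : list of bools, True where the ground-truth keypoint is labeled
    area     : object scale squared (s**2), e.g. the person's segment area
    k        : per-keypoint falloff constants (hypothetical values here)
    """
    total, count = 0.0, 0
    for (px, py), (gx, gy), v, ki in zip(pred, gt, visible, k):
        if not v:
            continue  # unlabeled keypoints do not contribute
        d2 = (px - gx) ** 2 + (py - gy) ** 2  # squared Euclidean distance
        total += math.exp(-d2 / (2.0 * area * ki ** 2))
        count += 1
    return total / count if count else 0.0


def pckh(pred, gt, visible, head_size, alpha=0.5):
    """Fraction of labeled keypoints within alpha * head_size of the ground truth."""
    correct, count = 0, 0
    for (px, py), (gx, gy), v in zip(pred, gt, visible):
        if not v:
            continue
        if math.hypot(px - gx, py - gy) <= alpha * head_size:
            correct += 1
        count += 1
    return correct / count if count else 0.0


# Toy example with two keypoints (coordinates are made up).
gt_pose = [(0.0, 0.0), (10.0, 0.0)]
pred_pose = [(0.0, 0.0), (10.0, 20.0)]
labeled = [True, True]

print(oks(gt_pose, gt_pose, labeled, area=100.0, k=[0.1, 0.1]))  # identical poses give 1.0
print(pckh(pred_pose, gt_pose, labeled, head_size=10.0))         # one of two within range: 0.5
```

Both functions skip unlabeled keypoints, so a pose is scored only on the joints that were actually annotated.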

Summary

Introduction

Since 2016, deep learning-based methods [1,2] have become a prime focus of research in human pose estimation. Traditionally, to overcome the challenges of scale variance and keypoint occlusion, various classic large models have been proposed, such as stacked hourglass [6], CPN [7], SimpleBaseline [8], and HRNet [9]. Stacked hourglass consists of multiple stacked hourglass-shaped modules with intermediate supervision; it was the first multi-scale representation network architecture in human pose estimation, but it is complex and inefficient. To address this, CPN cascades only two pyramid nets with ResNet as the backbone, a global net and a refine net, to provide better auxiliary supervision. HRNet is designed to maintain high-resolution representations through parallel multi-scale branches; it is more efficient and effective than its predecessors, but it still has many redundant parameters and high complexity, which makes it unsuitable for real deployment.

Methods

Results

Conclusion