Abstract

Current human pose estimation methods mainly rely on designing efficient Convolutional Neural Network (CNN) architectures. These architectures typically consist of high-to-low resolution sub-networks that learn semantic information, followed by low-to-high resolution sub-networks that recover the resolution needed to locate keypoints. Low-level features have high resolution but weak semantics, while high-level features carry rich semantics but lack high-resolution detail, so fusing features from different levels is important for final performance. However, most existing models fuse features by simply concatenating low-level and high-level features, without accounting for the gap between their spatial resolutions and semantic levels. In this paper, we propose a new feature fusion method for human pose estimation. We inject high-level semantic information into low-level features to strengthen the fusion. Furthermore, to preserve both high-level semantics and high-resolution localization detail, we use Global Convolutional Network (GCN) blocks to bridge the gap between low-level and high-level features. Experiments on the MPII and LSP human pose estimation datasets demonstrate that efficient feature fusion significantly improves performance. The code is available at: https://github.com/tongjiangwei/FeatureFusion.
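To make the fusion idea concrete, below is a minimal PyTorch sketch of a Global Convolutional Network block (the standard large-kernel decomposition: a k×k convolution approximated by 1×k→k×1 and k×1→1×k branches) together with a hypothetical fusion step in the spirit the abstract describes. The names `GCNBlock` and `fuse`, the channel counts, and the kernel size `k` are illustrative assumptions, not the paper's exact configuration; consult the linked repository for the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GCNBlock(nn.Module):
    """Global Convolutional Network block (large-kernel decomposition).

    A k x k convolution is approximated by two cheap branches
    (1 x k -> k x 1 and k x 1 -> 1 x k), enlarging the receptive
    field of low-level features at modest cost.
    """
    def __init__(self, in_ch, out_ch, k=7):
        super().__init__()
        pad = k // 2
        self.branch_a = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, (1, k), padding=(0, pad)),
            nn.Conv2d(out_ch, out_ch, (k, 1), padding=(pad, 0)),
        )
        self.branch_b = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, (k, 1), padding=(pad, 0)),
            nn.Conv2d(out_ch, out_ch, (1, k), padding=(0, pad)),
        )

    def forward(self, x):
        # Summing the two branches approximates a dense k x k filter.
        return self.branch_a(x) + self.branch_b(x)

def fuse(low, high, gcn):
    """Hypothetical fusion: upsample high-level features to the
    low-level resolution and add them to GCN-refined low-level features,
    injecting semantics while keeping spatial detail."""
    high_up = F.interpolate(high, size=low.shape[-2:], mode="bilinear",
                            align_corners=False)
    return gcn(low) + high_up

# Toy shapes: a 64-channel low-level map at 64x64 and a high-level map
# at 16x16 whose channels are assumed already projected to 64.
low = torch.randn(1, 64, 64, 64)
high = torch.randn(1, 64, 16, 16)
out = fuse(low, high, GCNBlock(64, 64))
print(out.shape)  # torch.Size([1, 64, 64, 64])
```

The design choice sketched here is additive fusion after a GCN refinement rather than plain concatenation, which is one simple way to narrow the semantic gap the abstract points out.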
