Abstract

Current human pose estimation (HPE) methods rely mainly on Convolutional Neural Networks (CNNs). These CNNs typically consist of a high-to-low-resolution subnetwork (encoder) that learns semantic information and a low-to-high-resolution subnetwork (decoder) that raises the resolution for keypoint localization. Because feature maps that are too low in resolution inevitably lose spatial information in the encoder, and this information cannot be recovered in the upsampling stages, preserving high-spatial-resolution features is critical for human pose estimation. In addition, because human body parts vary in scale, multiscale features are also very important. In this paper, a novel backbone network is proposed specifically for HPE, named High Spatial Resolution and Multiscale Networks (HSR-MSNet). It maintains high-spatial-resolution features in the deeper layers of the encoder and constructs multiscale features within a single residual block via subgroup splitting and fusion of feature maps. Experiments show that our approach outperforms other state-of-the-art methods, with more accurate keypoint locations on the COCO dataset.

Highlights

  • Feature maps that are too low in resolution inevitably lose spatial information in the encoder, and this information cannot be recovered in the upsampling stages, so preserving high-spatial-resolution features is critical to improving human pose estimation performance

  • Experiments with HSRNet and HSR-MSNet investigate the effectiveness of preserving high-spatial-resolution features and of multiscale features, respectively, for human pose estimation

  • Since the feature maps of HSRNet have a higher spatial resolution than those of ResNet, two deconvolution layers are used so that the output heatmaps keep the same size
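
The size bookkeeping behind the last highlight can be sketched as follows. The numbers are assumptions for illustration and are not stated in the text above: a 256×192 input, a ResNet backbone with a total stride of 32 followed by three stride-2 deconvolution layers (the SimpleBaseline-style head), and an HSRNet-like backbone with a total stride of 16. The function name `heatmap_size` is hypothetical.

```python
def heatmap_size(input_hw, backbone_stride, num_deconv, deconv_stride=2):
    """Return the (height, width) of the output heatmap.

    The backbone shrinks the input by `backbone_stride`; each deconvolution
    layer then enlarges the feature map by `deconv_stride`.
    """
    h, w = input_hw
    if h % backbone_stride or w % backbone_stride:
        raise ValueError("input size must be divisible by the backbone stride")
    scale = deconv_stride ** num_deconv
    return (h // backbone_stride * scale, w // backbone_stride * scale)


# Assumed configuration: ResNet-style backbone (stride 32) with three deconvolutions.
resnet_out = heatmap_size((256, 192), backbone_stride=32, num_deconv=3)
# Assumed configuration: higher-resolution backbone (stride 16) with two deconvolutions.
hsr_out = heatmap_size((256, 192), backbone_stride=16, num_deconv=2)

print(resnet_out, hsr_out)  # (64, 48) (64, 48)
```

Under these assumptions, a backbone that halves the total downsampling needs one deconvolution layer fewer to reach the same heatmap size.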

Summary

Introduction

HPE faces two main challenges: (1) occlusion between different people causes ambiguity among joints, and (2) some invisible joints are hard to predict. To address these challenges, existing methods such as CPN [14] and SimpleBaseline [15] employ ResNet [16] as the backbone, which produces heavily downsampled feature maps. Excessive downsampling causes loss of spatial information [17], which makes joint localization difficult. Previous works [9, 10, 18] have shown that multiscale or pyramid features help with the problems caused by scale changes. This paper proposes a novel backbone network for HPE, named High Spatial Resolution and Multiscale Networks (HSR-MSNet). The network maintains high-spatial-resolution features in deeper layers while keeping large receptive fields, and it constructs multiscale features within a single residual block by channel splitting and fusion. The HSR-MSNet architecture is very lightweight, which suggests that, like MobileNet [19], it could be deployed on Internet-of-Things (IoT) devices.
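
One way to picture multiscale features built inside a single block by channel splitting and fusion is the hierarchical (Res2Net-style) wiring sketched below. This is an assumed stand-in, since the summary does not specify the exact wiring used in HSR-MSNet. The helper names `smooth3` and `split_and_fuse` are hypothetical, a 3-tap moving average on 1-D rows replaces a real 3×3 convolution, and the 1×1 convolution that usually follows the concatenation is omitted.

```python
def smooth3(row):
    """Stand-in for a 3x3 convolution: 3-tap moving average with zero padding."""
    padded = [0.0] + list(row) + [0.0]
    return [(padded[i] + padded[i + 1] + padded[i + 2]) / 3.0
            for i in range(len(row))]


def split_and_fuse(channels, num_groups=4):
    """Split channels into subgroups, filter them hierarchically, then concatenate.

    `channels` is a list of 1-D feature rows, one per channel.
    """
    if len(channels) % num_groups:
        raise ValueError("channel count must be divisible by num_groups")
    width = len(channels) // num_groups
    groups = [channels[i * width:(i + 1) * width] for i in range(num_groups)]

    outputs = [groups[0]]  # the first subgroup passes through unchanged
    prev = None
    for group in groups[1:]:
        if prev is not None:
            # Add the previous subgroup's output before filtering, so each
            # later subgroup sees a progressively larger receptive field.
            group = [[a + b for a, b in zip(g, p)] for g, p in zip(group, prev)]
        prev = [smooth3(row) for row in group]
        outputs.append(prev)

    # Fusion: concatenate all subgroup outputs along the channel axis.
    return [row for out in outputs for row in out]


# Four channels, each holding a single impulse at the centre of a 9-sample row.
impulse = [0.0] * 4 + [1.0] + [0.0] * 4
fused = split_and_fuse([list(impulse) for _ in range(4)], num_groups=4)

# Number of non-zero samples per output channel, i.e. the receptive-field width.
support = [sum(1 for v in row if v != 0.0) for row in fused]
print(support)  # [1, 3, 5, 7]
```

In this toy setting, the four subgroups respond over widths of 1, 3, 5, and 7 samples, so a single block yields features at several scales without any extra downsampling.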

Related Works
Our Approach
Human Pose Estimation with High Spatial Resolution Features
Method
Experiments
Conclusion