Abstract

Human pose estimation (HPE) is a fundamental yet challenging visual recognition problem. Existing popular methods either directly add local features element-wise (e.g., Hourglass and its variants) or learn the global relationships among different human parts (e.g., vision transformers). However, effectively integrating local and global representations for accurate HPE remains an open problem. In this work, we design four feature fusion strategies on the hierarchical ResNet structure: direct channel concatenation, element-wise addition, and two parallel structures. Both parallel structures adopt a vanilla self-attention encoder to model global dependencies; they differ in that one adopts the original ResNet BottleNeck while the other employs a spatial-attention module (named SSF) to learn local patterns. Experiments on COCO Keypoint 2017 show that our SSF-based model for HPE (named SSPose) achieves the best average precision with acceptable computational cost among the compared state-of-the-art methods. In addition, we build a lightweight running dataset to verify the effectiveness of SSPose. Based solely on the keypoints estimated by SSPose, we propose a regression model that identifies valid running movements without training any additional classifier. Our source code and running dataset are publicly available.
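The two simplest fusion strategies named above can be sketched as plain tensor operations. The following is a minimal NumPy illustration under assumed shapes and branch names; it is not the paper's implementation:

```python
import numpy as np

# Feature maps from two parallel branches, shape (channels, height, width).
# Branch names are hypothetical: one stands in for a local (convolutional)
# branch, the other for a global (self-attention) branch.
local_feat = np.random.rand(64, 16, 16)
global_feat = np.random.rand(64, 16, 16)

# Strategy 1: direct channel concatenation -> doubles the channel dimension,
# leaving a later layer to mix the two representations.
fused_concat = np.concatenate([local_feat, global_feat], axis=0)

# Strategy 2: element-wise addition -> keeps the channel dimension,
# merging the two representations immediately.
fused_add = local_feat + global_feat

print(fused_concat.shape)  # (128, 16, 16)
print(fused_add.shape)     # (64, 16, 16)
```

Concatenation preserves both branches' information at the cost of extra channels (and hence parameters downstream), while addition is parameter-free but forces the branches to share one feature space.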
