Abstract

When accurately annotated pose data are hard to acquire, training samples are scarce, and limited pose labels alone make it difficult to provide optimal end-to-end supervision. To address this limitation, we propose MSPose, an approach that combines diverse data sources from human instances to provide multi-level supervision in human pose estimation. By incorporating human bounding box and mask information, it mitigates the impact of scarce training data and improves pose estimation precision. The bounding box and mask signals serve as supplementary supervision during training and drive hierarchical pixel activation at three levels: the human body region, structure, and joints. Importantly, the auxiliary branches are used only during training and are excluded at inference. Experiments on a quarter of the COCO and MPII datasets show that our method is competitive with state-of-the-art methods. Compared to TokenPose-T, our MSPose-T improves average precision by 6.1 points, with a corresponding 3.1-point increase on the full COCO validation set. On the MPII dataset, MSPose achieves the highest Mean score of 1.9 points. These improvements are achieved while keeping the parameter count and GFLOPs consistent.
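The training-only auxiliary supervision described above can be illustrated with a minimal sketch. It is not the MSPose implementation: the pairing of the bounding box with the "region" level and the mask with the "structure" level, the mean-squared-error loss, the weights `w_region` and `w_structure`, and the names `forward`, `Outputs`, and `multi_level_loss` are all assumptions made for illustration. Small nested lists stand in for per-pixel maps.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

Grid = List[List[float]]  # tiny stand-in for a per-pixel map


def mse(pred: Grid, target: Grid) -> float:
    """Mean squared error between two equally sized grids."""
    diffs = [(p - t) ** 2
             for pred_row, target_row in zip(pred, target)
             for p, t in zip(pred_row, target_row)]
    return sum(diffs) / len(diffs)


@dataclass
class Outputs:
    joints: Grid                       # joint-level map, kept at inference
    region: Optional[Grid] = None      # box-level map, training only (assumed)
    structure: Optional[Grid] = None   # mask-level map, training only (assumed)


def forward(features: Grid, training: bool) -> Outputs:
    """Hypothetical model: each 'head' simply copies the shared features."""
    joints = [row[:] for row in features]
    if not training:
        # Auxiliary branches are skipped entirely at inference.
        return Outputs(joints=joints)
    return Outputs(
        joints=joints,
        region=[row[:] for row in features],
        structure=[row[:] for row in features],
    )


def multi_level_loss(out: Outputs, targets: Dict[str, Grid],
                     w_region: float = 0.5, w_structure: float = 0.5) -> float:
    """Joint loss plus weighted auxiliary terms (weights are illustrative)."""
    loss = mse(out.joints, targets["joints"])
    if out.region is not None:
        loss += w_region * mse(out.region, targets["box"])
    if out.structure is not None:
        loss += w_structure * mse(out.structure, targets["mask"])
    return loss
```

In this sketch the inference path returns only the joint map, so the auxiliary heads add no cost once training is finished, which mirrors the claim that parameters and GFLOPs stay consistent.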
