Abstract

Human pose estimation in crowded scenes is a challenging task: due to overlap and occlusion, it is difficult to infer pose cues from individual keypoints. We propose PFFormer, a new transformer-based approach that treats pose estimation as a hierarchical set prediction problem, first focusing on human windows and coarsely predicting whole-body poses within them. In PFFormer, we design a Windows Clustering Transformer (WCT) that reorganizes image windows by retaining the attentive windows and fusing the inattentive ones, allowing the transformer to focus on important regions while reducing interference from complex backgrounds; a global transformer then compensates for the information lost in this filtering. We then partition the learned body pose into a set of structural parts and apply the Inter-Part Relation Module (IPRM) to capture the correlations among parts. These full-body poses and part features are refined at a finer level by the Part-to-Joint Decoder (PJD). Extensive experiments show that PFFormer performs favorably against its counterparts on challenging datasets, including COCO 2017, CrowdPose, and OCHuman. Its performance in crowded scenes, in particular, demonstrates the robustness of the proposed method to occlusion.
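The filter-and-fuse idea behind the WCT can be illustrated with a minimal sketch: score each window token, keep the most attentive ones, and collapse the inattentive ones into a single fused token. The function name, the top-k selection, and the weighted-mean fusion rule below are illustrative assumptions for exposition, not the paper's exact WCT formulation.

```python
import numpy as np

def cluster_windows(tokens, scores, keep_ratio=0.5):
    """Sketch of window clustering: keep the most attentive window
    tokens, fuse the rest into one token (illustrative assumption,
    not the exact WCT design).

    tokens: (N, D) window features; scores: (N,) attentiveness.
    """
    n_keep = max(1, int(len(scores) * keep_ratio))
    order = np.argsort(scores)[::-1]   # most attentive first
    keep_idx = order[:n_keep]          # attentive windows: retained
    fuse_idx = order[n_keep:]          # inattentive windows: fused
    kept = tokens[keep_idx]
    if len(fuse_idx) > 0:
        # Score-weighted average of inattentive windows -> one token,
        # so their information is compressed rather than discarded.
        w = scores[fuse_idx] / (scores[fuse_idx].sum() + 1e-8)
        fused = (w[:, None] * tokens[fuse_idx]).sum(axis=0, keepdims=True)
        kept = np.concatenate([kept, fused], axis=0)
    return kept

# Example: 8 windows with 4-D features
rng = np.random.default_rng(0)
tokens = rng.standard_normal((8, 4))
scores = rng.random(8)
out = cluster_windows(tokens, scores, keep_ratio=0.5)
print(out.shape)  # (5, 4): 4 kept windows + 1 fused token
```

The shorter token sequence that results is what lets the subsequent transformer layers spend attention on person regions instead of background.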
