Abstract

Applying transformers for global feature fusion is a novel and efficient approach to pose estimation. Although the computational complexity of dense attention can be significantly reduced by pruning candidate human tokens, pose estimation accuracy still suffers from heavily overlapping candidate regions and severe background noise. Moreover, the undifferentiated fusion of features from different views leads to a considerable loss of effective information. To address these challenges, we propose a Gated Region-Refine Pose Transformer (GRRPT) for human pose estimation. The proposed GRRPT first localizes the approximate human body region from coarse-grained tokens and then embeds it into fine-grained tokens to extract finer joint details. Experimental results on COCO demonstrate that the proposed Multi-Resolution Attention mechanism learns more refined candidate regions and improves accuracy. Furthermore, we design a Fusion Gate module consisting of two gates that select valid information pixel-wise from auxiliary views, which significantly alleviates information redundancy. Finally, we evaluate our method on Human3.6M and on our own FDU-Motion dataset and achieve state-of-the-art performance.

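The paper's implementation details are not reproduced here, so the following is only a minimal sketch of how a two-gate, pixel-wise fusion of a reference view with an auxiliary view could look in PyTorch. The module name FusionGate, the 1x1-convolution gates, and the blending rule are assumptions for illustration, not the authors' actual design.

```python
import torch
import torch.nn as nn


class FusionGate(nn.Module):
    """Hypothetical sketch of a two-gate, pixel-wise fusion module.

    Given a feature map from the reference view and one from an auxiliary
    view, each gate predicts per-pixel weights from the concatenated
    features, so only useful auxiliary information is blended into the
    reference features (an assumption about the mechanism, not the paper's
    exact formulation).
    """

    def __init__(self, channels: int):
        super().__init__()
        # One 1x1-conv gate per stream; a sigmoid yields per-pixel weights in [0, 1].
        self.ref_gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1), nn.Sigmoid()
        )
        self.aux_gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1), nn.Sigmoid()
        )

    def forward(self, ref: torch.Tensor, aux: torch.Tensor) -> torch.Tensor:
        # ref, aux: (B, C, H, W) feature maps from the two views.
        joint = torch.cat([ref, aux], dim=1)
        g_ref = self.ref_gate(joint)   # how much of the reference view to keep
        g_aux = self.aux_gate(joint)   # how much of the auxiliary view to admit
        return g_ref * ref + g_aux * aux


# Usage: fuse two 256-channel feature maps of spatial size 64x48.
gate = FusionGate(channels=256)
fused = gate(torch.randn(2, 256, 64, 48), torch.randn(2, 256, 64, 48))
print(fused.shape)  # torch.Size([2, 256, 64, 48])
```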