Abstract

Human pose estimation from images is a highly challenging task owing to occlusion, blur, illumination changes, and scale variations. Several works have addressed this problem by designing novel encoders that extract discriminative features from input images; the design of decoders, however, remains understudied, as most approaches rely on the conventional transposed convolution for feature upsampling. In this paper, we propose HumanPoseNet, a novel transformer-based pose estimation architecture with a new decoder design. The encoder is a transformer with dual attention that highlights important spatial regions and channels. The decoder contains a novel Patch Expansion module for feature upsampling that is computationally more efficient than the traditional transposed convolution, together with an Attentional Feature Refinement Module that is integrated with the attention mechanism to extract refined features. Extensive experiments on the MS COCO benchmark dataset show that HumanPoseNet outperforms other state-of-the-art models, achieving an average precision (AP) of 77.3. A qualitative analysis confirms that HumanPoseNet predicts accurate pose keypoints even in occluded or blurred images, and an ablation study verifies the complementary contributions of the Patch Expansion module and the Attentional Feature Refinement Module. HumanPoseNet is also easy to train: the model converges in merely 50 epochs, whereas existing state-of-the-art methods are typically trained for 200 epochs or more.
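The abstract does not spell out how Patch Expansion works; the sketch below illustrates one plausible reading, assuming a Swin-Unet-style patch-expanding scheme in which a cheap linear projection doubles the channel count and the extra channels are then rearranged into a 2x2 spatial neighborhood. The function name `patch_expand` and the use of a random (rather than learned) projection matrix are illustrative assumptions, not the paper's exact design; the point is only that the cost is a single matrix multiply plus a reshape, with no sliding-window kernel as in a transposed convolution.

```python
import numpy as np

def patch_expand(x, H, W):
    """Illustrative Patch Expansion sketch (assumed Swin-Unet-style design,
    not necessarily the paper's exact module).

    x: (H*W, C) token features laid out row-major over an H x W grid.
    Returns (2H * 2W, C // 2): 2x spatial upsampling, halved channels.
    """
    N, C = x.shape
    assert N == H * W and C % 2 == 0
    # In the real model this projection would be a learned linear layer;
    # here a fixed random matrix stands in for it.
    rng = np.random.default_rng(0)
    W_proj = rng.standard_normal((C, 2 * C)) / np.sqrt(C)
    x = x @ W_proj                      # (H*W, 2C): one cheap matrix multiply
    x = x.reshape(H, W, 2, 2, C // 2)   # split channels into a 2x2 spatial block
    x = x.transpose(0, 2, 1, 3, 4)      # interleave: (H, 2, W, 2, C//2)
    return x.reshape(2 * H * 2 * W, C // 2)

# Example: a 4x4 grid of 8-channel tokens becomes an 8x8 grid of 4-channel tokens.
y = patch_expand(np.ones((16, 8)), 4, 4)
print(y.shape)  # (64, 4)
```

Because the only learnable cost is the `C -> 2C` projection, this kind of upsampling avoids the per-output-pixel kernel arithmetic of a transposed convolution, which is consistent with the efficiency claim in the abstract.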
