Abstract

Human pose estimation is a challenging visual task that relies on spatial location information. Improving its performance requires accurately modeling the constraint relationships among keypoints. To this end, we propose MfvPose, a novel hybrid model that leverages rich multi-scale information. The model incorporates the HRFOV module, which uses cascaded atrous convolution to maintain the high-resolution representations of the backbone extractor and to enrich multi-scale information. In addition, we introduce learnable scalar weights into the Transformer encoder: the output of each residual block is multiplied by a diagonal matrix of learnable scalar weights, which improves training dynamics and enhances the accuracy of human pose estimation. Experiments show that MfvPose achieves promising results on various benchmarks.
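
The diagonal scaling described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the scaling is applied to the residual branch before the skip connection is added, i.e. y = x + diag(s) f(x), which the abstract does not state explicitly. The names `scaled_residual`, `toy_branch`, and `scales`, as well as the initial value 0.1, are hypothetical and chosen only for illustration.

```python
from typing import Callable, List

Vector = List[float]


def scaled_residual(
    x: Vector,
    branch: Callable[[Vector], Vector],
    scales: Vector,
) -> Vector:
    """Return x + diag(scales) @ branch(x) for a single feature vector."""
    out = branch(x)
    if not (len(x) == len(out) == len(scales)):
        raise ValueError("x, branch(x) and scales must have the same length")
    # Multiplying by a diagonal matrix is the same as an element-wise
    # product with the entries on its diagonal.
    return [xi + si * oi for xi, si, oi in zip(x, scales, out)]


def toy_branch(x: Vector) -> Vector:
    # Hypothetical stand-in for an attention or feed-forward sub-layer.
    return [2.0 * v for v in x]


x = [1.0, -2.0, 0.5]
# In a real model these would be trainable parameters, one per channel.
# The starting value 0.1 is an assumption, not taken from the paper.
scales = [0.1, 0.1, 0.1]
y = scaled_residual(x, toy_branch, scales)
print(y)
```

With all scales set to zero the block reduces to the identity mapping, which is one way to see how such weights can influence early training: each residual branch contributes only as much as its learned scale allows.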
