Human pose transfer aims to synthesize images of a reference person in a target pose, a task with substantial economic potential for e-commerce and virtual reality. In this paper, we propose a novel method, the Attentional Pixel-wise Deformation Network (APD-Net), for synthesizing human images from a guiding pose and reference images. Specifically, we use attention-based spatial transformation modules and affine transformation modules to generate accurate appearance and to extract pixel-wise details in local regions, producing intermediate results. Additionally, we introduce a confidence map to refine spatial information during the final image synthesis. A domain alignment loss, a cycle loss, perceptual and feature matching losses, and a contextual loss constrain the synthesized images, while an attention loss and a fusion loss improve the generation of warped images. We verify the efficacy of the model on the Market-1501 and DeepFashion datasets, using quantitative and qualitative measures. Our approach surpasses all previously published state-of-the-art results on most evaluation metrics, e.g., achieving a 0.780 SSIM score, a 9.55 Sliced Wasserstein Distance score, and a 0.963 Semantic Consistency score on DeepFashion, and a 0.303 SSIM score, a 16.971 Sliced Wasserstein Distance score, and a 0.729 Semantic Consistency score on Market-1501. Code and pretrained models are available at: https://github.com/LiaoFJ/APD-Net/.