Abstract

Image animation aims to animate a still image of an object of interest using poses extracted from another video sequence. By training on a large-scale video dataset, most existing approaches learn disentangled appearance and pose representations of the training frames; the desired output with a specific appearance and pose can then be synthesized by recombining the learned representations. However, in some real-world applications, test images may lack corresponding ground-truth video or may follow a distribution different from that of the training video frames (i.e., a different domain), which largely limits the performance of existing methods. In this paper, we propose domain-independent pose representations that are compatible with and accessible to still images from a different domain. Specifically, we devise a two-stage self-supervised pose adaptation framework for general image animation tasks. A domain-independent pose adaptation generative adversarial network (DIPA-GAN) and a shuffle-patch generative adversarial network (Shuffle-patch GAN) are proposed to enforce the plausibility of the synthesized frame's pose and appearance, respectively. Experiments on various image animation tasks, including same- and cross-domain moving objects, facial expression transfer, and human pose retargeting, demonstrate the superiority of the proposed framework over prior work.

Impact Statement

Image animation is a popular technology in video production. Benefiting from the rapid development of artificial intelligence (AI), recent image animation algorithms have been widely used in real-world applications such as virtual AI news anchors, virtual try-on, and face swapping. However, most existing methods are designed for specific cases: to animate a new portrait, users must collect hundreds of images of the same person and train a new model. The technology proposed in this paper overcomes these training limitations and generalizes image animation. In the challenging cross-domain facial expression transfer task, a user study showed that our technology achieves an increase of more than 20% in animation success rate. The proposed technology could benefit users in a wide variety of industries, including movie production, virtual reality, social media, and online retail.
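The abstract does not spell out how the Shuffle-patch GAN builds its appearance supervision, but one natural reading is that shuffling the patches of a real frame yields a negative sample whose local texture is realistic while its global layout is not, so a discriminator trained against such samples must judge appearance coherence rather than patch-level statistics. The PyTorch sketch below illustrates that patch-shuffling step only; the function name, patch size, and overall design are illustrative assumptions, not the paper's implementation.

    # Hypothetical sketch of patch shuffling for a shuffle-patch
    # discriminator; illustrative only, not the paper's actual code.
    import torch

    def shuffle_patches(frames: torch.Tensor, patch: int = 32) -> torch.Tensor:
        """Cut each frame into non-overlapping patches and permute them.

        frames: (B, C, H, W), with H and W divisible by `patch`.
        The result keeps local appearance statistics (texture, color)
        but scrambles the global spatial layout.
        """
        b, c, h, w = frames.shape
        gh, gw = h // patch, w // patch
        # (B, C, H, W) -> (B, gh*gw, C, patch, patch): one row per patch.
        patches = (frames
                   .reshape(b, c, gh, patch, gw, patch)
                   .permute(0, 2, 4, 1, 3, 5)
                   .reshape(b, gh * gw, c, patch, patch))
        # Apply one random permutation of the patch grid to the batch.
        perm = torch.randperm(gh * gw, device=frames.device)
        patches = patches[:, perm]
        # Reassemble the shuffled patch grid back into full frames.
        return (patches
                .reshape(b, gh, gw, c, patch, patch)
                .permute(0, 3, 1, 4, 2, 5)
                .reshape(b, c, h, w))

Under this reading, the appearance discriminator would be trained to score shuffled real frames (and generated frames) as fake and unshuffled real frames as real, penalizing synthesized outputs whose appearance is only locally plausible, while the pose discriminator of DIPA-GAN supplies the complementary pose-level signal.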
