Abstract
Image animation produces visually compelling effects by animating a still source image according to a driving video. Recent unsupervised methods animate arbitrary objects and can transfer motion to human bodies with reasonable robustness. However, the complexity of human motion and the unknown correspondence between bodies often produce distorted limbs and missing semantic regions, making human animation challenging. In this paper, we propose a semantically guided, unsupervised motion-transfer method that uses semantic information to model both motion and appearance. First, we use a pre-trained human parsing network to encode rich and diverse foreground semantics, enabling the generation of fine details. Second, we use a cross-modal attention layer to learn the correspondence between semantic regions of different human bodies, guiding the network to select appropriate input features and thereby produce accurate results. Experiments demonstrate that our method outperforms state-of-the-art methods on motion-related metrics while effectively addressing the missing semantics and unclear limb structures prevalent in human motion transfer. These improvements can facilitate applications in fields such as education and entertainment.
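The abstract gives no implementation details, but as a rough illustration of the cross-modal attention idea it describes, below is a minimal PyTorch sketch. All names, tensor shapes, and the single-head design are assumptions for illustration, not the paper's actual architecture: queries are drawn from the driving branch while keys and values come from the source branch's parsing-conditioned features, so attention weights act as a soft correspondence between semantic regions.

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Hypothetical single-head cross-attention sketch (not the paper's
    architecture): queries come from the driving features, keys/values
    from the source's semantic (parsing-conditioned) features, letting
    the network select source features per semantic region."""

    def __init__(self, dim: int = 64):
        super().__init__()
        self.to_q = nn.Conv2d(dim, dim, kernel_size=1)  # driving features -> queries
        self.to_k = nn.Conv2d(dim, dim, kernel_size=1)  # source features -> keys
        self.to_v = nn.Conv2d(dim, dim, kernel_size=1)  # source features -> values
        self.scale = dim ** -0.5

    def forward(self, driving_feat, source_feat):
        b, c, h, w = driving_feat.shape
        q = self.to_q(driving_feat).flatten(2).transpose(1, 2)  # (b, h*w, c)
        k = self.to_k(source_feat).flatten(2)                   # (b, c, h*w)
        v = self.to_v(source_feat).flatten(2).transpose(1, 2)   # (b, h*w, c)
        attn = torch.softmax(q @ k * self.scale, dim=-1)        # soft region correspondence
        out = (attn @ v).transpose(1, 2).reshape(b, c, h, w)    # source features re-aligned
        return out

# Toy usage: 64-channel feature maps at 32x32 resolution.
layer = CrossModalAttention(dim=64)
driving = torch.randn(1, 64, 32, 32)
source = torch.randn(1, 64, 32, 32)
print(layer(driving, source).shape)  # torch.Size([1, 64, 32, 32])
```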