Abstract

One-shot face reenactment aims to animate a source facial image into the pose and expression provided by a driving image. Despite recent advances showing promising results, challenges remain in accurately transferring facial expressions and in staying robust to large-scale head movements. In addition, current methods' lack of interpretability in the network's generalization remains a concern. To address these challenges, we propose a novel and interpretable framework for one-shot face reenactment based on spatial–temporal reconstruction views. To achieve precise transfer of facial expressions and intuitive control over the generated images, we adopt the powerful 3D Morphable Face Model (3DMM) as an intermediate representation. Transferring motion parameters such as pose and expression from the driving image to the target face allows us to generate a 3D reconstructed face image under explicit control. To enhance the robustness of motion reconstruction, we introduce the concept of dense pseudo-temporal flow and propose a cross-modal cascade temporal motion reconstruction network to predict it. Our experimental results demonstrate its effectiveness in improving the robustness of motion reconstruction, especially for large-scale head movements. To further improve the quality of the generated images, we propose a motion-modulative residual fusion block as the information path that transfers motion information from the motion reconstruction network to the local texture reconstruction network. Finally, we discuss the generalization ability of our framework based on its interpretable generation process. Extensive experiments demonstrate that our framework generates highly realistic facial images and is comparable with state-of-the-art methods.
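
To make the intermediate representation concrete, below is a minimal sketch (not the authors' released code) of the 3DMM parameter transfer described above: identity and texture coefficients are kept from the source image, while expression and pose coefficients are swapped in from the driving image before the face is re-rendered. All function names, coefficient names, and dimensions here are illustrative assumptions.

```python
# Minimal sketch of 3DMM-based motion transfer, assuming a linear 3DMM:
#   S = S_mean + B_id @ alpha + B_exp @ beta
# where alpha encodes identity and beta encodes expression.
import numpy as np

def estimate_3dmm_coeffs(image):
    """Hypothetical 3DMM fitting step; a real system would use an
    off-the-shelf coefficient regressor trained for this purpose."""
    raise NotImplementedError

def transfer_motion(src_coeffs, drv_coeffs):
    # Keep identity (shape) and texture from the source face;
    # take expression and pose (rotation + translation) from the driver.
    return {
        "identity":    src_coeffs["identity"],
        "texture":     src_coeffs["texture"],
        "expression":  drv_coeffs["expression"],
        "rotation":    drv_coeffs["rotation"],
        "translation": drv_coeffs["translation"],
    }

def reconstruct_shape(coeffs, mean_shape, id_basis, exp_basis):
    # Linear combination of bases yields the posed-neutral 3D shape;
    # rendering with coeffs["rotation"]/["translation"] would follow.
    shape = (mean_shape
             + id_basis @ coeffs["identity"]
             + exp_basis @ coeffs["expression"])
    return shape.reshape(-1, 3)  # N vertices x (x, y, z)
```

Because the recombined coefficients are explicit and low-dimensional, each factor (identity, expression, pose) can be inspected or edited independently, which is the source of the explicit control the abstract refers to.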
