In this paper, we tackle a special face completion task, facial displacement completion, which serves as a key component of many single-image 3D face reconstruction systems. To produce a detailed 3D face with a complete ear-to-ear displacement UV map, we propose a novel Displacement Completion method based on Transformers (DCT). Current transformer-based image inpainting methods usually follow a two-stage scheme: they first recover the masked pixels at low resolution with a transformer, and then refine the result to high resolution with a GAN. Although these methods have achieved great success, they suffer from information loss in two respects when applied to face completion: 1) the downsampling operation lets the transformer produce only a coarse appearance prior for the GAN, incurring mid- and low-level information loss; 2) meaningful facial semantics could be captured by the transformer and further benefit the completion, but this potential has not yet been explored. Motivated by these considerations, we introduce three key designs in the proposed DCT: PCA tokenization, BERT-style learning, and style modulation. First, we replace the downsampling in the transformer with PCA tokenization to preserve more meaningful structure. Second, we have the transformer emulate the two BERT pretraining tasks, Masked Language Model (MLM) and Next Sentence Prediction (NSP), to recover both the masked pixels and the facial attributes. Third, we encode the transformer output as a latent code that guides an image translation network via StyleGAN2-style modulation. Experiments on the FaceScape dataset and on in-the-wild data demonstrate that DCT outperforms other transformer-based and GAN-based completion methods.
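The abstract describes PCA tokenization as a replacement for downsampling when feeding the displacement UV map to the transformer. Below is a minimal sketch of how such a tokenizer could work, assuming each UV-map patch is projected onto a PCA basis learned offline; the patch size, token dimensionality, and helper names are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch (not the authors' code) of a PCA-tokenization idea:
# instead of downsampling the displacement UV map, split it into patches
# and project each patch onto a precomputed PCA basis, so each token
# compactly preserves structural detail for the transformer.
import numpy as np

PATCH = 16          # patch side length (assumed)
N_COMPONENTS = 64   # PCA dimensions kept per patch token (assumed)

def extract_patches(uv_map: np.ndarray) -> np.ndarray:
    """Split an (H, W) displacement UV map into flattened PATCH x PATCH patches."""
    h, w = uv_map.shape
    return (uv_map
            .reshape(h // PATCH, PATCH, w // PATCH, PATCH)
            .swapaxes(1, 2)
            .reshape(-1, PATCH * PATCH))

def fit_pca_basis(training_maps):
    """Learn a PCA basis (mean + principal directions) from training-set patches."""
    all_patches = np.concatenate([extract_patches(m) for m in training_maps], axis=0)
    mean = all_patches.mean(axis=0)
    # SVD of the centered patch matrix yields the principal directions.
    _, _, vt = np.linalg.svd(all_patches - mean, full_matrices=False)
    return mean, vt[:N_COMPONENTS]

def pca_tokenize(uv_map: np.ndarray, mean: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Project each patch onto the PCA basis; each row becomes one transformer token."""
    return (extract_patches(uv_map) - mean) @ basis.T   # (num_patches, N_COMPONENTS)

def pca_detokenize(tokens: np.ndarray, mean: np.ndarray, basis: np.ndarray,
                   h: int, w: int) -> np.ndarray:
    """Reconstruct a UV map from tokens (e.g. after masked tokens are filled in)."""
    patches = tokens @ basis + mean
    return (patches
            .reshape(h // PATCH, w // PATCH, PATCH, PATCH)
            .swapaxes(1, 2)
            .reshape(h, w))
```

Because the PCA projection is linear and invertible up to the discarded components, the completed tokens can be mapped straight back to a UV map, which is one way such a tokenizer could avoid the coarse-prior bottleneck that plain downsampling introduces.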