Semantic image synthesis aims to generate images from given semantic layouts, a challenging task that requires models to capture the relationship between layouts and images. Previous works are usually based on Generative Adversarial Networks (GANs) or autoregressive (AR) models. However, GAN training is unstable, and AR models' performance is seriously affected by the independent image encoder and the unidirectional generation bias. Due to these limitations, such methods tend to synthesize unrealistic, poorly aligned images and consider only single-style image generation. In this paper, we propose a Multi-model Style-aware Diffusion Learning (MSDL) framework for semantic image synthesis, consisting of a training module and a sampling module. In the training module, a layout-to-image model is introduced that transfers knowledge from a model pretrained on massive weakly correlated text-image pairs, making training more efficient. In the sampling module, we design a map-guidance technique and a multi-model style-guidance strategy for generating images in multiple styles, e.g., oil painting, Disney cartoon, and pixel style. We evaluate our method on Cityscapes, ADE20K, and COCO-Stuff, performing visual comparisons and quantitative evaluation with metrics such as FID and LPIPS. Experimental results demonstrate that our model is highly competitive, especially in terms of fidelity and diversity.
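The sampling module described above combines a map-guidance term and a style-guidance term during diffusion sampling. The Python sketch below only illustrates that general idea under assumed interfaces: the denoiser signature, guidance scales, and composition order are our assumptions, not the paper's exact formulation.

# Minimal sketch (assumed, not the paper's formulation): compose unconditional,
# layout-conditioned, and style-conditioned noise predictions in a
# classifier-free-guidance-like way at each sampling step.
import torch

def guided_noise(denoiser, x_t, t, layout, style,
                 w_map: float = 3.0, w_style: float = 2.0):
    """Combine map guidance and style guidance into one noise prediction."""
    eps_uncond = denoiser(x_t, t, layout=None, style=None)
    eps_map    = denoiser(x_t, t, layout=layout, style=None)
    eps_style  = denoiser(x_t, t, layout=layout, style=style)
    # Push the sample toward the semantic layout, then toward the requested style.
    return (eps_uncond
            + w_map   * (eps_map - eps_uncond)
            + w_style * (eps_style - eps_map))

# Toy usage with a stub denoiser that ignores its conditions.
if __name__ == "__main__":
    stub = lambda x, t, layout=None, style=None: torch.zeros_like(x)
    x_t = torch.randn(1, 3, 64, 64)
    eps = guided_noise(stub, x_t, t=torch.tensor([500]), layout=None, style=None)
    print(eps.shape)  # torch.Size([1, 3, 64, 64])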