Abstract

Text-to-image synthesis, the task of converting natural language descriptions into corresponding images, is widely used in applications such as virtual reality, game development, and image editing. Techniques based on generative adversarial networks (GANs) have achieved considerable success, but generating high-quality, diverse, and semantically consistent images remains challenging. To address these problems, this paper proposes ET-DM, a novel text-to-image synthesis technique that combines a diffusion model with an efficient Transformer. The diffusion model simulates the evolution of pixel values and generates an image through repeated denoising iterations, while the efficient Transformer encodes the text input to condition the generation process. During generation, ET-DM controls the image at the pixel level to ensure both visual and semantic consistency, and it produces diverse images by varying the injected random noise. Experiments on multiple datasets show that ET-DM outperforms existing methods in image quality and diversity while being more computationally efficient. ET-DM thus represents a promising approach to image generation from textual descriptions, with applications in computer vision, natural language processing, and creative artificial intelligence.
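To make the described pipeline concrete, the following is a minimal, generic sketch of text-conditioned diffusion sampling in PyTorch. It is not the paper's released code: `text_encoder` (standing in for the efficient Transformer) and `denoiser` (the noise-prediction network), along with the linear noise schedule and image shape, are illustrative assumptions; it only shows how a text embedding conditions the iterative denoising loop and how the injected noise yields diverse samples.

```python
import torch

# Hypothetical components -- names, shapes, and schedule are illustrative, not from the paper.
# text_encoder: an efficient Transformer mapping token ids -> conditioning embeddings
# denoiser:     a network predicting the noise in x_t given the timestep and the text condition

@torch.no_grad()
def sample(denoiser, text_encoder, token_ids, steps=1000, image_shape=(1, 3, 64, 64)):
    """Minimal DDPM-style reverse process conditioned on a text prompt."""
    betas = torch.linspace(1e-4, 0.02, steps)        # assumed linear noise schedule
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    cond = text_encoder(token_ids)                   # text conditioning from the Transformer
    x = torch.randn(image_shape)                     # start from pure Gaussian noise

    for t in reversed(range(steps)):
        eps = denoiser(x, torch.tensor([t]), cond)   # predict the noise added at step t
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])          # posterior mean of x_{t-1}
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise      # stochastic step; varying this noise gives diverse images
    return x
```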
