Abstract

Image generation from scene graphs has traditionally focused on first predicting a layout from the scene graph using graph convolutional networks, then converting the layout to an image. These methods may involve complex architectures, auxiliary losses, or side information such as object locations or image crops during training. We propose a new transformer-based method for end-to-end image generation from scene graphs. To convert scene graphs into the input format required by the transformer, we propose a novel scene-graph tokenization method. Furthermore, a discrete autoencoder is used to map images into a discrete token space. Finally, we train the transformer to predict image tokens conditioned on the scene graph in an autoregressive manner. Experiments on the Visual Genome and COCO-Stuff datasets show that our method significantly outperforms state-of-the-art methods in terms of image quality.
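The abstract does not specify the exact tokenization scheme, so the following is only a minimal sketch of the general idea: flatten (subject, predicate, object) triples into a token sequence and concatenate it with image tokens produced by a discrete autoencoder, so that a transformer can model the image tokens autoregressively. The vocabularies, token ids, and placeholder image tokens below are illustrative assumptions, not the authors' implementation.

```python
# Sketch: flatten a scene graph into tokens and prepend it to image tokens
# for autoregressive modeling. Vocabularies and codebook size are assumed.
import torch

# Toy vocabularies for objects and relations (illustrative only).
OBJ_VOCAB = {"sky": 1, "grass": 2, "sheep": 3, "tree": 4}
REL_VOCAB = {"above": 100, "standing on": 101, "by": 102}

def tokenize_scene_graph(triples):
    """Flatten (subject, predicate, object) triples into one token sequence."""
    tokens = []
    for subj, pred, obj in triples:
        tokens += [OBJ_VOCAB[subj], REL_VOCAB[pred], OBJ_VOCAB[obj]]
    return torch.tensor(tokens, dtype=torch.long)

# Example scene graph: "sky above grass", "sheep standing on grass".
graph_tokens = tokenize_scene_graph([
    ("sky", "above", "grass"),
    ("sheep", "standing on", "grass"),
])

# Image tokens would come from the discrete autoencoder's codebook
# (e.g. VQ-VAE-style indices); random placeholders are used here.
image_tokens = torch.randint(0, 512, (16,))

# Training sequence: condition on graph tokens, predict image tokens
# left to right with a standard autoregressive transformer.
sequence = torch.cat([graph_tokens, image_tokens]).unsqueeze(0)  # (1, seq_len)
print(sequence.shape)
```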
