Abstract

The automatic generation of realistic images directly from a story text is a very challenging problem, as it cannot be addressed with a single image generation approach, mainly because of the semantic complexity of the story text constituents. In this work, we propose a new approach that decomposes the task of story visualization into three phases: semantic text understanding, object layout prediction, and image generation and refinement. We start by simplifying the text into a scene graph triple notation that encodes semantic relationships between the story objects. We then introduce an object layout module to capture the features of these objects from the corresponding scene graph. Specifically, the object layout module aggregates individual object features from the scene graph as well as averaged or likelihood object features generated by a graph convolutional neural network. All these features are concatenated to form semantic triples that are then provided to the image generation framework. For the image generation phase, we adopt a scene graph image generation framework as stage-I, whose output is refined in stage-II by a StackGAN conditioned on the object layout module and the image generated in stage-I. Our approach renders object details in high-resolution images while keeping the image structure consistent with the input text. To evaluate its performance, we use the COCO dataset and compare our approach with three baselines, namely sg2im, StackGAN, and AttnGAN, in terms of image quality and user evaluation. The assessment results show that our object layout guidance-based approach significantly outperforms these baselines in the accuracy of semantic matching and the realism of the images generated for the story text sentences.
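A minimal PyTorch-style sketch of the feature aggregation described above, assuming hypothetical module and tensor names and dimensions: per-object embeddings and GCN-derived object features are concatenated along each (subject, predicate, object) triple to form the semantic triple vectors that condition image generation. This is an illustration of the idea, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class ObjectLayoutModule(nn.Module):
    """Hypothetical sketch: builds semantic triple vectors by concatenating
    individual object features with GCN-derived object features."""

    def forward(self, obj_embed, gcn_feats, pred_embed, triples):
        # obj_embed:  (num_objs, D)      individual object features from the scene graph
        # gcn_feats:  (num_objs, D)      averaged / likelihood features from the GCN
        # pred_embed: (num_triples, D)   relation (predicate) features
        # triples:    (num_triples, 2)   (subject_index, object_index) per relation
        s, o = triples[:, 0], triples[:, 1]
        subj = torch.cat([obj_embed[s], gcn_feats[s]], dim=1)
        obj = torch.cat([obj_embed[o], gcn_feats[o]], dim=1)
        # One concatenated vector per triple, passed on to the generation stages.
        return torch.cat([subj, pred_embed, obj], dim=1)

# Usage with random tensors: three objects, two relations, feature size 128.
layout = ObjectLayoutModule()
obj_embed = torch.randn(3, 128)
gcn_feats = torch.randn(3, 128)
pred_embed = torch.randn(2, 128)
triples = torch.tensor([[0, 1], [1, 2]])
print(layout(obj_embed, gcn_feats, pred_embed, triples).shape)  # torch.Size([2, 640])
```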

Highlights

  • Image generation for the task of story visualization aims to generate meaningful and coherent images representing the story text [16]

  • To account for the object labels specified in the scene graph during layout prediction, we introduce an object layout module that aggregates the features of all objects and relations in the scene graph

  • We conducted extensive experiments to evaluate our approach. We compared it with state-of-the-art image synthesis approaches, namely sg2im, StackGAN, and AttnGAN, and demonstrated its performance in terms of semantic matching, object recognition, realism, and image quality


Summary

Introduction

Image generation for the task of story visualization aims to generate meaningful and coherent images representing the story text [16]. This is a challenging task, since it requires a deep understanding of the objects involved in the story as well as their mutual interactions and semantic connections [17]. With the advent of large datasets such as COCO [1, 19] and image synthesis models [22, 23, 26, 28, 36, 37], pairing images with natural language descriptions has become possible without enormous effort, in contrast to discriminative methods. For example, a sentence such as "Two elephants are close to two sheep" can be expressed as scene graph triples relating its objects. A detailed qualitative and quantitative analysis of our results and comparisons against the baseline models are discussed.
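As a hedged illustration of the semantic text understanding phase, the example sentence above could be encoded as a small scene graph of objects and relationship triples. The field names below are illustrative and not necessarily the exact input format expected by the sg2im stage-I generator.

```python
# Hypothetical scene-graph encoding of "Two elephants are close to two sheep."
scene_graph = {
    "objects": ["elephant", "elephant", "sheep", "sheep"],
    # Each relationship is a (subject_index, predicate, object_index) triple.
    "relationships": [
        [0, "close to", 2],
        [1, "close to", 3],
    ],
}

# Flattened triple view of the kind consumed by the object layout module.
triples = [(scene_graph["objects"][s], p, scene_graph["objects"][o])
           for s, p, o in scene_graph["relationships"]]
print(triples)  # [('elephant', 'close to', 'sheep'), ('elephant', 'close to', 'sheep')]
```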

