Abstract

Image generation from text is the task of generating new images from a textual unit such as a word, phrase, clause, or sentence. It has attracted great attention in both the natural language processing and computer vision communities. Current approaches usually employ an end-to-end framework to tackle the problem. However, we find that entity information, including the categories and attributes of the objects in images, is ignored by most approaches. Such information is crucial for guaranteeing semantic alignment and generating images accurately. For two images of the same category, the emphasis of their text descriptions may differ, yet the generated images should share some similarities, so the two generation processes can learn from each other. Therefore, we propose two novel end-to-end frameworks that incorporate entity information into the image generation process. In the first framework, an image representation is generated from entity labels via variational inference and then fused with the representation generated from the corresponding sentence. In the second framework, instead of fusing representations in the high-dimensional image space, they are inferred and fused in a low-dimensional latent space, allowing the computationally intensive upsampling modules to be shared. Moreover, a novel metric, the Entity Matching Score, is proposed to measure how consistent a generated image is with its corresponding text description; its effectiveness is demonstrated by the generated samples in our experiments. Experimental results show that both proposed frameworks significantly outperform several state-of-the-art approaches on two benchmark datasets.
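To make the second framework's latent-space fusion concrete, below is a minimal PyTorch sketch, not the authors' code: an entity-label representation is sampled variationally (VAE-style reparameterization), fused with a sentence representation in a low-dimensional latent space, and decoded by a single shared upsampling path. All module names, dimensions, and the concatenation-based fusion operator are illustrative assumptions.

```python
# Minimal sketch of latent-space fusion with a shared upsampling decoder.
# Assumptions (not from the paper): EmbeddingBag entity pooling, concat fusion,
# a 3-stage ConvTranspose2d upsampler, and a 32x32 output resolution.
import torch
import torch.nn as nn

class EntityEncoder(nn.Module):
    """Maps a bag of entity labels to a latent Gaussian sample (VAE-style)."""
    def __init__(self, num_entities=1000, latent_dim=128):
        super().__init__()
        self.embed = nn.EmbeddingBag(num_entities, 256)  # pools multi-label input
        self.mu = nn.Linear(256, latent_dim)
        self.logvar = nn.Linear(256, latent_dim)

    def forward(self, entity_ids):
        h = self.embed(entity_ids)
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization trick: z = mu + sigma * eps
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return z, mu, logvar  # mu/logvar would feed a KL term during training

class LatentFusionGenerator(nn.Module):
    """Fuses entity and sentence latents, then decodes with one shared upsampler."""
    def __init__(self, sent_dim=256, latent_dim=128):
        super().__init__()
        self.fuse = nn.Linear(latent_dim + sent_dim, 128 * 4 * 4)
        self.upsample = nn.Sequential(                          # shared, expensive path
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU(),    # 4x4  -> 8x8
            nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.ReLU(),     # 8x8  -> 16x16
            nn.ConvTranspose2d(32, 3, 4, 2, 1), nn.Tanh(),      # 16x16 -> 32x32
        )

    def forward(self, z_entity, sent_emb):
        # Fusion happens here, in the low-dimensional latent space,
        # before any upsampling is performed.
        h = self.fuse(torch.cat([z_entity, sent_emb], dim=-1))
        return self.upsample(h.view(-1, 128, 4, 4))

# Usage: two entity labels per image plus a precomputed sentence embedding.
enc, gen = EntityEncoder(), LatentFusionGenerator()
entity_ids = torch.tensor([[3, 17]])   # hypothetical ids, e.g. "bird", "red wings"
sent_emb = torch.randn(1, 256)         # stand-in for any text-encoder output
z, mu, logvar = enc(entity_ids)
img = gen(z, sent_emb)                 # -> tensor of shape (1, 3, 32, 32)
```

The design point this sketch illustrates is that only one decoder exists: because the entity and sentence signals are merged before upsampling, the ConvTranspose2d stack is shared across both sources rather than duplicated, which is where the second framework saves computation.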
