Abstract

Generative adversarial networks conditioned on textual image descriptions are capable of generating realistic-looking images. However, current methods still struggle to generate images based on complex image captions from a heterogeneous domain. Furthermore, quantitatively evaluating these text-to-image models is challenging, as most evaluation metrics only judge image quality but not the conformity between the image and its caption. To address these challenges, we introduce a new model that explicitly models individual objects within an image and a new evaluation metric called Semantic Object Accuracy (SOA) that specifically evaluates images given an image caption. SOA uses a pre-trained object detector to evaluate whether a generated image contains objects that are mentioned in the image caption, e.g., whether an image generated from "a car driving down the street" contains a car. We perform a user study comparing several text-to-image models and show that our SOA metric ranks the models the same way as humans, whereas other metrics such as the Inception Score do not. Our evaluation also shows that models which explicitly model objects outperform models which only model global image characteristics.
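To make the detector-based check concrete, the following minimal sketch shows one way such a score could be computed. It is not the authors' reference implementation; the helper detect_objects and the exact per-class (SOA-C) versus per-check (SOA-I) averaging are illustrative assumptions.

    # Sketch of a detector-based caption/image consistency check.
    # detect_objects is a hypothetical wrapper around a pre-trained detector
    # (e.g. YOLO) that returns the set of class labels found in an image.
    from collections import defaultdict

    def semantic_object_accuracy(samples, detect_objects):
        """samples: iterable of (generated_image, caption_labels) pairs, where
        caption_labels are the object classes mentioned in the caption."""
        per_class_hits = defaultdict(list)  # class label -> list of 0/1 outcomes
        for image, caption_labels in samples:
            detected = detect_objects(image)
            for label in caption_labels:
                per_class_hits[label].append(1 if label in detected else 0)

        # Class-averaged score: detection rate per class, averaged over classes.
        soa_c = sum(sum(h) / len(h) for h in per_class_hits.values()) / len(per_class_hits)
        # Check-averaged score: detection rate over all (image, mentioned object) checks.
        all_checks = [x for h in per_class_hits.values() for x in h]
        soa_i = sum(all_checks) / len(all_checks)
        return soa_c, soa_i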

Highlights

  • The Inception Score (IS) is improved by 16-19%, the R-precision by 6-7%, the Semantic Object Accuracy (SOA)-C by 28-33%, the SOA-I by 22-25%, and the Fréchet Inception Distance (FID) by 20-25%.

  • Our model can generate an image containing a reasonable shape of a banana and a cup of coffee, whereas the other models only seem to generate the texture of a banana without the shape and completely ignore the cup of coffee.

Introduction

Generative adversarial networks (GANs) [1] are capable of generating realistic-looking images that adhere to characteristics described in a textual manner, e.g. an image caption. The textual description is used on multiple levels of resolution, e.g. first to obtain a coarse layout of the image at lower resolutions and then to improve the details of the image at higher resolutions. This approach has led to good results on simple, well-structured data sets containing a specific class of objects (e.g. faces, birds, or flowers) at the image center. Once images and textual descriptions become more complex, e.g. by containing more than one object and having a large variety in backgrounds and scenery settings, the image quality drops drastically. This is likely because, until recently, almost all approaches only condition on an embedding of the complete textual description, without paying attention to individual objects. Generating complex scenes containing multiple objects from a variety of classes is still a challenging problem.
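As a rough illustration of this multi-resolution conditioning, the sketch below conditions a coarse low-resolution generator and a higher-resolution refinement stage on the same caption embedding. It is a minimal PyTorch sketch under simplifying assumptions, not the architecture proposed in this work; all module names and sizes are illustrative.

    import torch
    import torch.nn as nn

    class CoarseStage(nn.Module):
        """Maps noise + caption embedding to a coarse 64x64 image layout."""
        def __init__(self, noise_dim=100, text_dim=256):
            super().__init__()
            self.fc = nn.Linear(noise_dim + text_dim, 64 * 8 * 8)
            self.up = nn.Sequential(                      # 8x8 -> 64x64
                nn.Upsample(scale_factor=8, mode="nearest"),
                nn.Conv2d(64, 3, kernel_size=3, padding=1),
                nn.Tanh(),
            )

        def forward(self, noise, text_emb):
            h = self.fc(torch.cat([noise, text_emb], dim=1)).view(-1, 64, 8, 8)
            return self.up(h)

    class RefineStage(nn.Module):
        """Re-uses the caption embedding to add detail at higher resolution."""
        def __init__(self, text_dim=256):
            super().__init__()
            self.refine = nn.Sequential(                  # 64x64 -> 128x128
                nn.Conv2d(3 + text_dim, 64, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                nn.Upsample(scale_factor=2, mode="nearest"),
                nn.Conv2d(64, 3, kernel_size=3, padding=1),
                nn.Tanh(),
            )

        def forward(self, coarse_img, text_emb):
            # Broadcast the caption embedding spatially, then concatenate with the image.
            b, _, h, w = coarse_img.shape
            text_map = text_emb.view(b, -1, 1, 1).expand(b, text_emb.size(1), h, w)
            return self.refine(torch.cat([coarse_img, text_map], dim=1))

    # The same caption embedding conditions both stages.
    noise, text_emb = torch.randn(4, 100), torch.randn(4, 256)
    coarse = CoarseStage()(noise, text_emb)   # (4, 3, 64, 64)
    fine = RefineStage()(coarse, text_emb)    # (4, 3, 128, 128)

The point the sketch illustrates is that the caption embedding is injected at every resolution: first to fix a coarse layout, then to refine details.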
