Abstract

The main challenge in text-to-image (Txt2Img) synthesis is to enrich image details while preserving semantic consistency between the text description and the generated image. To tackle this challenge, we propose a dual attention-guided state controller (DASC) mechanism for Txt2Img synthesis in this paper. Unlike conventional approaches that use average pooling to extract global semantic information, the proposed approach computes a new word-to-visual (W2V) attention together with the conventional visual-to-word (V2W) attention to form a dual attention. At each image generation stage, this dual attention extracts the local semantic information most relevant to each word and, in a recurrent manner, controls the subsequent image generation states. Guided by the local semantic information extracted by the dual attention, the proposed state controller dynamically boosts the importance of mismatched words and controls the image generation states so that the generated image is refined with rich details. Experiments on the benchmark CUB and MS-COCO datasets demonstrate the superior performance of the proposed approach in Txt2Img synthesis.
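
To make the mechanism concrete, below is a minimal sketch of how such a dual attention could be computed, in the spirit of common region-word attention formulations for Txt2Img models. All names, tensor shapes, and the `importance_boost` heuristic are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def dual_attention(word_feats: torch.Tensor, visual_feats: torch.Tensor):
    """Sketch of the dual attention described above (shapes are assumptions).

    word_feats:   (B, T, D) per-word text features
    visual_feats: (B, N, D) per-region image features, N = H * W
    """
    # Word-region similarity matrix: (B, T, N).
    sim = torch.bmm(word_feats, visual_feats.transpose(1, 2))

    # Conventional visual-to-word (V2W) attention: each image region
    # attends over the words, giving a word context per region.
    v2w = F.softmax(sim, dim=1)                               # normalize over words
    v2w_context = torch.bmm(v2w.transpose(1, 2), word_feats)  # (B, N, D)

    # Proposed word-to-visual (W2V) attention: each word attends over
    # the image regions, extracting the local semantic information
    # most relevant to that word.
    w2v = F.softmax(sim, dim=2)                               # normalize over regions
    w2v_context = torch.bmm(w2v, visual_feats)                # (B, T, D)

    return v2w_context, w2v_context, sim

def importance_boost(sim: torch.Tensor) -> torch.Tensor:
    """One hypothetical reading of 'dynamic importance boosting': words
    whose best region match is weak (i.e., mismatched words) receive a
    larger weight, so later stages can attend to them more strongly."""
    best_match = sim.max(dim=2).values      # (B, T): best region score per word
    return F.softmax(-best_match, dim=1)    # weaker match -> higher weight
```

In this sketch, the per-word weights from `importance_boost` could modulate the W2V context before it feeds the state controller of the next refinement stage; this is one plausible realization under the stated assumptions, not the authors' exact design.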
