Multistage text-to-image generation algorithms have achieved remarkable success, yet the images they produce often lack detail and suffer from feature loss. These methods concentrate on extracting features from images and text while relying on conventional residual blocks for post-extraction feature processing, which discards features, degrades the quality of the generated images, and demands more computation, severely limiting deployment on optical devices such as cameras and smartphones. To address these issues, a novel High-Detail Feature-Preserving Network (HDFpNet) is proposed to generate high-quality, near-realistic images from text descriptions. An initial text-to-image generation (iT2IG) module first produces initial feature maps to avoid feature loss. A fast excitation-and-squeeze feature extraction (FESFE) module then recursively generates high-detail, feature-preserving images at lower computational cost through three steps: channel excitation (CE), fast feature extraction (FFE), and channel squeeze (CS). Finally, a channel attention (CA) mechanism further enriches the feature details. Experimental results on the CUB-Bird and MS-COCO datasets demonstrate that HDFpNet outperforms state-of-the-art methods in both quantitative performance and visual quality, particularly in detail richness and feature preservation.
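The abstract names the three FESFE steps (CE, FFE, CS) and the trailing CA mechanism but does not specify their internals. The sketch below is one plausible reading, not the authors' implementation: the class name `FESFEBlock`, the expansion ratio, the use of 1x1 convolutions for CE/CS, a depthwise convolution for the "fast" FFE step, and a squeeze-and-excitation-style CA are all assumptions chosen to match the stated goal of low computational cost with feature preservation.

```python
import torch
import torch.nn as nn

class FESFEBlock(nn.Module):
    """Hypothetical sketch of one FESFE refinement step with channel attention.

    Layer choices (1x1 convs for CE/CS, depthwise 3x3 conv for FFE,
    SE-style channel attention for CA) are assumptions; the paper does
    not specify the module internals in the abstract.
    """

    def __init__(self, channels: int, expansion: int = 4):
        super().__init__()
        expanded = channels * expansion
        # CE: excite (expand) the channel dimension (assumed 1x1 conv).
        self.excite = nn.Sequential(
            nn.Conv2d(channels, expanded, kernel_size=1, bias=False),
            nn.BatchNorm2d(expanded),
            nn.ReLU(inplace=True),
        )
        # FFE: cheap spatial feature extraction (assumed depthwise conv).
        self.extract = nn.Sequential(
            nn.Conv2d(expanded, expanded, kernel_size=3, padding=1,
                      groups=expanded, bias=False),
            nn.BatchNorm2d(expanded),
            nn.ReLU(inplace=True),
        )
        # CS: squeeze back to the original channel count (assumed 1x1 conv).
        self.squeeze = nn.Conv2d(expanded, channels, kernel_size=1, bias=False)
        # CA: squeeze-and-excitation-style channel attention (assumed form).
        self.attention = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // 4, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // 4, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.squeeze(self.extract(self.excite(x)))
        out = out * self.attention(out)  # CA reweights channels to enrich detail
        return out + x                   # residual connection preserves features


# Usage: one recursive refinement pass over a 64-channel feature map.
feats = torch.randn(1, 64, 32, 32)
block = FESFEBlock(64)
refined = block(feats)
print(refined.shape)  # torch.Size([1, 64, 32, 32])
```

Applying the block recursively (feeding `refined` back in) mirrors the abstract's description of recursive high-detail generation; the residual connection and channel reweighting are the parts most directly tied to its feature-preservation claim.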