Abstract

Recent research on text-guided image style transfer with CLIP (Contrastive Language-Image Pre-training) models has made substantial progress. Existing methods avoid additional generative models, but they cannot guarantee the quality of the generated images and often suffer from content distortion and uneven stylization. To address these problems, this work proposes TextStyler, a CLIP-based model for text-guided style transfer. TextStyler contains a style transformation network, STNet, composed of an encoder and a multi-scale decoder that captures the hierarchical features of the content image; a decoder feature fusion module built on a channel attention mechanism helps the network retain the fine detail of the content image while transferring texture. In addition, we design a patch-wise perceptual loss that transfers the target texture to every local region of the image and improves the uniformity of stylization. Experimental results show that TextStyler achieves a wider range of styles than existing methods that rely on style images, and that the generated artistic images accord better with human visual perception than those of state-of-the-art text-guided style transfer methods.
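The abstract does not specify the exact layout of the decoder feature fusion module. As a rough illustration of the general idea of channel-attention-based fusion of encoder and decoder features, the following minimal PyTorch sketch re-weights channels in a squeeze-and-excitation style before projecting back to the decoder width. The class name ChannelAttentionFusion, the reduction ratio, and the layer choices are assumptions for illustration, not the authors' implementation.

# Hedged sketch: channel-attention fusion of decoder and encoder features.
# All names and hyperparameters here are hypothetical.
import torch
import torch.nn as nn

class ChannelAttentionFusion(nn.Module):
    """Fuse a decoder feature map with the corresponding encoder feature map,
    re-weighting channels so that content detail is preserved during texture transfer."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)            # squeeze: one value per channel
        self.gate = nn.Sequential(                     # excitation: per-channel weights in (0, 1)
            nn.Conv2d(2 * channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, 2 * channels, kernel_size=1),
            nn.Sigmoid(),
        )
        self.project = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, decoder_feat, encoder_feat):
        x = torch.cat([decoder_feat, encoder_feat], dim=1)   # (B, 2C, H, W)
        weights = self.gate(self.pool(x))                    # (B, 2C, 1, 1)
        return self.project(x * weights)                     # (B, C, H, W)

# Example: fuse 64-channel features at one decoder scale.
fuse = ChannelAttentionFusion(channels=64)
out = fuse(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32))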
