Abstract
High-quality colorization of grayscale images using text descriptions presents a significant challenge, especially in accurately coloring small objects. Existing methods have two major flaws. First, text descriptions typically omit object size information, so the resulting text features often lack semantics that reflect object sizes. Second, these methods identify coloring areas by relying solely on low-resolution visual features from the U-Net encoder, failing to effectively leverage the fine-grained information provided by high-resolution visual features. To address these issues, we introduce the Semantic-Enhanced Multi-scale Approach for Text-Guided Grayscale Image Colorization (SEMACOL). We first introduce a Cross-Modal Text Augmentation module that incorporates grayscale image information into text features, enabling accurate perception of the sizes of objects mentioned in text descriptions. Subsequently, we propose a Multi-scale Content Location module, which utilizes multi-scale features to precisely identify coloring areas within grayscale images. Meanwhile, we incorporate a Text-Influenced Colorization Adjustment module to effectively adjust colorization based on text descriptions. Finally, we implement a Dynamic Feature Fusion Strategy, which dynamically refines outputs from both the Multi-scale Content Location and Text-Influenced Colorization Adjustment modules, ensuring a coherent colorization process. SEMACOL demonstrates remarkable performance improvements over existing state-of-the-art methods on public datasets. Specifically, SEMACOL achieves a PSNR of 25.695, SSIM of 0.92240, LPIPS of 0.156, and FID of 17.54, surpassing the previous best results (PSNR: 25.511, SSIM: 0.92104, LPIPS: 0.157, FID: 26.93). The code will be available at https://github.com/ChchNiu/SEMACOL.