In this paper, we propose MSA-ESRGAN, a novel super-resolution model designed to enhance the perceptual quality of images. Its key innovation is a multi-scale attention U-Net discriminator, which differentiates more accurately between subject and background areas in images. To ensure a fair comparison with Real-ESRGAN, we initialized our generator with a pre-trained Real-ESRNet model and followed the same training setup. The model was trained on high-resolution image patches from the DIV2K dataset using the Adam optimizer, incorporating an exponential moving average (EMA) for stability and performance. Across benchmark datasets including BSD100, Set5, Set14, Urban100, and OST300, MSA-ESRGAN consistently outperforms traditional methods and several state-of-the-art super-resolution models, achieving better Natural Image Quality Evaluator (NIQE), Peak Signal-to-Noise Ratio (PSNR), and Structural Similarity Index (SSIM) scores. Subjective evaluations further confirm the improved visual quality, particularly in texture preservation and overall image realism. An ablation study highlights the critical role of the multi-scale attention U-Net discriminator in the model’s performance. The results underscore the effectiveness of MSA-ESRGAN in maintaining image naturalness and perceptual quality, providing a robust benchmark for blind super-resolution tasks.
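For readers unfamiliar with two of the terms above, the following is a minimal sketch of the standard textbook definitions of an EMA update and of PSNR. It is not the authors' implementation: the function names `ema_update` and `psnr`, the decay value of 0.999, the assumption that the EMA is applied to model weights, and the 8-bit peak value of 255 are all illustrative assumptions, and images are represented as flat lists of pixel values for simplicity.

```python
import math


def ema_update(ema_values, new_values, decay=0.999):
    """One EMA step: ema <- decay * ema + (1 - decay) * new.

    The decay of 0.999 is an assumed default, not a value from the paper.
    """
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_values, new_values)]


def psnr(reference, estimate, max_value=255.0):
    """Peak Signal-to-Noise Ratio in dB between two equally sized pixel lists.

    max_value=255.0 assumes 8-bit images (an assumption for this sketch).
    """
    if len(reference) != len(estimate) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Higher PSNR and SSIM indicate closer agreement with the reference image, whereas NIQE is a no-reference metric for which lower values indicate more natural-looking images.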