To address the inadequate preservation of prominent targets, poor retention of texture details, and unsatisfactory reconstruction of image backgrounds in image fusion, this paper proposes a multi-scale residual attention fusion network guided by semantic segmentation, termed MRASFusion. First, segmentation masks produced by a Swin Transformer, which offers high precision and strong scalability, are adopted to avoid the inefficiency and error of manually created segmentation masks. The mask generated by semantic segmentation is used to construct a loss function that guides the image fusion process. Second, to maintain the integrity of contextual information and texture details, a new feature extraction module is proposed to fully extract meaningful features. Finally, the fused image is obtained by reconstructing the extracted features. To verify the effectiveness of the method, MRASFusion is compared qualitatively and quantitatively with nine state-of-the-art fusion methods on the TNO and RoadScene datasets. Experimental results indicate that our method achieves satisfactory performance in image fusion tasks, with superior preservation of target information and retention of texture details. Furthermore, our fusion results improve performance on advanced vision tasks, namely higher object detection accuracy, which provides a better foundation for solving real-world problems.