Infrared and visible image fusion aims to combine images of the same scene captured by sensors of different modalities to enhance scene understanding. Deep learning has proven powerful for image fusion thanks to its strong generalization, robustness, and ability to represent deep features. However, the performance of deep learning-based methods depends heavily on illumination conditions: in dark or over-exposed scenes in particular, the fused results are over-smoothed and low in contrast, leading to inaccurate object detection. To address these issues, this paper develops a multi-stage feature learning approach with a channel-spatial attention mechanism, named MSCS, for infrared and visible image fusion. MSCS consists of four key procedures. First, the infrared and visible images are decomposed into illumination and reflectance components by a proposed network called Retinex_Net. Then, the components are fed into an encoder for feature encoding. Next, an adaptive fusion module with attention mechanisms fuses the encoded features. Finally, the fused image is generated by a decoder that decodes the fused features. In addition, a novel fusion loss function and a multi-stage training strategy are proposed to train these modules. Subjective and objective experimental results on the TNO, LLVIP, and MSRS datasets show that the proposed method is effective and outperforms state-of-the-art fusion methods, producing visually pleasing results in dark or over-exposed scenes. Further object detection experiments on the fused images demonstrate that the fusion outputs produced by MSCS are more beneficial for detection tasks.
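The abstract gives no implementation details of the adaptive fusion module, so the following is only a minimal PyTorch sketch of what a channel-spatial attention fusion block of the kind described might look like, using CBAM-style channel and spatial gating over concatenated infrared and visible features. The class name, layer choices, and hyperparameters are illustrative assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn


class ChannelSpatialAttentionFusion(nn.Module):
    """Schematic fusion of infrared and visible feature maps with
    channel attention followed by spatial attention (hypothetical)."""

    def __init__(self, channels: int):
        super().__init__()
        # Channel attention: global average pooling -> bottleneck MLP -> sigmoid gate
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, 2 * channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Spatial attention: per-pixel gate from pooled channel statistics
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )
        # Project the gated, concatenated features back to the input width
        self.project = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, feat_ir: torch.Tensor, feat_vis: torch.Tensor) -> torch.Tensor:
        # Concatenate modality features along the channel axis
        x = torch.cat([feat_ir, feat_vis], dim=1)
        # Reweight channels, then reweight spatial positions
        x = x * self.channel_gate(x)
        avg_map = x.mean(dim=1, keepdim=True)
        max_map = x.amax(dim=1, keepdim=True)
        x = x * self.spatial_gate(torch.cat([avg_map, max_map], dim=1))
        return self.project(x)


# Example usage with 64-channel encoder features (shapes are arbitrary):
fusion = ChannelSpatialAttentionFusion(channels=64)
fused = fusion(torch.randn(1, 64, 128, 128), torch.randn(1, 64, 128, 128))
```

In the full pipeline described by the abstract, such a block would sit between the shared encoder and the decoder, with the Retinex_Net decomposition applied beforehand; those surrounding stages are omitted here since the paper does not specify their structure.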