Infrared and visible image fusion (IVIF) has attracted increasing attention because of its promising results in downstream applications. However, most existing deep fusion models perform fusion at either the feature level or the image level, which leads to information loss. In this paper, we propose an interactive transformer for IVIF, termed ITFuse. In contrast to previous algorithms, ITFuse consists of feature interactive modules (FIMs) and a feature reconstruction module (FRM) that alternately extract and integrate important features. Specifically, to adequately exploit the common properties of different source images, we design a residual attention block (RAB) for mutual feature representation. To aggregate the distinct characteristics of the corresponding input images, we leverage interactive attention (ITA) to incorporate complementary information for comprehensive feature preservation and interaction. In addition, cross-modal attention (CMA) and a transformer block (TRB) are presented to fully merge the captured features and model long-range dependencies. Furthermore, we devise a pixel loss and a structural loss to train the proposed deep fusion model in an unsupervised manner for further performance improvement. Extensive experiments on popular databases demonstrate that ITFuse outperforms other representative state-of-the-art methods in both qualitative and quantitative assessments. The source code of the proposed method is available at https://github.com/tthinking/ITFuse.
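As a rough illustration of the unsupervised objective outlined above, the sketch below combines a pixel loss with an SSIM-based structural loss. The choice of pixel target (element-wise maximum of the two sources), the SSIM formulation, and the weighting factor are assumptions for illustration only and may differ from the paper's exact design.

```python
# Hypothetical sketch of a pixel + structural fusion loss for unsupervised training.
# Inputs are assumed to be single-channel tensors of shape (N, 1, H, W) in [0, 1].
import torch
import torch.nn.functional as F


def gaussian_window(size: int = 11, sigma: float = 1.5) -> torch.Tensor:
    """2-D Gaussian kernel used by SSIM, shaped (1, 1, size, size)."""
    coords = torch.arange(size, dtype=torch.float32) - (size - 1) / 2.0
    g = torch.exp(-(coords ** 2) / (2 * sigma ** 2))
    g = (g / g.sum()).unsqueeze(0)
    return (g.t() @ g).unsqueeze(0).unsqueeze(0)


def ssim(x: torch.Tensor, y: torch.Tensor, window: torch.Tensor,
         c1: float = 0.01 ** 2, c2: float = 0.03 ** 2) -> torch.Tensor:
    """Mean structural similarity between two single-channel image batches."""
    pad = window.shape[-1] // 2
    mu_x = F.conv2d(x, window, padding=pad)
    mu_y = F.conv2d(y, window, padding=pad)
    sigma_x = F.conv2d(x * x, window, padding=pad) - mu_x ** 2
    sigma_y = F.conv2d(y * y, window, padding=pad) - mu_y ** 2
    sigma_xy = F.conv2d(x * y, window, padding=pad) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2)
    return (num / den).mean()


def fusion_loss(fused: torch.Tensor, ir: torch.Tensor, vis: torch.Tensor,
                alpha: float = 1.0) -> torch.Tensor:
    """Pixel loss pulls the fused image toward the brighter source pixel;
    structural loss keeps it structurally similar to both inputs."""
    window = gaussian_window().to(fused.device)
    pixel = F.l1_loss(fused, torch.maximum(ir, vis))
    structural = 1.0 - 0.5 * (ssim(fused, ir, window) + ssim(fused, vis, window))
    return pixel + alpha * structural
```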