Abstract

Most current deep learning based image fusion methods rely heavily on convolutional operations for feature extraction. Recently, some Transformer-based image fusion models have emerged, but most of them design complex attention mechanisms and still depend heavily on convolutions for local feature modeling. To address this issue, this paper proposes a novel and simple split-head dense Transformer based infrared and visible image fusion network, termed SDTFusion. It consists of three parts: a feature extraction module, an inter-gating fusion module, and a reconstruction module. In particular, the feature extraction module is a pure Transformer network in which an interactive split-head attention mechanism is designed to model uni-modal and cross-modal long-range dependencies and to promote cross-modal information extraction. Dense connections between Transformer blocks facilitate the reuse of feature maps. In the fusion module, the inter-gating mechanism is formulated as the element-wise product of cross-modal information, which retains competitive infrared brightness and distinct visible details. Moreover, a learnable detail injection module built on a cross-attention mechanism injects fine-grained bi-modal information into multiple layers of the reconstruction module. Extensive experiments on three benchmark datasets show that SDTFusion achieves superior fusion performance compared with nine state-of-the-art methods. In addition, its strong results on semantic segmentation and object detection further demonstrate the advantage of our framework in supporting downstream vision tasks. A rough illustrative sketch of the inter-gating idea is given below.
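The following is a minimal sketch of the inter-gating fusion described in the abstract, where each modality's features are modulated by the other modality via an element-wise product. The module name, the sigmoid 1x1-convolution gates, the channel count, and the final summation are assumptions made for illustration only; the abstract does not specify the exact layer layout.

```python
import torch
import torch.nn as nn

class InterGatingFusion(nn.Module):
    """Illustrative inter-gating fusion (assumed design, not the paper's exact module):
    each modality's features are gated by the other modality through an
    element-wise product, then the two gated branches are merged."""

    def __init__(self, channels: int = 64):
        super().__init__()
        # Hypothetical 1x1-convolution gates producing per-modality weights in [0, 1]
        self.gate_ir = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.Sigmoid())
        self.gate_vis = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.Sigmoid())

    def forward(self, feat_ir: torch.Tensor, feat_vis: torch.Tensor) -> torch.Tensor:
        # Element-wise product of cross-modal information:
        # infrared features gated by the visible branch, and vice versa.
        fused_ir = feat_ir * self.gate_vis(feat_vis)
        fused_vis = feat_vis * self.gate_ir(feat_ir)
        # Merge the two gated branches (simple sum here, as an assumption).
        return fused_ir + fused_vis

# Toy usage with random feature maps
if __name__ == "__main__":
    fuse = InterGatingFusion(channels=64)
    ir = torch.randn(1, 64, 32, 32)
    vis = torch.randn(1, 64, 32, 32)
    print(fuse(ir, vis).shape)  # torch.Size([1, 64, 32, 32])
```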
