Abstract

Existing deep learning-based methods typically follow either image-level or feature-level fusion frameworks that extract features uniformly or separately, overlooking specialized interactive information learning, which can limit fusion performance. To address this challenge, we devise a powerful fusion baseline built on adaptive interactive Transformer learning, named AITFuse. Unlike previous methods, our network alternately captures local and global relationships through collaborative learning of a CNN and a Transformer. In particular, we propose a cascaded token-wise and channel-wise Vision Transformer architecture with different attention mechanisms to model long-range contexts, allowing features to communicate across tokens and across independent channels in an interactive manner. On this basis, a modal-specific feature rectification module employs self-attention to refine distinctive features within the same domain for efficient encoding, while a cross-modal feature integration module uses cross-attention to fuse complementary characteristics from different domains for multi-level decoding. In addition, we discard learned position embeddings, so the model can process images of arbitrary size without splitting operations. Extensive experiments on mainstream datasets and downstream tasks demonstrate the rationality and superiority of AITFuse. The code will be available at https://github.com/Zhishe-Wang/AITFuse.
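To make the attention patterns named in the abstract concrete, the sketch below illustrates the three ingredients in PyTorch: token-wise self-attention over spatial positions, channel-wise self-attention over feature channels, and cross-attention between modalities. It is a minimal illustration under our own assumptions; the module names, head counts, and scaling factor are hypothetical and do not reproduce the authors' implementation (see the repository linked above for the official code).

```python
import torch
import torch.nn as nn


class TokenWiseAttention(nn.Module):
    """Self-attention across spatial tokens: each of the H*W positions attends to all others."""

    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                          # x: (B, C, H, W)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)      # (B, H*W, C)
        out, _ = self.attn(tokens, tokens, tokens)
        return out.transpose(1, 2).reshape(b, c, h, w)


class ChannelWiseAttention(nn.Module):
    """Self-attention across channels: a C x C affinity that is independent of spatial size."""

    def __init__(self, dim):
        super().__init__()
        self.qkv = nn.Conv2d(dim, dim * 3, kernel_size=1)
        self.proj = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, x):                          # x: (B, C, H, W)
        b, c, h, w = x.shape
        q, k, v = self.qkv(x).flatten(2).chunk(3, dim=1)        # each (B, C, H*W)
        # Hypothetical scaling choice; the channel affinity matrix is (B, C, C).
        attn = torch.softmax(q @ k.transpose(1, 2) / (h * w) ** 0.5, dim=-1)
        out = (attn @ v).reshape(b, c, h, w)
        return self.proj(out)


class CrossModalIntegration(nn.Module):
    """Cross-attention: tokens of one modality query the tokens of the other."""

    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, ir_feat, vis_feat):          # both (B, C, H, W)
        b, c, h, w = ir_feat.shape
        q = ir_feat.flatten(2).transpose(1, 2)     # queries from infrared features
        kv = vis_feat.flatten(2).transpose(1, 2)   # keys/values from visible features
        fused, _ = self.attn(q, kv, kv)
        return fused.transpose(1, 2).reshape(b, c, h, w)


if __name__ == "__main__":
    ir = torch.randn(1, 32, 60, 80)                # arbitrary, non-square spatial size
    vis = torch.randn(1, 32, 60, 80)
    x = TokenWiseAttention(32)(ir)
    x = ChannelWiseAttention(32)(x)
    print(CrossModalIntegration(32)(x, vis).shape)  # torch.Size([1, 32, 60, 80])
```

Note that the channel-wise branch builds a C x C affinity and uses no positional encoding, so its cost does not grow quadratically with spatial resolution; this is consistent with the abstract's claim of handling images of arbitrary size without splitting operations.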
