Abstract
Existing deep learning-based methods often follow either an image-level or a feature-level fusion framework to extract features uniformly or separately, neglecting specialized interactive information learning, which may limit fusion performance. To tackle this challenge, we devise a strong fusion baseline based on adaptive interactive Transformer learning, named AITFuse. Unlike previous methods, our network alternately incorporates local and global relationships through the collaborative learning of CNN and Transformer. In particular, we propose a cascaded token-wise and channel-wise Vision Transformer architecture with different attention mechanisms to model long-range contexts and allow feature communication across different tokens and independent channels in an interactive manner. On this basis, the modal-specific feature rectification module employs a self-attention operation to revise distinctive features within the same domain for efficient encoding. Meanwhile, the cross-modal feature integration module constructs a cross-attention mechanism to fuse complementary characteristics from different domains for multi-level decoding. In addition, we discard the learned position embedding so that our fusion model can handle images of arbitrary size without splitting operations. Extensive experiments on mainstream datasets and downstream tasks demonstrate the rationality and superiority of our AITFuse. The code will be available at https://github.com/Zhishe-Wang/AITFuse.
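To make the attention patterns named in the abstract concrete, the following is a minimal sketch (not the authors' released code): token-wise attention mixes spatial tokens, channel-wise attention builds a C x C affinity so channels exchange information independently of image size, and cross-attention lets one modality query the other. All class names, dimensions, and the single-head channel attention are illustrative assumptions.

```python
# Illustrative sketch only; the official implementation is at
# https://github.com/Zhishe-Wang/AITFuse.
import torch
import torch.nn as nn


class TokenWiseAttention(nn.Module):
    """Self-attention across spatial tokens (long-range spatial context)."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                                  # x: (B, N, C)
        out, _ = self.attn(x, x, x)
        return x + out                                     # residual connection


class ChannelWiseAttention(nn.Module):
    """Attention over the channel axis: the affinity matrix is (C x C),
    so the cost does not depend on the number of tokens / image size."""
    def __init__(self, dim):
        super().__init__()
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                                  # x: (B, N, C)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = nn.functional.normalize(q, dim=1)              # L2-normalize over tokens
        k = nn.functional.normalize(k, dim=1)
        attn = (q.transpose(1, 2) @ k).softmax(dim=-1)     # (B, C, C)
        out = (attn @ v.transpose(1, 2)).transpose(1, 2)   # (B, N, C)
        return x + self.proj(out)


class CrossModalFusion(nn.Module):
    """Cross-attention: one modality supplies the queries, the other
    supplies keys/values, so complementary features are integrated."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x_a, x_b):                           # both (B, N, C)
        fused, _ = self.attn(query=x_a, key=x_b, value=x_b)
        return x_a + fused


if __name__ == "__main__":
    B, H, W, C = 1, 32, 48, 64                 # arbitrary spatial size; no
    ir = torch.randn(B, H * W, C)              # positional embedding is used
    vis = torch.randn(B, H * W, C)
    ir = TokenWiseAttention(C)(ir)             # intra-modal token mixing
    ir = ChannelWiseAttention(C)(ir)           # intra-modal channel mixing
    fused = CrossModalFusion(C)(ir, vis)       # cross-modal integration
    print(fused.shape)                         # torch.Size([1, 1536, 64])
```

Because the channel-wise affinity is computed over channels rather than tokens, and no learned position embedding is involved, this sketch accepts feature maps of any spatial resolution, in line with the abstract's claim about arbitrary image sizes.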