Multi-modal image fusion plays a crucial role in various visual systems. However, existing methods typically involve a multi-stage pipeline, i.e., feature extraction, integration, and reconstruction, which limits the effectiveness and efficiency of feature interaction and aggregation. In this paper, we propose MixFuse, a compact Transformer-based multi-modal image fusion framework that smoothly unifies feature extraction and integration. At its core, the Mix Attention Transformer Block (MATB) integrates the Cross-Attention Transformer Module (CATM) and the Self-Attention Transformer Module (SATM). The CATM introduces a symmetrical cross-attention mechanism to identify modality-specific and shared features while filtering out irrelevant and redundant information. The SATM then refines the combined features via self-attention, enhancing their internal consistency and preserving salient content. These successive cross- and self-attention modules work together to produce more accurate and refined feature maps, which are essential for the subsequent reconstruction. Extensive evaluation on five public datasets shows that MixFuse outperforms state-of-the-art methods in both performance and adaptability. The code and model will be released at https://github.com/Bitlijinfu/MixFuse.
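The abstract describes the MATB as a cross-attention fusion stage (CATM) followed by a self-attention refinement stage (SATM). The sketch below illustrates one plausible way to compose such a block in PyTorch; the class names are taken from the abstract, but every layer choice, tensor shape, and hyperparameter is an assumption for illustration, not the released MixFuse implementation.

```python
# Illustrative sketch only: a plausible MATB composing symmetrical cross-attention
# (CATM) with self-attention refinement (SATM). All layer choices, shapes, and
# hyperparameters are assumptions, not the authors' released code.
import torch
import torch.nn as nn


class CATM(nn.Module):
    """Symmetrical cross-attention between two modality feature sequences (assumed design)."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn_ab = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attn_ba = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_a = nn.LayerNorm(dim)
        self.norm_b = nn.LayerNorm(dim)

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
        # Modality A queries modality B and vice versa (symmetrical cross-attention).
        a2b, _ = self.attn_ab(self.norm_a(feat_a), self.norm_b(feat_b), self.norm_b(feat_b))
        b2a, _ = self.attn_ba(self.norm_b(feat_b), self.norm_a(feat_a), self.norm_a(feat_a))
        # Residual connections retain modality-specific content; the sum forms the fused tokens.
        return (feat_a + a2b) + (feat_b + b2a)


class SATM(nn.Module):
    """Self-attention refinement of the fused features (assumed design)."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]
        return x + self.mlp(self.norm2(x))


class MATB(nn.Module):
    """Mix Attention Transformer Block: cross-attention fusion followed by self-attention refinement."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.catm = CATM(dim, num_heads)
        self.satm = SATM(dim, num_heads)

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
        return self.satm(self.catm(feat_a, feat_b))


if __name__ == "__main__":
    # Two modalities as token sequences: (batch, tokens, channels); values are placeholders.
    ir, vis = torch.randn(2, 64, 96), torch.randn(2, 64, 96)
    fused = MATB(dim=96)(ir, vis)
    print(fused.shape)  # torch.Size([2, 64, 96])
```

The intent of the sketch is only to show how a single block can subsume extraction and integration: cross-attention exchanges information between modalities, and the subsequent self-attention refines the fused representation before reconstruction.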