Image fusion enhances a single image by integrating complementary information from multiple source images. Existing end-to-end fusion methods often suffer from overfitting or require intricate parameter tuning because task-specific training data are scarce. To address this, two-stage approaches employ encoder–decoder networks trained on large natural-image datasets, yet their performance is limited by domain disparities. In this work, we devise a novel encoder–decoder fusion framework and introduce a self-supervised training scheme based on destruction–reconstruction. To facilitate task-specific feature learning, we propose three auxiliary tasks: pixel-intensity non-linear transformation for multi-modal fusion, brightness transformation for multi-exposure fusion, and noise transformation for multi-focus fusion. By randomly selecting one of these tasks at each training step, the different fusion tasks mutually reinforce one another, enhancing the generalizability of the network. We further design an encoder that combines a Convolutional Neural Network (CNN) and a Transformer to extract both local and global features. Rigorous evaluations against 11 traditional and deep-learning-based methods span four benchmark datasets covering infrared–visible fusion, medical image fusion, multi-exposure fusion, and multi-focus fusion. Comprehensive assessments based on nine metrics from diverse viewpoints consistently demonstrate the superior performance of our approach in all scenarios. We will make our code, datasets, and fused images publicly available.
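
To make the destruction–reconstruction idea concrete, the following is a minimal PyTorch-style sketch of sampling one auxiliary degradation per training step and reconstructing the clean input; the degradation functions, the `encoder`/`decoder` names, and the L1 reconstruction loss are illustrative assumptions rather than the paper's released implementation.

```python
# Minimal sketch of destruction–reconstruction self-supervision.
# All function names and hyperparameters below are hypothetical examples.
import random
import torch
import torch.nn.functional as F

def nonlinear_intensity(x, gamma_range=(0.5, 2.0)):
    """Pixel-intensity non-linear (gamma) transformation -- auxiliary task for multi-modal fusion."""
    gamma = random.uniform(*gamma_range)
    return x.clamp(0, 1) ** gamma

def brightness_shift(x, delta_range=(-0.3, 0.3)):
    """Global brightness transformation -- auxiliary task for multi-exposure fusion."""
    delta = random.uniform(*delta_range)
    return (x + delta).clamp(0, 1)

def add_noise(x, sigma=0.1):
    """Additive Gaussian noise transformation -- auxiliary task for multi-focus fusion."""
    return (x + sigma * torch.randn_like(x)).clamp(0, 1)

AUX_TASKS = [nonlinear_intensity, brightness_shift, add_noise]

def train_step(encoder, decoder, optimizer, clean_batch):
    """Destroy a clean image with one randomly chosen auxiliary task,
    then train the encoder–decoder to reconstruct the original."""
    destroy = random.choice(AUX_TASKS)            # one task sampled per step
    corrupted = destroy(clean_batch)
    reconstructed = decoder(encoder(corrupted))
    loss = F.l1_loss(reconstructed, clean_batch)  # reconstruction objective (assumed)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this sketch, random task selection means a single shared encoder–decoder sees all three degradation types over the course of training, which is one way the auxiliary tasks can reinforce each other.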