ABSTRACT
Multimodal medical image fusion aims to aggregate complementary information from medical images acquired with different modalities. Existing research in image fusion faces several major limitations, including a scarcity of paired data, noisy and inconsistent modalities, a lack of contextual relationships, and suboptimal feature extraction and fusion techniques. In response to these challenges, this research proposes a novel adaptive fusion approach. Our knowledge distillation (KD) model extracts informative features from multimodal medical images using several key components. A teacher network is employed for its capacity to capture high-level abstract features. Soft labels are used to transfer knowledge from the teacher network to the student network; during student training, we minimise the divergence between the teacher's soft labels and the student's predictions. To fuse the extracted features from the different modalities adaptively, we apply a self-attention mechanism, trained by minimising a loss function that encourages the attention scores to capture relevant contextual relationships between features. Additionally, a cross-modal consistency module aligns the extracted features to ensure spatial consistency and meaningful fusion. Our adaptive fusion strategy effectively combines features to enhance the diagnostic value and quality of the fused images. We employ a generator-discriminator architecture to synthesise fused images and to distinguish real from generated fused images. A comprehensive analysis is conducted using diverse evaluation measures. Experimental results demonstrate improved fusion outcomes with values of 0.92, 41.58, 7.25, 0.958, 0.759, 0.947, 0.90, 7.05, 0.0726, and 76 s for SSIM, PSNR, FF, VIF, UIQI, FMI, EITF, entropy, RMSE, and execution time, respectively.
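As a concrete illustration of two of the components summarised above, the minimal PyTorch sketch below shows a conventional soft-label distillation loss and a simplified attention-weighted fusion of two modality feature maps. The temperature, module names, and tensor shapes are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch, assuming a standard PyTorch setup:
# (1) soft-label distillation loss between teacher and student outputs,
# (2) attention-weighted fusion of features from two modalities.
import torch
import torch.nn as nn
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """KL divergence between softened teacher and student distributions."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=1)
    # Scale by T^2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2


class AttentionFusion(nn.Module):
    """Fuse per-modality feature maps with learned spatial attention weights."""

    def __init__(self, channels):
        super().__init__()
        # One attention score per modality at every spatial location.
        self.score = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, feat_a, feat_b):
        # feat_a, feat_b: (B, C, H, W) features extracted from the two modalities.
        scores = torch.cat([self.score(feat_a), self.score(feat_b)], dim=1)  # (B, 2, H, W)
        weights = torch.softmax(scores, dim=1)                               # sum to 1 per pixel
        fused = weights[:, :1] * feat_a + weights[:, 1:] * feat_b
        return fused


if __name__ == "__main__":
    student_logits = torch.randn(8, 10)
    teacher_logits = torch.randn(8, 10)
    print("KD loss:", distillation_loss(student_logits, teacher_logits).item())

    fusion = AttentionFusion(channels=64)
    a, b = torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32)
    print("Fused shape:", fusion(a, b).shape)  # torch.Size([2, 64, 32, 32])
```

The fusion module here uses per-pixel softmax weighting rather than full query-key-value self-attention, and the distillation term would in practice be combined with the paper's fusion and adversarial losses.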