The fusion of multimodal longitudinal data is challenging but crucial for enhancing the accuracy of deep learning models for disease identification, and it supports tailored, patient-centric decisions. This study explores the fusion of multimodal data to detect the progression of Alzheimer’s disease (AD) using ensemble learning. We propose a heterogeneous ensemble framework of Bayesian-optimized time-series deep learning models to identify progressive deterioration of brain damage. Experimental results show that the heterogeneous ensemble of three models trained on patients’ temporal data outperforms all other ensemble variants, achieving an average accuracy of 95%. We also propose a novel explainability approach that enables domain experts and practitioners to better comprehend the model’s final decisions. The visual explanations of affected brain regions and the model’s robustness were evaluated by two medical domain experts, demonstrating promising applicability in real clinical environments. To evaluate the model’s generalizability and robustness, the optimized model was tested on a dataset with a different distribution. The experiments demonstrate that the proposed model, trained on ADNI data, generalizes reliably to NACC data with an average precision of 90%, recall of 91%, F1-score of 89%, AUC of 88%, and accuracy of 88%.
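The abstract does not specify how the outputs of the three base models are fused; a common rule for a heterogeneous ensemble is soft voting over the models' predicted class probabilities. The sketch below is a minimal illustration under that assumption, with hypothetical probability arrays standing in for the predictions of the Bayesian-optimized time-series models.

```python
import numpy as np

def soft_vote(prob_list, weights=None):
    """Fuse class probabilities from heterogeneous models (soft voting).

    prob_list: list of (n_samples, n_classes) probability arrays,
               one per base model (e.g., three time-series networks).
    weights:   optional per-model weights, e.g., validation accuracies.
    """
    probs = np.stack(prob_list)            # (n_models, n_samples, n_classes)
    fused = np.average(probs, axis=0, weights=weights)
    return fused.argmax(axis=1)            # predicted class per sample

# Hypothetical probabilities from three base models, 2 samples x 3 classes
p1 = np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]])
p2 = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
p3 = np.array([[0.7, 0.2, 0.1], [0.2, 0.2, 0.6]])

print(soft_vote([p1, p2, p3]).tolist())  # → [0, 2]
```

Weighting each model by its validation score (via the `weights` argument) is one simple way to let the better-tuned base learners dominate the fused decision.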