Multimodal neuroimaging provides complementary information that is critical for accurate early diagnosis of Alzheimer's disease (AD). However, the inherent variability among multimodal neuroimages hinders the effective fusion of multimodal features. Moreover, achieving reliable and interpretable diagnosis through multimodal fusion remains challenging. To address these issues, we propose a novel multimodal diagnosis network based on multi-fusion and disease-induced learning (MDL-Net) to enhance early AD diagnosis by efficiently fusing multimodal data. Specifically, MDL-Net incorporates a multi-fusion joint learning (MJL) module, which effectively fuses multimodal features and enhances the feature representation from global, local, and latent learning perspectives. MJL consists of three submodules: a global-aware learning (GAL) module, a local-aware learning (LAL) module, and an outer latent-space learning (LSL) module. GAL learns global relationships among the modalities via a self-adaptive Transformer (SAT). LAL constructs local-aware convolutions to learn local associations. LSL introduces latent information through an outer-product operation to further enhance the feature representation. To improve interpretability, MDL-Net integrates a disease-induced region-aware learning (DRL) module that uses gradient-based weights to iteratively learn weight matrices and identify AD-related brain regions. Extensive experiments on public datasets confirm the superiority of the proposed method. Our code will be available at: https://github.com/qzf0320/MDL-Net.
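To make the three-branch fusion idea concrete, below is a minimal PyTorch sketch of how global attention, local convolution, and an outer-product latent interaction could be combined over two modality feature sets. All names, tensor shapes, and layer choices (e.g., `MultiFusionSketch`, a standard `nn.MultiheadAttention` standing in for the self-adaptive Transformer, and a depthwise convolution for the local branch) are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of the multi-fusion idea: global, local, and
# outer-product latent branches over two modality feature maps.
import torch
import torch.nn as nn


class MultiFusionSketch(nn.Module):
    """Illustrative fusion of two modality token sequences (e.g., MRI and PET)."""

    def __init__(self, dim: int = 64, num_heads: int = 4, latent_dim: int = 16):
        super().__init__()
        # Global branch: multi-head self-attention over concatenated modality tokens
        # (a stand-in for the paper's self-adaptive Transformer).
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Local branch: depthwise 1D convolution capturing neighborhood associations.
        self.local_conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1, groups=dim)
        # Latent branch: project the flattened outer product of pooled modality vectors.
        self.reduce = nn.Linear(dim, latent_dim)
        self.latent_proj = nn.Linear(latent_dim * latent_dim, dim)
        self.out = nn.Linear(3 * dim, dim)

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
        # feat_a, feat_b: (batch, tokens, dim) features from two modalities.
        tokens = torch.cat([feat_a, feat_b], dim=1)              # (B, 2T, D)

        # Global-aware branch: cross-modal relationships via self-attention.
        g, _ = self.attn(tokens, tokens, tokens)
        g = g.mean(dim=1)                                        # (B, D)

        # Local-aware branch: convolution along the token axis.
        l = self.local_conv(tokens.transpose(1, 2)).mean(dim=2)  # (B, D)

        # Outer latent-space branch: outer product of pooled modality vectors.
        a = self.reduce(feat_a.mean(dim=1))                      # (B, latent)
        b = self.reduce(feat_b.mean(dim=1))                      # (B, latent)
        outer = torch.einsum("bi,bj->bij", a, b).flatten(1)      # (B, latent^2)
        s = self.latent_proj(outer)                              # (B, D)

        # Concatenate the three views and project to a fused representation.
        return self.out(torch.cat([g, l, s], dim=1))             # (B, D)


if __name__ == "__main__":
    fusion = MultiFusionSketch()
    mri = torch.randn(2, 32, 64)   # toy MRI-derived tokens
    pet = torch.randn(2, 32, 64)   # toy PET-derived tokens
    print(fusion(mri, pet).shape)  # torch.Size([2, 64])
```

This sketch only mirrors the structure implied by the abstract (three complementary fusion views combined into one representation); the actual SAT, local-aware convolution, and DRL weighting are detailed in the paper and released code.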