Alzheimer’s disease (AD) is an irreversible degenerative disease of the central nervous system, and mild cognitive impairment (MCI) is viewed as the preliminary stage of AD, with a notably intricate pathogenesis. MCI comprises two stages: early MCI (EMCI) and late MCI (LMCI). Timely diagnosis of EMCI enables intervention before EMCI progresses to LMCI and then to AD. Therefore, accurate diagnosis of EMCI/LMCI is crucial for developing early intervention and treatment strategies for AD. Most existing EMCI/LMCI diagnostic methods use single-modality images, even though different modalities carry complementary information that aids accurate diagnosis. Moreover, lesions are usually not confined to a single brain area but involve multiple regions, so conventional convolution operations cannot accurately extract the pathological features of AD. In this work, we propose a novel Multi-scale fully Separable Convolution neural network with Large Kernels (MSCLK) to diagnose early Alzheimer’s disease from structural Magnetic Resonance Imaging (sMRI) images. MSCLK mainly consists of multi-scale 3D fully separable convolution modules and a deep metric learning module. The multi-scale convolution, which contains both small and large kernels, captures discriminative features from receptive fields of different scales. The 3D fully separable convolution reduces the number of parameters and the risk of overfitting. The deep metric learning module helps distinguish hard samples that are similar but belong to different classes. We also propose a variant of MSCLK (called MSCLK-Fusion MRI and PET, MSCLK-FMP) that adds a pixel-level fusion module and a feature-level fusion module to the MSCLK framework, integrating the sMRI image and the Positron Emission Computed Tomography (PET) image to further improve accuracy on the EMCI vs. LMCI classification task.
The pixel-level fusion module fuses the sMRI and PET images early, at the pixel level, whereas the feature-level fusion module fuses their high-dimensional features. Experimental results on the ADNI database show that MSCLK and MSCLK-FMP outperform other state-of-the-art methods. MSCLK achieves accuracies of 98.89%, 95.97%, 96.39%, and 98.76% on the AD vs. EMCI, AD vs. LMCI, EMCI vs. NC, and LMCI vs. NC classification tasks, respectively, and MSCLK-FMP achieves an accuracy of 93.93% on the EMCI vs. LMCI classification task, indicating that MSCLK and MSCLK-FMP can be effectively used for diagnosing MCI patients. Moreover, MSCLK-FMP is capable of pinpointing key brain areas involved in the pathological progression of MCI, such as the inferior temporal gyrus (Temporal_Inf), the hippocampus, the precuneus, the precentral gyrus (Precentral), and the thalamus. These findings may help elucidate the early pathogenesis of AD.
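
The parameter savings that motivate separable convolutions with large kernels can be illustrated with a simple weight count. The sketch below is not the MSCLK implementation. It assumes that "fully separable" means a depthwise convolution factorised into three 1D kernels (one per spatial axis) followed by a 1×1×1 pointwise convolution, and the channel counts and kernel sizes are hypothetical values chosen only for illustration. Bias terms are ignored.

```python
def conv3d_params(c_in, c_out, k):
    """Weights in a standard 3D convolution with a k x k x k kernel."""
    return c_in * c_out * k ** 3


def depthwise_separable_params(c_in, c_out, k):
    """Depthwise k x k x k filter per input channel, then a 1x1x1 pointwise mix."""
    return c_in * k ** 3 + c_in * c_out


def fully_separable_params(c_in, c_out, k):
    """ASSUMPTION (not confirmed by the abstract): the depthwise filter is
    further split into three 1D kernels (k x 1 x 1, 1 x k x 1, 1 x 1 x k),
    followed by a 1x1x1 pointwise convolution."""
    return c_in * 3 * k + c_in * c_out


def multi_scale_params(count_fn, c_in, c_out, kernel_sizes):
    """Total weights of parallel branches, one branch per kernel size."""
    return sum(count_fn(c_in, c_out, k) for k in kernel_sizes)


if __name__ == "__main__":
    # Hypothetical layer configuration, for illustration only.
    c_in, c_out = 32, 64
    kernel_sizes = (3, 7)  # one small and one large kernel

    schemes = [
        ("standard", conv3d_params),
        ("depthwise separable", depthwise_separable_params),
        ("fully separable (assumed)", fully_separable_params),
    ]
    for name, count_fn in schemes:
        total = multi_scale_params(count_fn, c_in, c_out, kernel_sizes)
        print(f"{name:>26}: {total} weights")
```

Under these assumptions, the cost of a standard 3D convolution grows with the cube of the kernel size, while the factorised form grows only linearly in it, which is why large kernels become affordable once the convolution is separated.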