Accurate classification of benign and malignant pulmonary nodules using chest computed tomography (CT) images is important for early diagnosis and treatment of lung cancer. In terms of natural image classification, the ViT-based model has greater advantages in extracting global features than the traditional CNN model. However, due to the small image dataset and low image resolution, it is difficult to directly apply the ViT-based model to pulmonary nodule classification. To propose and test a new ViT-based MSM-ViT model aiming to achieve good performance in classifying pulmonary nodules. In this study, CNN structure was used in the task of classifying pulmonary nodules to compensate for the poor generalization of ViT structure and the difficulty in extracting multi-scale features. First, sub-pixel fusion was designed to improve the ability of the model to extract tiny features. Second, multi-scale local features were extracted by combining dilated convolution with ordinary convolution. Finally, MobileViT module was used to extract global features and predict them at the spatial level. CT images involving 442 benign nodules and 406 malignant nodules were extracted from LIDC-IDRI data set to verify model performance, which yielded the best accuracy of 94.04% and AUC value of 0.9636 after 10 cross-validations. The proposed new model can effectively extract multi-scale local and global features. The new model performance is also comparable to the most advanced models that use 3D volume data training, but its occupation of video memory (training resources) is less than 1/10 of the conventional 3D models.