Rotating parts are key components of large, complex equipment, and accurate diagnosis of their faults is crucial for safe operation. To extract both local and global fault features from these components, this study proposes MultiDilatedFormer, a fault diagnosis network that fuses a transformer with multi-head dilated convolution. The newly designed multi-head dilated convolution module is integrated sequentially into the transformer encoder architecture, forming a feature extraction module in which the two components complement each other. First, each sample is expanded into a two-dimensional feature map and fed into the feature extraction module. Then, the patch feature fusion module and the classifier module produce the diagnostic output. In addition, the interpretability of the proposed model is studied by visualizing the entire decision process in order to understand its decision-making mechanism. Experimental results on three different datasets show that the proposed model achieves high fault diagnosis accuracy with relatively short data windows: the highest accuracy reached 97.95%, up to 10.97% higher than that of the other models. The feasibility of the model is also verified on a real-world dataset from the rotating parts of an injection molding machine. The model's strong performance across datasets demonstrates its effectiveness in extracting comprehensive fault feature information and its potential for practical industrial applications.
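To make the idea of multi-head dilated convolution concrete, the following is a minimal pure-Python sketch, not the MultiDilatedFormer implementation. It assumes one head per dilation rate, a shared odd-length kernel, zero padding, and a one-dimensional convolution applied to a single row of the folded sample; the function names (`to_feature_map`, `dilated_conv1d`, `multi_head_dilated_conv`), the kernel, and the dilation rates are all hypothetical choices made for illustration. The transformer encoder, patch feature fusion module, and classifier are not shown.

```python
from typing import List, Sequence


def to_feature_map(sample: Sequence[float], width: int) -> List[List[float]]:
    """Fold a 1-D sample into rows of `width` points (any remainder is dropped).

    Assumption: the abstract only says the sample is expanded into a 2-D
    feature map; simple row-wise folding is one possible way to do that.
    """
    rows = len(sample) // width
    return [list(sample[r * width:(r + 1) * width]) for r in range(rows)]


def dilated_conv1d(x: Sequence[float], kernel: Sequence[float], dilation: int) -> List[float]:
    """Zero-padded, same-length 1-D dilated convolution (cross-correlation form).

    Assumes an odd-length kernel so that it can be centred on each position.
    """
    centre = len(kernel) // 2
    out = []
    for i in range(len(x)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = i + (j - centre) * dilation  # taps are spaced `dilation` apart
            if 0 <= idx < len(x):              # positions outside x count as zero
                acc += w * x[idx]
        out.append(acc)
    return out


def multi_head_dilated_conv(
    x: Sequence[float], kernel: Sequence[float], dilations: Sequence[int]
) -> List[List[float]]:
    """Run one head per dilation rate and return the head outputs side by side."""
    return [dilated_conv1d(x, kernel, d) for d in dilations]


# Toy usage: a 16-point sample folded into a 4 x 4 map, two heads on the first row.
signal = [float(v) for v in range(16)]
feature_map = to_feature_map(signal, 4)
heads = multi_head_dilated_conv(feature_map[0], [1.0, 1.0, 1.0], [1, 2])
print(heads)
```

With a kernel of length 3, a head with dilation rate d covers a span of 2d + 1 input points, so small rates capture local detail while larger rates reach wider context at no extra parameter cost. In this sketch the heads are simply returned as a list; how the proposed model combines them with self-attention is not specified in the text above.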