In high-noise environments with few fault samples, it is difficult to extract enough useful fault information, which complicates gear fault diagnosis. To address these issues, this paper proposes a fault diagnosis method for planetary gearboxes based on intrinsic feature extraction and an attention mechanism. The method applies the complementary ensemble empirical mode decomposition (CEEMD) algorithm to the fault vibration signal to obtain a series of modal components. The components are compared, and those containing a significant amount of fault features are selected and converted by wavelet transform into two-dimensional time–frequency images. A neural network model based on an attention mechanism and large-scale convolution is also proposed. The preprocessed images are fed into the network for feature extraction, during which the large-scale convolution with a residual structure retains as much effective feature information as possible while the attention network further filters the features. The filtered features are then used for fault classification. The model is validated on the gear datasets from Southeast University and the University of Connecticut and compared with the Pro-MobileNetV3, channel attention and multiscale convolutional neural network, multiscale dynamic adaptive residual network, and CBAM-ResNeXt50 models. The proposed model reaches an accuracy of 100% before Gaussian noise is added and 99.68% after, significantly higher than the other models.
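The step that turns a selected modal component into a two-dimensional time–frequency image can be illustrated with a minimal sketch. It assumes a continuous wavelet transform with a complex Morlet wavelet, which the abstract does not specify. The function name `morlet_scalogram`, the parameter `w0`, the sampling rate, the analysis frequencies, and the synthetic 50 Hz tone are all hypothetical choices made for illustration, not the authors' settings.

```python
import cmath
import math


def morlet_scalogram(signal, freqs_hz, fs, w0=6.0):
    """Magnitude of a Morlet CWT: one row per analysis frequency, one column per sample."""
    n = len(signal)
    image = []
    for f in freqs_hz:
        scale = w0 / (2.0 * math.pi * f)         # scale (seconds) whose centre frequency is f
        half = int(math.ceil(4.0 * scale * fs))  # truncate at 4 envelope standard deviations
        # Conjugated, scaled wavelet sampled on the signal's time grid
        kernel = []
        for k in range(-half, half + 1):
            u = k / (fs * scale)
            kernel.append(cmath.exp(-1j * w0 * u) * math.exp(-0.5 * u * u) / math.sqrt(scale))
        row = []
        for i in range(n):
            acc = 0j
            lo = max(0, i - half)
            hi = min(n, i + half + 1)
            for idx in range(lo, hi):
                acc += signal[idx] * kernel[idx - i + half]
            row.append(abs(acc) / fs)            # divide by fs to approximate the integral
        image.append(row)
    return image


# Toy input (assumption): a 50 Hz tone sampled at 1 kHz stands in for one modal component.
fs = 1000.0
tone = [math.sin(2.0 * math.pi * 50.0 * i / fs) for i in range(256)]
freqs = [25.0, 50.0, 100.0]
image = morlet_scalogram(tone, freqs, fs)

# Index of the frequency row with the largest magnitude at the middle sample
mid = len(tone) // 2
strongest = max(range(len(freqs)), key=lambda r: image[r][mid])
print(freqs[strongest])
```

Each row of `image` is the wavelet magnitude at one analysis frequency, so the nested list can be rendered directly as a grayscale or color-mapped picture for a convolutional network. On this toy input the 50 Hz row should dominate at the middle sample. A practical pipeline would likely use many more frequencies and a vectorized wavelet library rather than pure-Python loops.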