The application of convolutional neural networks (CNNs) has greatly broadened the scope and application scenarios of intelligent fault diagnosis and has significantly improved the performance of intelligent models. Reliable feature extraction and fault diagnosis for machinery operating under heavy noise is important for stable industrial production. However, the local receptive field of a CNN prevents it from capturing the global features needed to collect sufficient fault information, which degrades its diagnostic performance under heavy noise. In this article, a novel framework named Convformer-NSE is developed to extract robust features that integrate both global and local information, with the aim of improving end-to-end fault diagnosis of gearboxes under heavy noise. First, Convformer is constructed to improve the nonlinear representation of the feature map; within it, a sparse modified multi-self attention models the long-range dependencies of the feature map while retaining attention on local features. Then, the spatial features extracted at various scales are fused and fed into the designed novel SENet (NSE) for adaptive channel learning. Convformer-NSE is applied to raw vibration data from different gearbox systems. Analyses of the experimental signals demonstrate that the developed framework outperforms the compared methods.
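The general idea of combining local convolutional features with global attention features and then reweighting channels can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes a plain single-head scaled dot-product attention in place of the sparse modified multi-self attention, a standard squeeze-and-excitation gate in place of NSE, and random untrained weights. All function names, shapes, and the reduction ratio `r` are hypothetical choices made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def local_conv1d(x, kernel):
    # Per-channel "same" convolution; x has shape (channels, length).
    return np.stack([np.convolve(row, kernel, mode="same") for row in x])

def global_self_attention(x):
    # Assumption: plain single-head attention over time steps, not the
    # sparse modified variant described in the abstract.
    tokens = x.T                                   # (length, channels)
    scores = tokens @ tokens.T / np.sqrt(tokens.shape[1])
    weights = softmax(scores, axis=-1)             # (length, length)
    return (weights @ tokens).T                    # (channels, length)

def se_gate(x, w1, w2):
    # Assumption: standard squeeze-and-excitation, not the authors' NSE.
    squeezed = x.mean(axis=1)                      # squeeze: (channels,)
    hidden = np.maximum(w1 @ squeezed, 0.0)        # bottleneck with ReLU
    gate = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))    # sigmoid gate in (0, 1)
    return x * gate[:, None], gate

rng = np.random.default_rng(0)
C, L, r = 8, 64, 4                                 # hypothetical sizes
signal = rng.standard_normal((C, L))               # stand-in for vibration features

local = local_conv1d(signal, np.array([0.25, 0.5, 0.25]))
glob = global_self_attention(signal)
fused = np.concatenate([local, glob], axis=0)      # (2C, L): local + global

w1 = rng.standard_normal((2 * C // r, 2 * C)) * 0.1  # untrained weights
w2 = rng.standard_normal((2 * C, 2 * C // r)) * 0.1
out, gate = se_gate(fused, w1, w2)
```

Here `out` keeps the shape of `fused`, with each channel scaled by a learned-style weight between 0 and 1; in a trained model these weights would let the network emphasize channels that carry fault information and suppress noisy ones.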