Vision Transformer (ViT) shows potential in bearing fault diagnosis because its multi-head self-attention mechanism and parallel feature extraction network can efficiently produce a robust and complete feature representation of the fault. However, its robustness to noise interference depends on a large number of training samples from which to learn the local features of the fault, and its performance may degrade when only a limited number of samples are available for training. To address this challenge, an improved ViT diagnosis model based on local feature expansion, LFE-ViT, is proposed. An auxiliary feature extraction block built on a local feature expansion network is introduced and works in parallel with the ViT encoder. By enlarging the receptive field, this block extracts multi-scale local features in a high-dimensional space from the limited samples. The extracted local features are then passed to the ViT encoder through a feature embedding channel. Finally, the multi-head self-attention mechanism captures global information along the time sequence, yielding a fault diagnosis model that combines local and global feature information. Experimental validation on the bearing fault dataset from Case Western Reserve University shows that LFE-ViT achieves satisfactory diagnosis performance with limited samples and in noisy environments.
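
The data flow described above (receptive-field enlargement, feature embedding, then multi-head self-attention) can be illustrated with a minimal NumPy sketch. It is not the LFE-ViT implementation: the use of dilated convolutions for receptive-field enlargement, the fixed averaging kernel, the random untrained weights, the additive fusion of the two branches, and all names (`dilated_conv1d`, `local_feature_expansion`, `lfe_vit_sketch`, and so on) are assumptions made purely for illustration.

```python
import numpy as np

def dilated_conv1d(x, kernel, dilation):
    """Zero-padded, same-length 1D dilated convolution (odd kernel length)."""
    k = len(kernel)
    pad = dilation * (k - 1) // 2
    xp = np.pad(x, pad)
    out = np.zeros_like(x, dtype=float)
    for i, w in enumerate(kernel):
        out += w * xp[i * dilation : i * dilation + len(x)]
    return out

def local_feature_expansion(x, kernel, dilations):
    """Assumed stand-in for the local branch: one feature map per dilation rate."""
    return np.stack([dilated_conv1d(x, kernel, d) for d in dilations])  # (S, N)

def to_patches(features, patch_len):
    """Cut (C, N) feature maps into T = N // patch_len tokens of size C * patch_len."""
    c, n = features.shape
    t = n // patch_len
    f = features[:, : t * patch_len].reshape(c, t, patch_len)
    return f.transpose(1, 0, 2).reshape(t, c * patch_len)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(tokens, wq, wk, wv, wo, n_heads):
    """Scaled dot-product self-attention over all tokens (global context)."""
    t, d = tokens.shape
    dh = d // n_heads

    def split(m):
        return m.reshape(t, n_heads, dh).transpose(1, 0, 2)  # (H, T, dh)

    q, k, v = split(tokens @ wq), split(tokens @ wk), split(tokens @ wv)
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(dh))   # (H, T, T)
    merged = (attn @ v).transpose(1, 0, 2).reshape(t, d)
    return merged @ wo, attn

def lfe_vit_sketch(signal, patch_len=16, d_model=32, n_heads=4,
                   dilations=(1, 2, 4), seed=0):
    rng = np.random.default_rng(seed)
    kernel = np.ones(3) / 3.0                                 # fixed, untrained kernel
    local = local_feature_expansion(signal, kernel, dilations)
    raw_tokens = to_patches(signal[None, :], patch_len)       # (T, P)
    local_tokens = to_patches(local, patch_len)               # (T, S * P)
    # Assumed "feature embedding channel": linear projections, fused by addition.
    w_raw = rng.normal(scale=0.1, size=(patch_len, d_model))
    w_local = rng.normal(scale=0.1, size=(local_tokens.shape[1], d_model))
    tokens = raw_tokens @ w_raw + local_tokens @ w_local
    wq, wk, wv, wo = (rng.normal(scale=0.1, size=(d_model, d_model))
                      for _ in range(4))
    return multi_head_self_attention(tokens, wq, wk, wv, wo, n_heads)

signal = np.sin(np.linspace(0.0, 20.0, 256))   # synthetic stand-in for a vibration signal
features, attention = lfe_vit_sketch(signal)
print(features.shape, attention.shape)
```

In this sketch, a larger dilation rate lets a three-tap kernel span a wider stretch of the signal without adding parameters, which is one common way to obtain multi-scale local features; the attention step then relates every patch to every other patch.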