Automatic font generation can greatly improve the productivity of font designers. Few-shot font generation, which aims to synthesize new fonts from only a few reference samples, has recently attracted considerable attention from researchers. The task is valuable but challenging, especially for ideograms with high diversity and complex structures. Existing models based on convolutional neural networks (CNNs) struggle to generate glyphs with accurate font style and stroke details in the few-shot setting. This paper proposes TransFont, which exploits the long-range dependency modeling ability of the Vision Transformer (ViT) for few-shot font generation. For the first time, we empirically show that ViT outperforms CNNs on glyph image generation. Furthermore, motivated by the observation that glyph feature maps are highly redundant, we introduce a glyph self-attention module to mitigate the quadratic computational and memory complexity of pixel-level glyph image generation, together with several new techniques: multi-head multiple sampling, yz axis convolution, and approximate relative position bias. Extensive experiments on two Chinese font libraries demonstrate the superiority of our method over existing CNN-based font generation models: the proposed TransFont generates glyph images with more accurate font style and stroke details.
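As a rough illustration only (not the paper's actual glyph self-attention module, whose details are not given here), the following PyTorch sketch shows one common way that redundancy in a flattened glyph feature map can be exploited: attending from all N query tokens to a subsampled set of key/value tokens shrinks the attention cost from O(N^2) to roughly O(N^2 / r). All names (`SubsampledGlyphAttention`, `sample_ratio`, the strided subsampling) are illustrative assumptions.

```python
# Hedged sketch: subsampled key/value self-attention over glyph feature tokens.
# This is an assumption-based illustration, not the paper's exact method.
import torch
import torch.nn as nn


class SubsampledGlyphAttention(nn.Module):
    def __init__(self, dim, num_heads=8, sample_ratio=4):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.sample_ratio = sample_ratio
        self.q_proj = nn.Linear(dim, dim)
        self.kv_proj = nn.Linear(dim, 2 * dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x):
        # x: (B, N, C) flattened glyph feature tokens, N = H * W.
        B, N, C = x.shape
        q = self.q_proj(x).reshape(B, N, self.num_heads, self.head_dim).transpose(1, 2)

        # Exploit feature-map redundancy: keep only every `sample_ratio`-th token
        # as key/value, so attention is (N x M) instead of (N x N), M = N / r.
        kv_tokens = x[:, ::self.sample_ratio, :]
        M = kv_tokens.shape[1]
        kv = self.kv_proj(kv_tokens).reshape(B, M, 2, self.num_heads, self.head_dim)
        k, v = kv.permute(2, 0, 3, 1, 4)          # each: (B, heads, M, head_dim)

        attn = (q @ k.transpose(-2, -1)) * self.scale  # (B, heads, N, M)
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.out_proj(out)


# Usage: a 64x64 glyph feature map flattened to 4096 tokens of width 256.
x = torch.randn(2, 64 * 64, 256)
print(SubsampledGlyphAttention(dim=256)(x).shape)  # torch.Size([2, 4096, 256])
```

The strided subsampling here is only one possible reduction; the paper's multi-head multiple sampling, yz axis convolution, and approximate relative position bias are separate techniques not reproduced in this sketch.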