Convolutional neural networks (CNNs) have dominated synthetic aperture radar (SAR) automatic target recognition (ATR) for years. However, when only a few SAR images are available, the width and depth of CNN-based models are constrained, which hinders the enlargement of the receptive field needed to capture global features in images and ultimately leads to poor recognition performance. To address these challenges, we propose a Convolutional Transformer (ConvT) for few-shot learning (FSL) in SAR ATR. The proposed method constructs a hierarchical feature representation and captures global dependencies of local features in each layer, a design we name global in local. A novel hybrid loss is proposed that interprets the few SAR images in the form of both recognition labels and contrastive image pairs, constructs abundant anchor-positive and anchor-negative image pairs within one batch, and provides sufficient loss for optimizing the ConvT to overcome the few-sample effect. An auto augmentation is proposed to increase the diversity and number of the few training samples, to explore the hidden features in the few SAR images, and to avoid over-fitting in SAR ATR FSL. Experiments conducted on the Moving and Stationary Target Acquisition and Recognition (MSTAR) dataset show the effectiveness of the proposed ConvT for SAR ATR FSL. Unlike existing SAR ATR FSL methods that employ additional training datasets, our method achieves pioneering performance without using other SAR target images in training.
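
To make the hybrid-loss idea concrete, the following is a minimal NumPy sketch and not the authors' implementation. It assumes that the label term is a standard cross-entropy, that the pair term is a classic margin-based contrastive loss over all image pairs in a batch, and that the two are combined by a weighted sum. The names `hybrid_loss`, `pair_contrastive`, `count_pairs`, `weight`, and `margin` are hypothetical and do not come from the paper.

```python
import numpy as np

def cross_entropy(logits, labels):
    """Mean cross-entropy over a batch (recognition-label term)."""
    # Subtract the row maximum for a numerically stable log-softmax.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def count_pairs(labels):
    """Number of (anchor-positive, anchor-negative) pairs in one batch."""
    rows, cols = np.triu_indices(len(labels), k=1)  # all unordered pairs
    same = labels[rows] == labels[cols]
    return int(same.sum()), int((~same).sum())

def pair_contrastive(embeddings, labels, margin=1.0):
    """Margin-based contrastive loss over every pair in the batch.

    Assumed form: same-class pairs are pulled together (d^2), and
    different-class pairs are pushed apart until they exceed `margin`.
    """
    diff = embeddings[:, None, :] - embeddings[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1) + 1e-12)
    rows, cols = np.triu_indices(len(labels), k=1)
    d = dist[rows, cols]
    same = labels[rows] == labels[cols]
    pos_term = same * d ** 2
    neg_term = (~same) * np.maximum(margin - d, 0.0) ** 2
    return (pos_term + neg_term).mean()

def hybrid_loss(logits, embeddings, labels, weight=0.5, margin=1.0):
    """Assumed combination: cross-entropy plus a weighted pair term."""
    return cross_entropy(logits, labels) + weight * pair_contrastive(
        embeddings, labels, margin
    )
```

Under these assumptions, a batch of B images yields B(B-1)/2 pairs, so the number of pair-based training signals grows quadratically with the batch size even when the number of labelled images stays small, which is the intuition behind using image pairs to counter the few-sample effect.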