Aerial image recognition (AIR) has received increasing interest in the remote sensing community, where complex backgrounds containing diverse object types pose major challenges to recognition. The vision transformer (ViT) can capture long-range contextual relations, but it tends to overfit the training samples. This problem is particularly acute in AIR, where training data are often scarce. Training ViT with label smoothing can alleviate overfitting, but it risks underfitting and relies heavily on data quality. Furthermore, the application of ViT to AIR has been limited because aerial images generally exhibit greater intra-class variation and inter-class similarity than natural images. We therefore propose a ViT-based framework, D-BiT, that extends ViT to learn discriminative features from aerial images by attaching two heads to the ViT backbone. The first is a classifier head with a Discriminative Label Smoothing Module (DLSM), which learns effective features without overfitting the distribution of noisy data, even when the data volume is insufficient. The second is a projector head with a newly designed supervised contrastive loss, which exploits label information to achieve a more compact and reasonable intra-class structure. Experimental results on three popular AIR benchmarks, the Aerial Image Dataset (AID), UC Merced, and Event Recognition in Aerial videos (ERA), demonstrate the effectiveness of D-BiT, which achieves state-of-the-art (SOTA) performance. In particular, D-BiT significantly improves the accuracy of ViT for recognizing aerial images.
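To make the two-head design concrete, the following is a minimal PyTorch-style sketch of such an architecture. The abstract does not specify the internals of DLSM or the new supervised contrastive loss, so this sketch substitutes uniform label smoothing (via the `label_smoothing` option of `F.cross_entropy`) and the standard supervised contrastive loss of Khosla et al. (2020) as placeholders; the names `DBiT`, `supcon_loss`, `dbit_loss`, `feat_dim`, `proj_dim`, and the weighting factor `lam` are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DBiT(nn.Module):
    """Hypothetical two-head model: a ViT backbone feeding a classifier
    head and a projector head, as described in the abstract."""
    def __init__(self, backbone, num_classes, feat_dim=768, proj_dim=128):
        super().__init__()
        self.backbone = backbone          # any ViT returning a [B, feat_dim] embedding
        self.classifier = nn.Linear(feat_dim, num_classes)
        self.projector = nn.Sequential(   # projection MLP for the contrastive branch
            nn.Linear(feat_dim, feat_dim), nn.ReLU(inplace=True),
            nn.Linear(feat_dim, proj_dim),
        )

    def forward(self, x):
        h = self.backbone(x)                          # [B, feat_dim] features
        logits = self.classifier(h)                   # classification branch
        z = F.normalize(self.projector(h), dim=1)     # unit-norm embeddings
        return logits, z

def supcon_loss(z, labels, tau=0.07):
    """Standard supervised contrastive loss (Khosla et al., 2020), used here
    as a stand-in for the paper's newly designed loss."""
    sim = z @ z.t() / tau                             # pairwise similarities
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float('-inf'))   # exclude self-pairs
    pos = labels.unsqueeze(0).eq(labels.unsqueeze(1)) & ~self_mask
    log_prob = sim - sim.logsumexp(dim=1, keepdim=True)  # log-softmax over non-self pairs
    # mean log-probability over each anchor's positives (same-label samples)
    mean_pos = log_prob.masked_fill(~pos, 0.0).sum(1) / pos.sum(1).clamp(min=1)
    return -mean_pos[pos.any(1)].mean()               # anchors with >= 1 positive

def dbit_loss(logits, z, labels, lam=0.5, smoothing=0.1):
    """Joint objective: label-smoothed cross-entropy (standing in for DLSM)
    plus the contrastive term, with an assumed weighting factor lam."""
    ce = F.cross_entropy(logits, labels, label_smoothing=smoothing)
    return ce + lam * supcon_loss(z, labels)
```

Under these assumptions, the contrastive term pulls same-class embeddings together on the unit sphere, encouraging the compact intra-class structure the abstract describes, while the smoothed cross-entropy keeps the classifier from fitting label noise; the weight `lam` would need to be tuned per dataset.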