Date palm plantations in the United Arab Emirates (UAE) are under threat from soil salinity, drought, and date palm weevils. Monitoring and conserving date palms is therefore crucial to preserving a vital component of the country’s agricultural heritage, economy, food security, and ecological balance. Previous studies have effectively identified date palm trees in RGB aerial and UAV imagery using diverse deep learning methods. However, the use of very high-resolution satellite data for delineating individual date palm crowns remains unexplored, owing to the limited spatial resolution of existing satellite systems. This study aimed to achieve precise and comprehensive mapping of date palm trees from WorldView-3 (WV-3) satellite data by leveraging the high representational power of state-of-the-art vision transformers (ViTs) in capturing global information from the input data. First, an in-depth assessment of various transformer-based semantic segmentation architectures, including UperNet with vision transformer and Swin transformer, SegFormer, Mask2Former, and UniFormer, was conducted. Second, the effect of integrating spectral data on the performance of ViTs was evaluated. Third, the models’ generalizability and the effect of model complexity on segmentation effectiveness were assessed. Finally, a postprocessing strategy was developed to aid in delineating and counting date palm trees from semantic segmentation outputs. Results demonstrated that integrating WV-3 spectral data markedly improved segmentation quality. The UniFormer, UperNet-Swin, and Mask2Former models showed considerable improvements with multispectral data, with increases in mean intersection over union (mIoU) of 2.17% (77.88% mIoU, 86.01% mean F-score [mF-score]), 2% (78.10% mIoU, 86.18% mF-score), and 1.15% (77.36% mIoU, 85.59% mF-score), respectively, compared with their RGB-based results.
Evaluations of model transferability also indicated that the Mask2Former, UniFormer, and UperNet-Swin models adapted efficiently to multispectral data in the Dibba region. These models achieved mIoU scores of 84.36%, 84.25%, and 83.17% and mF-scores of 90.95%, 90.87%, and 90.13%, respectively, underscoring their effectiveness and potential for broader regional application. This research highlights the efficacy and feasibility of using ViTs with WV-3 multispectral data for accurate and comprehensive surveying of date palm plantations, enabling the development of palm tree inventories and the continuous updating of geospatial databases.
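For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how mIoU and mF-score are conventionally computed from per-pixel labels. It is not the authors' evaluation code: the two-class setup (0 = background, 1 = date palm), the toy label vectors, and the function names `confusion_matrix` and `miou_and_mfscore` are assumptions made purely for illustration.

```python
from typing import List, Sequence, Tuple


def confusion_matrix(y_true: Sequence[int], y_pred: Sequence[int],
                     num_classes: int) -> List[List[int]]:
    """Count pixels by (true class, predicted class); rows are true classes."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def miou_and_mfscore(cm: List[List[int]]) -> Tuple[float, float]:
    """Return (mean IoU, mean F-score) averaged over the classes that occur."""
    n = len(cm)
    ious, fscores = [], []
    for c in range(n):
        tp = cm[c][c]                                # correctly labeled pixels
        fn = sum(cm[c]) - tp                         # class-c pixels that were missed
        fp = sum(cm[r][c] for r in range(n)) - tp    # pixels wrongly labeled as c
        if tp + fp + fn == 0:
            continue  # class absent from both reference and prediction
        ious.append(tp / (tp + fp + fn))             # IoU = TP / (TP + FP + FN)
        fscores.append(2 * tp / (2 * tp + fp + fn))  # F1 = 2TP / (2TP + FP + FN)
    if not ious:
        raise ValueError("confusion matrix contains no labeled pixels")
    return sum(ious) / len(ious), sum(fscores) / len(fscores)


# Toy example with assumed classes: 0 = background, 1 = date palm.
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 1, 1, 1, 0, 0]
miou, mfscore = miou_and_mfscore(confusion_matrix(y_true, y_pred, num_classes=2))
print(f"mIoU = {miou:.2%}, mF-score = {mfscore:.2%}")
```

In this toy case each class has three true positives, one false positive, and one false negative, so both classes score an IoU of 3/5 and an F-score of 6/8. Averaging over classes in this way gives the background and date palm classes equal weight regardless of how many pixels each covers.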