<p>Wheat head detection is a critical task in precision agriculture for estimating crop yield and optimizing agricultural practices. Conventional object detection architectures often struggle to detect densely packed and overlapping wheat heads in complex agricultural field images. To address this challenge, a novel CEnterNet-vision TRansformer model for Wheat Head Detection (CETR) is proposed. The CETR model combines the strengths of two cutting-edge methods: CenterNet and the Vision Transformer (ViT). A dataset of agricultural farm images labeled with precise wheat head annotations is used to train and evaluate the CETR model. Comprehensive experiments were conducted to compare CETR's performance against convolutional neural network models commonly used in agricultural applications. CETR achieves an mAP of 0.8318, higher than AlexNet, VGG19, ResNet152, and MobileNet, indicating that it is more effective at detecting wheat heads in agricultural images. It predicts bounding boxes that align more closely with the ground truth, resulting in more accurate and reliable wheat head detection. The higher performance of CETR can be attributed to its two-stage architecture, which combines CenterNet and ViT and exploits the advantages of both methods. Moreover, the transformer-based architecture of CETR enables better generalization across different agricultural environments, making it a suitable solution for automated agricultural applications.</p>
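<p>The abstract does not specify how CenterNet and ViT are coupled internally. The following is a minimal PyTorch sketch, not the authors' implementation, illustrating one plausible reading of the two-stage idea: a CNN stem produces stride-4 feature maps, a ViT-style transformer encoder refines the flattened feature tokens, and a CenterNet-style head predicts a center heatmap plus box sizes. All class names and hyperparameters (image size 128, embedding dimension 256, 4 encoder layers) are illustrative assumptions.</p>
<pre><code>import torch
import torch.nn as nn


class CenterNetHead(nn.Module):
    """CenterNet-style head: per-pixel center heatmap and box-size regression."""

    def __init__(self, in_ch, num_classes=1):
        super().__init__()
        self.heatmap = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, num_classes, 1), nn.Sigmoid())
        self.wh = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 2, 1))

    def forward(self, x):
        return self.heatmap(x), self.wh(x)


class CETRSketch(nn.Module):
    """Hypothetical CETR-like model: CNN stem -> ViT-style encoder -> CenterNet head.

    This is a sketch of the two-stage idea described in the abstract,
    not the published architecture.
    """

    def __init__(self, img_size=128, dim=256, depth=4, heads=8):
        super().__init__()
        # CNN stem downsamples by 4 (CenterNet typically predicts at stride 4).
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2, padding=3), nn.ReLU(inplace=True),
            nn.Conv2d(64, dim, 3, stride=2, padding=1), nn.ReLU(inplace=True))
        feat = img_size // 4
        # Learned positional embedding for the flattened feature tokens.
        self.pos = nn.Parameter(torch.zeros(1, feat * feat, dim))
        enc_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=depth)
        self.head = CenterNetHead(dim)

    def forward(self, x):
        f = self.stem(x)                        # (B, dim, H/4, W/4)
        b, c, h, w = f.shape
        tokens = f.flatten(2).transpose(1, 2)   # (B, H*W/16, dim)
        tokens = self.encoder(tokens + self.pos[:, : h * w])
        f = tokens.transpose(1, 2).reshape(b, c, h, w)
        return self.head(f)                     # center heatmap, box sizes


if __name__ == "__main__":
    model = CETRSketch()
    heat, wh = model(torch.randn(1, 3, 128, 128))
    print(heat.shape, wh.shape)  # torch.Size([1, 1, 32, 32]) torch.Size([1, 2, 32, 32])
</code></pre>
<p>At inference time, a CenterNet-style decoder would take local maxima of the heatmap as wheat head centers and read the corresponding width/height values to form bounding boxes; that post-processing step is omitted here for brevity.</p>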