Autonomous pollination robots have attracted considerable attention in recent years. However, accurately estimating flower poses in complex agricultural environments remains a challenge. To address this, this work proposes a transformer-based architecture that learns the translational and rotational errors between the pollination robot’s end effector and the target object, with the aim of improving robotic pollination efficiency in cross-breeding tasks. The contributions are as follows: (1) We develop a transformer model equipped with two feedforward neural networks that directly regress the translational and rotational errors between the robot’s end effector and the pollination target. (2) We design a regression loss function guided by these translational and rotational errors, enabling the robot arm to rapidly and accurately locate the pollination target from its current position. (3) We design a strategy for readily acquiring a large number of training samples from eye-in-hand observations, which serve as model inputs, while the translational and rotational errors expressed in the end-effector Cartesian coordinate frame serve as the corresponding regression targets; this pairing facilitates model training. Experiments on a realistic robotic pollination system demonstrate that the proposed method outperforms the state-of-the-art method in both accuracy and efficiency.
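
To make the architecture concrete, the following is a minimal PyTorch sketch of a two-headed regression transformer and a combined translation/rotation loss of the kind described above. The patch-embedding front end, layer sizes, the axis-angle rotation parameterization, the L1 loss form, and the weighting factor `beta` are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch: transformer backbone with two feedforward regression heads
# for translational and rotational error, assuming an eye-in-hand RGB image
# as input. All hyperparameters below are illustrative assumptions.
import torch
import torch.nn as nn

class PosePredictionTransformer(nn.Module):
    def __init__(self, img_size=224, patch=16, dim=256, depth=6, heads=8):
        super().__init__()
        n_patches = (img_size // patch) ** 2
        # Patch embedding of the eye-in-hand image (ViT-style, assumed).
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, n_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        # Two feedforward heads regressing the errors between end effector
        # and pollination target, in the end-effector Cartesian frame.
        self.trans_head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                        nn.Linear(dim, 3))  # (dx, dy, dz)
        self.rot_head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                      nn.Linear(dim, 3))    # axis-angle (assumed)

    def forward(self, img):                              # img: (B, 3, H, W)
        x = self.embed(img).flatten(2).transpose(1, 2)   # (B, N, dim)
        x = torch.cat([self.cls.expand(x.size(0), -1, -1), x], dim=1)
        x = self.encoder(x + self.pos)
        feat = x[:, 0]                                   # CLS token summary
        return self.trans_head(feat), self.rot_head(feat)

def pose_error_loss(t_pred, r_pred, t_gt, r_gt, beta=1.0):
    # Regression loss guided by translational and rotational errors;
    # the L1 form and the `beta` weighting are assumptions.
    return (nn.functional.l1_loss(t_pred, t_gt)
            + beta * nn.functional.l1_loss(r_pred, r_gt))

# Example usage with dummy inputs and targets:
model = PosePredictionTransformer()
t_err, r_err = model(torch.randn(2, 3, 224, 224))
loss = pose_error_loss(t_err, r_err, torch.zeros(2, 3), torch.zeros(2, 3))
```

Decoupling the two heads lets the weight `beta` balance the different numeric scales of metric translation and rotation, which is one plausible reading of the error-guided loss in contribution (2).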