TransT, a Transformer-based object tracking algorithm, achieves significant gains in precision and success rate by fusing features extracted by a convolutional neural network through a Transformer architecture. However, when the object's appearance deforms, the algorithm suffers from insufficient tracking accuracy and drift, which directly degrade its stability. To overcome this problem, this paper expands scene information at the scale level during the fusion process and, on that basis, achieves accurate recognition and localization. The predicted results are promptly fed back into the subsequent tracking process, into which temporal templates are embedded. Addressing the problem from both the spatial and the temporal side effectively improves the adaptive ability of the tracking model. The final experimental comparisons show that the proposed algorithm adapts well to object deformation and improves the overall performance of the tracking model.
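The two ideas named above can be illustrated with a minimal sketch: fusing feature maps from several scales into one representation, and embedding predictions back into a temporal template. This is a hypothetical simplification, not the paper's actual network; the function names, the nearest-neighbour upsampling, and the exponential-moving-average update rate `alpha` are all illustrative assumptions.

```python
import numpy as np

def fuse_multiscale(features):
    """Hypothetical scale-level fusion: upsample every feature map to the
    largest spatial size (nearest-neighbour, via np.kron) and average them.
    Assumes square maps whose sizes divide the largest size evenly."""
    target = max(f.shape[0] for f in features)
    resized = []
    for f in features:
        rep = target // f.shape[0]           # integer upsampling factor
        resized.append(np.kron(f, np.ones((rep, rep))))
    return np.mean(resized, axis=0)

def update_template(template, prediction, alpha=0.1):
    """Hypothetical temporal-template embedding: blend the latest predicted
    appearance into the running template with an exponential moving average,
    so the model adapts as the object deforms over time."""
    return (1.0 - alpha) * template + alpha * prediction

# Usage sketch: fuse a coarse and a fine map, then adapt the template.
fine = np.ones((4, 4))
coarse = np.zeros((2, 2))
fused = fuse_multiscale([fine, coarse])      # shape (4, 4), all values 0.5
template = update_template(np.zeros((4, 4)), fused, alpha=0.1)
```

The moving-average update is a common stand-in for temporal adaptation: a small `alpha` keeps the template stable, while a larger `alpha` lets it track faster appearance changes.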