Abstract

Deep learning (DL) based trackers have recently attracted tremendous interest for their high performance. Despite this remarkable success, most trackers that rely on deep convolutional features neglect tracking speed, which is crucial for aerial tracking on mobile devices. In this paper, we propose an efficient and effective transformer-based aerial tracker built on the Siamese framework, which inherits the merits of both transformer and Siamese architectures. Specifically, the outputs of multiple convolutional layers are fed into a transformer to construct robust features for the template patch and the search patch, respectively. Consequently, the interdependencies between low-level and semantic information are interactively fused, improving the ability to encode target appearance. Finally, traditional depth-wise cross correlation is applied to generate a similarity map for object localization and bounding-box regression. Extensive experiments on three popular benchmarks (DTB70, UAV123@10fps, and UAV20L) demonstrate that the proposed tracker outperforms 12 other state-of-the-art trackers while achieving a real-time speed of 71.3 frames per second (FPS) on a GPU, making it suitable for mobile platforms.
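The depth-wise cross correlation mentioned above can be sketched as follows. This is a minimal NumPy illustration of the general operation (as popularized in Siamese trackers such as SiamRPN++), not the paper's implementation; the function name, array shapes, and valid-correlation convention are assumptions for illustration.

```python
import numpy as np

def depthwise_xcorr(search, template):
    """Depth-wise cross correlation: each channel of the template feature
    map acts as a correlation kernel slid over the matching search channel,
    producing a per-channel similarity map (no cross-channel mixing).

    search:   (C, Hs, Ws) search-region features
    template: (C, Ht, Wt) template features, with Ht <= Hs and Wt <= Ws
    returns:  (C, Hs - Ht + 1, Ws - Wt + 1) similarity maps ("valid" mode)
    """
    C, Hs, Ws = search.shape
    _, Ht, Wt = template.shape
    out = np.zeros((C, Hs - Ht + 1, Ws - Wt + 1))
    for c in range(C):
        for i in range(Hs - Ht + 1):
            for j in range(Ws - Wt + 1):
                # Inner product between the template channel and the
                # current search window of the same channel.
                out[c, i, j] = np.sum(search[c, i:i + Ht, j:j + Wt] * template[c])
    return out

# Toy usage: a 2-channel, 4x4 search region against a 2x2 template.
response = depthwise_xcorr(np.ones((2, 4, 4)), np.ones((2, 2, 2)))
```

In deep-learning frameworks this loop is typically expressed as a grouped convolution with one group per channel (e.g. `conv2d(..., groups=C)` in PyTorch), which is why it adds almost no parameters or latency to the head.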
