Sparse Transformer Refinement Similarity Map for Aerial Tracking
Transformers have significantly enhanced performance in visual tracking tasks, owing largely to their powerful self-attention mechanism. Most current research applies Transformers to feature extraction so that the model concentrates on the target itself, while the refinement of the similarity map receives limited attention. Similarity map refinement aims to readjust the focus onto target-relevant information. However, naive self-attention cannot prioritize the most important cues and is easily distracted by irrelevant information. In this paper, we introduce a sparse attention mechanism into the similarity map refinement module, addressing this issue and achieving more accurate tracking. Additionally, we construct a more suitable encoder that effectively encodes spatial features and temporal information. Extensive experiments demonstrate that our method (TCT+) achieves efficient aerial tracking, outperforming our baseline (TCTrack) on UAV123, DTB70, and UAV123@10fps. Our code is available at https://github.com/wolfwaytx/TCTPlus.
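To illustrate the general idea of sparse attention that the abstract refers to, the sketch below implements one common variant, top-k sparse attention: each query attends only to its k highest-scoring keys and the remaining entries are dropped before the softmax, so low-scoring (potentially distracting) positions receive exactly zero weight. This is a minimal illustrative sketch, not the paper's actual module; the function name, the top-k selection rule, and the `topk` hyperparameter are all assumptions introduced here for exposition.

```python
import math

def topk_sparse_attention(q, k, v, topk=2):
    """Single-head top-k sparse attention (illustrative sketch only).

    q: list of query vectors, k: list of key vectors, v: list of value
    vectors (all plain Python lists of floats). Each query attends only
    to its `topk` highest-scoring keys; all other keys get zero weight.
    """
    d = len(q[0])
    out = []
    for qi in q:
        # Scaled dot-product similarity scores against every key.
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        # Keep only the indices of the top-k scores; drop the rest entirely.
        keep = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:topk]
        # Softmax over the surviving scores only (max-shifted for stability).
        m = max(scores[i] for i in keep)
        exps = {i: math.exp(scores[i] - m) for i in keep}
        z = sum(exps.values())
        # Weighted sum of the corresponding value vectors.
        out.append([sum(w / z * v[i][t] for i, w in exps.items())
                    for t in range(len(v[0]))])
    return out
```

Compared with dense softmax attention, which assigns every position a nonzero weight, the hard cutoff here concentrates the attention budget on the few most relevant positions, which matches the abstract's motivation of suppressing distracting information in the similarity map.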