Abstract

In visual tracking, the Transformer architecture is widely used because it can capture global dependencies in sequence data without inductive bias. However, the attention mechanism of the Transformer incurs very high computational complexity and memory occupancy, so that the tracking task cannot meet real-time requirements. In this paper, we explore a sparse region-aware attention mechanism. The sparse attention mechanism retains only the regions with semantic relevance and performs fine-grained attention computation within these regions. Within the region-aware attention mechanism, a DropKey technique is introduced to reduce over-fitting and improve the generalization ability of the model. Using region-aware attention as the basic building block, we design a dynamic region-aware Transformer backbone for visual tracking. This backbone effectively reduces computational complexity while still exploring global context dependencies. Based on the region-aware Transformer backbone, we propose a dynamic region-aware Transformer visual tracking algorithm that uses an optimization-based model predictor to fully fuse object appearance and background information, achieving more robust object tracking. The proposed tracker is trained in an end-to-end manner and evaluated on eight tracking benchmarks. Experimental results show that the algorithm delivers strong tracking performance; in particular, for unmanned aerial vehicle (UAV) tracking, the proposed tracker achieves an area under the curve (AUC) score of 66.5% on the UAV123 dataset. Code is available at https://github.com/YSGFF/RTDiMP.
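The following is a minimal sketch, not the authors' implementation, of the two ideas the abstract describes: a coarse region-selection step that keeps only semantically relevant regions for fine-grained attention, and DropKey-style regularization that masks key logits before the softmax rather than dropping attention weights afterwards. It assumes a PyTorch setting; the function name `region_aware_attention` and parameters such as `top_k_regions`, `dropkey_rate`, and the precomputed `region_ids` partition are illustrative assumptions.

```python
# Illustrative sketch of sparse region-aware attention with DropKey (not the paper's code).
import torch
import torch.nn.functional as F

def region_aware_attention(q, k, v, region_ids, top_k_regions=4, dropkey_rate=0.1, training=True):
    """q, k, v: (B, N, D) token features; region_ids: (N,) region index assigned to each key token."""
    B, N, D = q.shape
    scale = D ** -0.5

    # Coarse stage: score each region by pooled key similarity and keep the top-k regions.
    num_regions = int(region_ids.max().item()) + 1
    region_keys = torch.stack(
        [k[:, region_ids == r].mean(dim=1) for r in range(num_regions)], dim=1)          # (B, R, D)
    region_scores = (q.mean(dim=1, keepdim=True) @ region_keys.transpose(-2, -1)).squeeze(1)  # (B, R)
    keep = region_scores.topk(min(top_k_regions, num_regions), dim=-1).indices            # (B, k)

    # Fine stage: full attention restricted to tokens belonging to the retained regions.
    token_mask = torch.zeros(B, N, dtype=torch.bool, device=q.device)
    for b in range(B):
        for r in keep[b]:
            token_mask[b] |= (region_ids == r)

    attn = (q @ k.transpose(-2, -1)) * scale                                              # (B, N, N)
    attn = attn.masked_fill(~token_mask.unsqueeze(1), float("-inf"))

    # DropKey: randomly mask key logits *before* the softmax (training only),
    # which regularizes the attention distribution instead of its output.
    if training and dropkey_rate > 0:
        drop = torch.rand_like(attn) < dropkey_rate
        attn = attn.masked_fill(drop, float("-inf"))

    return F.softmax(attn, dim=-1) @ v
```

Because attention is computed only over tokens of the retained regions, the fine-grained cost scales with the number of kept tokens rather than the full token set, which is how the backbone can reduce complexity while still modeling global context.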
