Abstract

In the realm of visual tracking, remote sensing videos captured by Unmanned Aerial Vehicles (UAVs) have seen significant advancements with wide applications. However, conventional Transformer-based trackers still struggle to balance tracking accuracy and inference speed, a problem that is further exacerbated when Transformers are deployed at larger model scales. To address this challenge, we present a fast and efficient UAV tracking framework, denoted SiamPT, which aims to reduce the number of Transformer layers without sacrificing the model's discriminative ability. To this end, we transfer prompting techniques from multi-modal tracking into UAV tracking, proposing a novel self-prompting method that exploits the target's inherent characteristics in the search branch to discriminate targets from the background. Specifically, a self-distribution strategy is introduced to capture feature-level relationships by segmenting tokens into distinct smaller patches. Salient tokens within the full attention map are then identified as foreground targets, enabling the fusion of local region information. These fused tokens serve as prompters that enhance the identification of distractors, thereby avoiding the need for model expansion. SiamPT demonstrates impressive results on the UAV123 benchmark, achieving success and precision rates of 0.694 and 0.890, respectively, while maintaining an inference speed of 91.0 FPS.
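The abstract does not include implementation details, but the core idea of selecting salient tokens from the attention map and fusing them into a prompt can be sketched as follows. This is a minimal illustrative sketch in NumPy, not the authors' code: the function name `select_salient_prompts`, the use of column-mean attention as the saliency score, and mean-pooling as the fusion step are all assumptions.

```python
import numpy as np

def select_salient_prompts(tokens, attn, k=4):
    """Hypothetical sketch: pick the k tokens receiving the highest
    average attention (treated as foreground) and fuse them into a
    single prompt token prepended to the sequence.

    tokens: (N, D) token features from the search branch
    attn:   (N, N) full attention map (row i attends to column j)
    """
    # Saliency of each token = how much attention it receives on average
    # (column mean of the attention map) -- an assumed proxy, not from the paper
    saliency = attn.mean(axis=0)                          # (N,)
    top_idx = np.argsort(saliency)[-k:]                   # indices of salient tokens
    # Fuse the salient (foreground) tokens into one prompt token
    prompt = tokens[top_idx].mean(axis=0, keepdims=True)  # (1, D)
    # Prepend the prompt so subsequent layers can use it to suppress distractors
    return np.concatenate([prompt, tokens], axis=0), top_idx
```

In this sketch the prompt is derived from the search features themselves (hence "self-prompting"), so no extra Transformer layers or external prompt inputs are required.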
