Recently, one-stream pipelines, in which the template and search images interact from the early stages, have made significant progress in visual object tracking (VOT). However, they have a potential problem: they treat the object and the background (or other irrelevant parts) equally, which weakens the discriminability of the extracted features. To remedy this issue, this paper proposes a restricted token interaction module based on an asymmetric attention mechanism, which divides the search image into a valuable part and the remaining part. Only the valuable part is selected for cross-attention with the template, so as to better distinguish the object from the background, which ultimately improves localization accuracy and robustness. In addition, to avoid heavy computational overhead, we use logit distillation and localization distillation to optimize the outputs of the classification and regression heads, respectively. At the same time, we separate the distillation regions and apply different knowledge distillation methods in different regions, so as to effectively determine which regions are most beneficial for classification or localization learning. Extensive experiments on mainstream datasets show that our tracker (dubbed RIDTrack) achieves appealing results while meeting the real-time requirement.
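
To make the idea of restricted token interaction more concrete, the following is a minimal NumPy sketch of one way an asymmetric attention mask could be built. It is not the RIDTrack implementation: the function names `build_restricted_mask` and `masked_attention`, the hand-picked `valuable` mask, the token counts, and the use of the raw tokens as queries, keys, and values (no learned projections) are all assumptions made for illustration. The sketch also assumes that template queries may attend only to template tokens and to the valuable search tokens, while search queries attend to every token; the abstract does not specify these details.

```python
import numpy as np


def build_restricted_mask(n_template, valuable):
    """Return a boolean (N, N) mask; True means the query row may attend to the key column.

    Assumption: template queries see template keys plus only the valuable
    search keys, while search queries see all keys.
    """
    valuable = np.asarray(valuable, dtype=bool)
    n_total = n_template + valuable.shape[0]
    allowed = np.ones((n_total, n_total), dtype=bool)
    # Restrict the template-to-search block to the valuable search tokens.
    allowed[:n_template, n_template:] = valuable[None, :]
    return allowed


def masked_attention(q, k, v, allowed):
    """Scaled dot-product attention in which disallowed pairs get zero weight."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(allowed, scores, -np.inf)
    # Subtract the row maximum for numerical stability before exponentiating.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


# Toy example: 4 template tokens followed by 6 search tokens of dimension 8.
rng = np.random.default_rng(0)
n_template, n_search, dim = 4, 6, 8
tokens = rng.standard_normal((n_template + n_search, dim))

# Hypothetical selection of the "valuable" search tokens.
valuable = np.array([False, True, True, False, False, True])

allowed = build_restricted_mask(n_template, valuable)
out, weights = masked_attention(tokens, tokens, tokens, allowed)
```

In this sketch the template rows of `weights` are exactly zero on the non-valuable search columns, so background tokens cannot leak into the template representation, while the search rows remain unrestricted. How the valuable part is actually selected is left open here.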