Abstract

Siamese trackers formulate the visual tracking task as a similarity matching problem through cross correlation. It is arduous for such methods to track targets with the presence of distractors. We suspect the reasons are twofold: 1) The irrelevant activated channels in the correlation map will produce ambiguous matching results. 2) The pipeline is a per-frame matching process and cannot handle the response aberrance caused by temporal context variation. In this paper, we propose a spatio-temporal matching process to thoroughly explore the capability of 4-D matching in space (height, width and channel) and time. In spatial matching, we introduce a space-variant instance-aware correlation (SI-Corr) to implement different channel-wise response recalibration for each matching position. SI-Corr can guide the generation of instance-aware features and distinguish the target and distractors at the instance level. In temporal matching, we design an aberrance repressed module (ARM) to investigate the short-term positional relationship between the target and distractors. ARM utilizes a simple optimization method to restrict the abrupt alteration of the interframe response maps, which allows the network to learn a temporal consistency of context structure distribution. Moreover, we efficiently embed temporal consistency into the inference process. Experiments on six benchmarks, including OTB100, VOT2018, VOT2020, GOT-10k, LaSOT and TrackingNet, demonstrate the state-of-the-art performance of the proposed method.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call