Abstract

Siamese trackers have drawn sustained attention in the object tracking community due to their favorable balance between accuracy and inference speed. Nevertheless, it remains unclear how to effectively exploit the target appearance cues and motion cues contained in videos to improve tracker performance. To address this problem, we develop a Siamese network with diverse prior information integrated, namely DPINet, by extending a powerful anchor-free Siamese network with two novel blocks. First, we design a channel- and space-aware feature enhancement (CSE) block that highlights target-specific features along two dimensions (channel and spatial). It makes full use of the target cues in the initial frame by treating them as guidance, thereby strengthening target-related representations in the feature maps, and it also facilitates the interplay between the two input branches. Second, we propose a cross-correlation block with multi-dimensional information fusion (MDI-XCorr). In this block, target motion cues across adjacent frames are mined and used as supervision to refine the response map of the current frame during inference, enhancing both tracking quality and stability. Evaluations on five popular benchmarks show that DPINet achieves 0.702 (AUC), 0.474 (EAO), 0.336 (EAO), 0.613 (AO), and 0.527 (AUC) on OTB100, VOT2018, VOT2019, GOT-10k, and LaSOT, respectively.
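
To make the channel- and spatial-attention idea behind the CSE block concrete, the sketch below reweights search-branch features using pooled template (initial-frame) features as guidance. It follows the common squeeze-and-excite / CBAM-style pattern; the module name ChannelSpatialEnhance, the layer sizes, and the specific way template features guide the search features are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal, illustrative sketch (assumed, not the authors' CSE block):
# channel attention guided by the template descriptor, followed by
# spatial attention over the enhanced search features.
import torch
import torch.nn as nn


class ChannelSpatialEnhance(nn.Module):
    """Reweight search-branch features along channel and spatial dimensions,
    using globally pooled template (initial-frame) features as guidance."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        # Channel attention: template descriptor -> per-channel weights.
        self.channel_fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )
        # Spatial attention: pooled channel statistics -> per-location weights.
        self.spatial_conv = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, search_feat: torch.Tensor, template_feat: torch.Tensor):
        # search_feat:   (B, C, Hs, Ws) features from the search region
        # template_feat: (B, C, Ht, Wt) features from the initial-frame target
        b, c, _, _ = search_feat.shape

        # 1) Channel attention guided by the template's global descriptor.
        template_desc = template_feat.mean(dim=(2, 3))           # (B, C)
        channel_w = self.channel_fc(template_desc).view(b, c, 1, 1)
        feat = search_feat * channel_w

        # 2) Spatial attention from avg- and max-pooled channel maps.
        avg_map = feat.mean(dim=1, keepdim=True)                  # (B, 1, Hs, Ws)
        max_map, _ = feat.max(dim=1, keepdim=True)                # (B, 1, Hs, Ws)
        spatial_w = self.spatial_conv(torch.cat([avg_map, max_map], dim=1))
        return feat * spatial_w


if __name__ == "__main__":
    block = ChannelSpatialEnhance(channels=256)
    search = torch.randn(2, 256, 31, 31)    # search-region feature map
    template = torch.randn(2, 256, 7, 7)    # initial-frame target feature map
    print(block(search, template).shape)    # torch.Size([2, 256, 31, 31])
```

In this hedged reading, the template branch supplies the "target cues" that modulate the search branch, which is one plausible way to realize the interplay between the two input branches described above.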
