Abstract

Reliable feature interaction plays a critical role in visual tracking, especially in the currently dominant Siamese-based tracking paradigm. In general, there are two primary approaches for fusing representations of the template and the search region in the Siamese setting: cross-correlation and transformer modeling. The former provides a straightforward interaction mechanism but may struggle in complex scenarios such as appearance variations and occlusion, while the latter offers more effective interaction at the cost of higher computational complexity and model size. In contrast to traditional Siamese-based trackers that rely on these two interaction operators, this paper proposes a novel Correlation-Refine (CR) network that addresses the lack of semantic information caused by the local linear matching of cross-correlation, from both spatial and channel perspectives. The CR network is built solely on fully convolutional layers, without intricate transformer mechanisms or complex multi-scale feature fusion. Moreover, we present a concise yet effective convolutional tracking framework based on the CR network. The CR network increases the discriminative ability of semantic information in a coarse-to-fine manner: by stacking multiple CR layers, it gradually learns the semantic features of the target and suppresses interference from similar objects. Extensive experiments and comparisons with recent competitive trackers on challenging large-scale benchmarks demonstrate that our tracker outperforms all previous convolutional trackers and achieves results competitive with transformer-based methods. The code will be made available.
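To make the idea of purely convolutional, spatial-and-channel refinement of a correlation map concrete, below is a minimal PyTorch sketch of what a stack of correlation-refine layers could look like. This is an illustrative assumption based only on the abstract, not the paper's actual implementation: the names CRLayer and CRNetwork, the residual connection, the squeeze-and-excitation-style channel gating, and the channel-reduction ratio are all hypothetical.

```python
import torch
import torch.nn as nn


class CRLayer(nn.Module):
    """Illustrative correlation-refine layer (hypothetical): refines a
    correlation map with purely convolutional spatial and channel branches."""

    def __init__(self, channels):
        super().__init__()
        # Spatial refinement: a 3x3 convolution over the correlation map.
        self.spatial = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        # Channel refinement: global pooling followed by 1x1 convolutions
        # that produce per-channel gating weights (assumed design).
        self.channel = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // 4, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // 4, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, corr):
        refined = self.spatial(corr)
        refined = refined * self.channel(refined)
        # Residual connection keeps the coarse correlation cues while
        # each layer adds progressively finer semantic refinement.
        return corr + refined


class CRNetwork(nn.Module):
    """Stack of CR layers applied in a coarse-to-fine manner (sketch only)."""

    def __init__(self, channels, num_layers=3):
        super().__init__()
        self.layers = nn.ModuleList(CRLayer(channels) for _ in range(num_layers))

    def forward(self, corr):
        for layer in self.layers:
            corr = layer(corr)
        return corr


# Usage: refine a (batch, C, H, W) template-search correlation response.
corr = torch.randn(1, 256, 25, 25)
refined = CRNetwork(channels=256)(corr)
```

Under this reading, stacking more CR layers would let the network progressively sharpen the response around the true target while suppressing peaks caused by similar distractors, without introducing any attention or multi-scale fusion modules.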
