Abstract

Siamese networks have attracted wide attention in visual tracking due to their competitive accuracy and speed. However, existing Siamese trackers usually rely on a fixed linear aggregation of feature maps, which does not effectively fuse features from different layers with attention. Moreover, most Siamese trackers compute the similarity between the template and the search region through a cross-correlation operation between the features of the last blocks of the two branches, which may introduce redundant noise. To address these problems, this study proposes a novel Siamese visual tracking method via cross-layer calibration fusion, termed SiamCCF. An attention-based feature fusion module applies local and non-local attention to fuse features from the deep and shallow layers, capturing both local details and high-level semantic information. In addition, a cross-layer calibration module uses the fused features to calibrate the features of the last network blocks and to build cross-layer long-range spatial and inter-channel dependencies around each spatial location. Extensive experiments demonstrate that the proposed method achieves competitive tracking performance compared with state-of-the-art trackers on challenging benchmarks, including OTB100, OTB2013, UAV123, UAV20L, and LaSOT.
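
The abstract describes the two modules only at a high level. The following is a minimal PyTorch-style sketch of how such an attention-based fusion and cross-layer calibration could be wired together. All module names, layer sizes, and the specific attention choices (an SE-style channel gate for local attention, dot-product self-attention for non-local attention, and sigmoid modulation for calibration) are assumptions for illustration, not the paper's actual SiamCCF implementation.

```python
# Hypothetical sketch of the two modules named in the abstract; assumes PyTorch.
# Module names, channel counts, and attention designs are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeatureFusion(nn.Module):
    """Fuses a shallow and a deep feature map with local + non-local attention."""

    def __init__(self, channels: int):
        super().__init__()
        # Local attention (assumed SE-style): per-channel gate from pooled statistics.
        self.local_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // 4, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // 4, channels, 1),
            nn.Sigmoid(),
        )
        # Non-local attention: 1x1 projections for query/key/value.
        self.query = nn.Conv2d(channels, channels // 2, 1)
        self.key = nn.Conv2d(channels, channels // 2, 1)
        self.value = nn.Conv2d(channels, channels, 1)

    def forward(self, shallow: torch.Tensor, deep: torch.Tensor) -> torch.Tensor:
        # Resize the shallow map so both inputs share a spatial resolution.
        shallow = F.interpolate(shallow, size=deep.shape[-2:], mode="bilinear",
                                align_corners=False)
        x = shallow + deep
        b, c, h, w = x.shape
        # Non-local branch: every spatial position attends to every other position.
        q = self.query(x).flatten(2).transpose(1, 2)   # (b, hw, c/2)
        k = self.key(x).flatten(2)                     # (b, c/2, hw)
        v = self.value(x).flatten(2).transpose(1, 2)   # (b, hw, c)
        attn = torch.softmax(q @ k / (c // 2) ** 0.5, dim=-1)
        non_local = (attn @ v).transpose(1, 2).reshape(b, c, h, w)
        # Local branch: channel-wise gating of the summed features.
        local = x * self.local_gate(x)
        return local + non_local


class CrossLayerCalibration(nn.Module):
    """Calibrates the last-block feature with the fused multi-layer feature."""

    def __init__(self, channels: int):
        super().__init__()
        self.proj = nn.Conv2d(channels, channels, 1)

    def forward(self, last: torch.Tensor, fused: torch.Tensor) -> torch.Tensor:
        # Modulate the last-block feature by the fused feature at each spatial
        # location and channel, with a residual connection.
        weight = torch.sigmoid(self.proj(fused))
        return last * weight + last


# Toy usage with made-up shapes: a shallow (higher-resolution) and a deep map.
fuse = FeatureFusion(channels=256)
calib = CrossLayerCalibration(channels=256)
shallow = torch.randn(1, 256, 32, 32)
deep = torch.randn(1, 256, 16, 16)
calibrated = calib(deep, fuse(shallow, deep))  # same shape as `deep`
```

In a Siamese tracker, such modules would be applied to both the template and search branches before cross-correlation; how SiamCCF combines them is detailed in the full paper, not here.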
