Multiple templates transformer for visual object tracking

Haibo Pang, Jie Su, Rongqi Ma, Tingting Li, Chengming Liu

doi:10.1016/j.knosys.2023.111025

Abstract

Matching the similarity between a template and a search region is crucial in Siamese trackers. However, because a fixed template provides limited information, existing trackers are not robust enough in complex scenarios such as severe deformation, background clutter, out-of-view, illumination variation, low resolution, scale variation, fast motion, and full occlusion. It is therefore essential to use a more informative template. Additionally, since the Transformer has superior modeling capability compared to traditional cross-correlation in tracking, some Siamese trackers have integrated Transformers and achieved exceptional performance. In this paper, we present a novel tracking architecture, the Multiple Templates Transformer (MTT), to address the above issues. By utilizing multiple templates, the proposed method captures more contextual information and historical changes of the target, which are leveraged to enhance the response in the search region through an encoder-decoder framework. We also explore different mechanisms for fusing templates effectively to achieve higher accuracy. We evaluate MTT on several widely used benchmarks, including GOT-10k, TrackingNet, UAV123, OTB2015, VOT2018, and LaSOT. Extensive experimental results indicate that our tracker achieves better robustness under different challenges while maintaining real-time speed.

Full Text