Abstract

Most current thermal infrared (TIR) trackers rely on feature matching between the search image and a fixed template cropped from the first frame. Some Siamese-based TIR trackers add a template update mechanism that introduces historical prediction information along the temporal dimension through correlation filters. However, their feature representations are not strong enough to withstand target scale variation, appearance change, and occlusion. To address this, we propose a spatio-temporal fusion Transformer (STFT) model for robust TIR object tracking. Our approach uses a Transformer-based encoder–decoder to fuse spatio-temporal information. Specifically, we design a dynamic template update strategy based on a salient points feature (SPF) representation, which lets the model exploit the most informative spatio-temporal cues by retrieving multiple salient points on the target image. To strengthen the dynamic template update strategy, we propose an IoU-Aware target state estimation head that uses a joint representation of target classification and localization, and we develop an IoU-Aware criterion for estimating the quality of the dynamic template. We evaluate the proposed STFT-Net on three challenging benchmarks, where extensive experiments show that it outperforms well-established tracking algorithms. The code is available at https://github.com/qinxin-wh/STFT-Net.
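
The idea of gating the dynamic template on an IoU-aware quality score can be illustrated with a minimal sketch. This is not the STFT-Net implementation: the names `iou`, `DynamicTemplate`, and `maybe_update`, the box format, and the threshold value 0.6 are all hypothetical and chosen only for illustration. The sketch assumes that the estimation head outputs a single score per frame that was trained to track the IoU between the predicted and ground-truth boxes, so that the score can stand in for localization quality at inference time.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical box format for this sketch: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


@dataclass
class DynamicTemplate:
    """Holds the most recent trusted target box (illustrative only)."""
    threshold: float = 0.6  # assumed value, not taken from the paper
    box: Optional[Box] = None
    frame_index: Optional[int] = None

    def maybe_update(self, frame_index: int, predicted_box: Box,
                     quality_score: float) -> bool:
        """Replace the template only when the IoU-aware score is high enough."""
        if quality_score >= self.threshold:
            self.box = predicted_box
            self.frame_index = frame_index
            return True
        return False


# During training, iou(predicted, ground_truth) could serve as the target
# for the quality score; at inference only the predicted score is available.
template = DynamicTemplate()
template.maybe_update(1, (10.0, 10.0, 50.0, 60.0), quality_score=0.3)  # rejected
template.maybe_update(2, (12.0, 11.0, 52.0, 61.0), quality_score=0.8)  # accepted
```

Rejecting low-score frames keeps occluded or drifting predictions out of the template, which is the failure mode a fixed first-frame template avoids at the cost of not adapting to appearance change.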
