Visual tracking is a core component of intelligent transportation systems, especially for unmanned driving and road surveillance. Numerous convolutional neural network (CNN) trackers have achieved unprecedented performance. However, CNN features, which encode regular spatial context relationships, struggle to match rigid target templates when dramatic deformation and occlusion occur. In this paper, we propose a novel full Transformer Sub-patch Matching network for tracking (TSMtrack), which decomposes the tracked object into sub-patches and matches the extracted sub-patches in an interlaced fashion by leveraging the attention mechanism native to the Transformer. Rooted in the Transformer architecture, TSMtrack consists of image patch decomposition, sub-patch matching, and position prediction. Specifically, TSMtrack converts the whole frame into sub-patches and extracts the sub-patch features independently. Through sub-patch matching and FFN-like prediction, TSMtrack measures similarity between sub-patch features independently, in an interlaced and iterative fashion. With a full Transformer pipeline, we achieve a high-quality trade-off between tracking speed and performance. Experiments on nine benchmarks demonstrate the effectiveness of our Transformer sub-patch matching framework. In particular, it achieves an AO of 75.6 on GOT-10K and an SR of 57.9 on WebUAV-3M while running at 48 FPS on an RTX-2060s GPU.
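To make the sub-patch matching idea concrete, below is a minimal PyTorch sketch of the pipeline the abstract describes: a frame is decomposed into non-overlapping sub-patches, each sub-patch is embedded independently, search-region sub-patches are matched against template sub-patches via attention, and an FFN-like head regresses a position. All names, shapes, and design choices here (`SubPatchMatcher`, the patch size, the mean-pooled box regression) are illustrative assumptions, not the paper's actual implementation.

```python
# A minimal sketch of attention-based sub-patch matching, under assumed
# shapes and hyperparameters; TSMtrack's real architecture may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SubPatchMatcher(nn.Module):
    """Embeds template/search sub-patches independently, matches them with
    cross-attention, and predicts a box with an FFN-like head."""
    def __init__(self, patch=16, dim=256, heads=8):
        super().__init__()
        self.patch = patch
        # Independent linear embedding of each flattened sub-patch.
        self.embed = nn.Linear(3 * patch * patch, dim)
        # Cross-attention: search sub-patches attend to template sub-patches.
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # FFN-like prediction head (here: 4 box coordinates).
        self.head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                  nn.Linear(dim, 4))

    def to_subpatches(self, x):
        # (B, 3, H, W) -> (B, N, 3*patch*patch): non-overlapping sub-patches.
        p = self.patch
        return F.unfold(x, kernel_size=p, stride=p).transpose(1, 2)

    def forward(self, template, search):
        t = self.embed(self.to_subpatches(template))  # (B, Nt, dim)
        s = self.embed(self.to_subpatches(search))    # (B, Ns, dim)
        # Similarity between sub-patch features is measured via attention.
        matched, _ = self.attn(query=s, key=t, value=t)
        # Pool matched features and regress an assumed (cx, cy, w, h) box.
        return self.head(matched.mean(dim=1))

tracker = SubPatchMatcher()
template = torch.randn(1, 3, 128, 128)  # target template crop
search = torch.randn(1, 3, 256, 256)    # current search region
print(tracker(template, search).shape)  # torch.Size([1, 4])
```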