Transformer-based tracking methods have shown great potential in visual tracking and achieved strong performance. A conventional Transformer-based feature fusion network divides the whole feature map into multiple image patches as inputs and processes them directly in parallel, which consumes considerable computing resources and limits the efficiency of multi-head attention. In this paper, we design a novel Transformer-based feature fusion network with optimized multi-head attention in an encoder-decoder architecture. The designed feature fusion network preprocesses the input features and modifies the multi-head attention computation through two components: an efficient multi-head self-attention module and an efficient multi-head spatial-reduction attention module. These two modules reduce the influence of irrelevant background information, enhance the representation ability of template and search-region features, and greatly reduce the computational complexity. Based on this feature fusion network, we propose a novel Transformer tracking method, named EMAT. EMAT is evaluated on seven challenging tracking benchmarks to demonstrate its superiority: LaSOT, GOT-10k, TrackingNet, UAV123, VOT2018, NfS, and VOT-RGBT2019. The proposed tracker achieves strong performance, obtaining a precision score of 89.0% on UAV123, an AUC score of 64.6% on LaSOT, and an EAO score of 34.8% on VOT-RGBT2019, outperforming most state-of-the-art trackers. EMAT runs at a real-time speed of about 35 FPS during tracking.
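To make the complexity claim concrete, below is a minimal sketch of spatial-reduction attention, assuming the paper's "efficient multi-head spatial-reduction attention" follows the common PVT-style design in which keys and values are spatially downsampled by a ratio R before attention, cutting the attention cost from O(N^2) to roughly O(N^2 / R^2). The module and parameter names here are illustrative, not the authors' code.

    # Hedged sketch of PVT-style spatial-reduction attention (illustrative,
    # not the authors' implementation).
    import torch
    import torch.nn as nn

    class SpatialReductionAttention(nn.Module):
        def __init__(self, dim, num_heads=8, sr_ratio=2):
            super().__init__()
            assert dim % num_heads == 0
            self.num_heads = num_heads
            self.head_dim = dim // num_heads
            self.scale = self.head_dim ** -0.5
            self.q = nn.Linear(dim, dim)
            self.kv = nn.Linear(dim, dim * 2)
            self.proj = nn.Linear(dim, dim)
            self.sr_ratio = sr_ratio
            if sr_ratio > 1:
                # Strided conv downsamples the key/value feature map by sr_ratio,
                # shrinking the number of key/value tokens by sr_ratio**2.
                self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)
                self.norm = nn.LayerNorm(dim)

        def forward(self, x, H, W):
            # x: (B, N, C) token sequence flattened from an H x W feature map.
            B, N, C = x.shape
            q = self.q(x).reshape(B, N, self.num_heads, self.head_dim).transpose(1, 2)
            if self.sr_ratio > 1:
                x_ = x.transpose(1, 2).reshape(B, C, H, W)
                x_ = self.sr(x_).reshape(B, C, -1).transpose(1, 2)  # fewer tokens
                x_ = self.norm(x_)
            else:
                x_ = x
            kv = self.kv(x_).reshape(B, -1, 2, self.num_heads, self.head_dim)
            k, v = kv.permute(2, 0, 3, 1, 4)  # each: (B, heads, N', head_dim)
            # Attention matrix is (N x N') instead of (N x N), hence the savings.
            attn = (q @ k.transpose(-2, -1)) * self.scale
            attn = attn.softmax(dim=-1)
            out = (attn @ v).transpose(1, 2).reshape(B, N, C)
            return self.proj(out)

The key design point is that queries keep full spatial resolution while keys and values are reduced, so the output still has one token per input location but each attention map is R^2 times smaller.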