Abstract

With the rise of general models, transformers have been adopted in visual object tracking algorithms as feature fusion networks. In these trackers, self-attention is used for global feature enhancement, while cross-attention is applied to fuse the features of the template and search regions and capture the global information of the object. However, studies have found that the features fused by cross-attention do not pay enough attention to the object region. To strengthen cross-attention on the object region, an enhanced cross-attention (ECA) module is proposed. By calculating the average attention score for each position in the fused feature sequence and assigning higher weights to the positions with higher attention scores, the proposed ECA module can enrich the feature information in the object region and further improve matching accuracy. In addition, to reduce the computational complexity of self-attention, orthogonal random features are introduced to implement a fast attention operation, which decomposes the attention matrix into a product of random non-linear functions of the original queries and keys. This module reduces space complexity and improves inference speed by avoiding the explicit construction of a quadratic attention matrix. Finally, a tracking method named GFETrack is proposed, which comprises a Siamese backbone network and an enhanced attention mechanism. Experimental results show that the proposed GFETrack achieves competitive results on four challenging datasets.
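To make the two mechanisms in the abstract concrete, the following is a minimal NumPy sketch, not the paper's implementation: `eca_reweight` illustrates the idea of averaging attention scores per position and boosting highly attended positions, and `random_feature_attention` shows a Performer-style linear attention with orthogonal random features that never materializes the full attention matrix. The function names, the residual-gain normalization, and the specific feature map are assumptions for illustration only.

```python
import numpy as np

def eca_reweight(attn, fused_tokens):
    """Hypothetical sketch of the ECA re-weighting idea.

    attn:         (N, N) cross-attention matrix (rows attend to columns)
    fused_tokens: (N, d) fused template/search features
    """
    # Average attention score received by each position (column-wise mean).
    scores = attn.mean(axis=0)                                   # (N,)
    # Normalize to [0, 1] and use as residual gains (assumed scheme).
    gains = (scores - scores.min()) / (scores.max() - scores.min() + 1e-6)
    # Emphasize positions with higher average attention (object region).
    return fused_tokens * (1.0 + gains[:, None])

def random_feature_attention(q, k, v, m=64, seed=0):
    """Linear-complexity attention with orthogonal random features
    (Performer/FAVOR+-style sketch, assuming m <= head dimension).

    q, k: (N, d) queries/keys;  v: (N, d_v) values;  m: feature count.
    """
    d = q.shape[1]
    rng = np.random.default_rng(seed)
    # Orthogonalize a Gaussian block via QR; rescale rows to norm sqrt(d).
    w = np.linalg.qr(rng.standard_normal((d, m)))[0].T * np.sqrt(d)  # (m, d)

    def phi(x):
        # Positive random-feature map approximating the softmax kernel.
        proj = x @ w.T / d ** 0.25
        return np.exp(proj - (x ** 2).sum(-1, keepdims=True)
                      / (2.0 * np.sqrt(d))) / np.sqrt(m)

    qf, kf = phi(q), phi(k)                                      # (N, m)
    # Never build the N x N matrix: combine keys and values first.
    kv = kf.T @ v                                                # (m, d_v)
    normalizer = qf @ kf.sum(axis=0)                             # (N,)
    return (qf @ kv) / (normalizer[:, None] + 1e-6)
```

Because the keys and values are contracted before the queries are applied, the cost of `random_feature_attention` scales linearly with the sequence length N rather than quadratically, which is the source of the memory and speed savings claimed for the fast attention module.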
