Transformer-based visual object tracking via fine–coarse concatenated attention and cross concatenated MLP

Long Gao, Langkun Chen, Pan Liu, Yan Jiang, Yunsong Li, Jifeng Ning

doi:10.1016/j.patcog.2023.109964

Abstract

Transformer-based trackers have demonstrated promising performance in visual object tracking tasks. Nevertheless, two drawbacks limit further performance improvement of transformer-based trackers. Firstly, the static receptive field of the tokens within one attention layer of the original self-attention learning neglects the multi-scale nature of the object tracking task. Secondly, the learning procedure of the multi-layer perceptron (MLP) in the feed-forward network (FFN) lacks local interaction information among samples. To address these issues, a new self-attention learning method, fine–coarse concatenated attention (FCA), is proposed to learn self-attention with both fine- and coarse-granularity information. Moreover, the cross-concatenation MLP (CC-MLP) is developed to capture local interaction information across samples. Based on the two proposed modules, a novel encoder and decoder are constructed and integrated into an all-attention tracking algorithm, FCAT. Comprehensive experiments on popular tracking datasets, OTB2015, LaSOT, GOT-10K and TrackingNet, demonstrate the effectiveness of FCA and CC-MLP, and FCAT achieves state-of-the-art performance on these datasets.
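
The idea of attending over fine and coarse granularity at once can be illustrated with a minimal sketch. The abstract does not specify how FCA builds its coarse tokens or its projections, so everything below is an assumption made for illustration: coarse tokens are obtained by average pooling non-overlapping groups of fine tokens, the query, key and value projections are the identity, and the names avg_pool_tokens and fine_coarse_attention are hypothetical. It should not be read as the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def avg_pool_tokens(tokens, window):
    """Average non-overlapping groups of `window` tokens into coarse tokens.

    Assumption: average pooling stands in for whatever coarse-granularity
    operator the paper actually uses.
    """
    pooled = []
    for start in range(0, len(tokens), window):
        group = tokens[start:start + window]
        dim = len(group[0])
        pooled.append([sum(t[d] for t in group) / len(group) for d in range(dim)])
    return pooled


def fine_coarse_attention(tokens, window=2):
    """Scaled dot-product attention whose keys/values are the fine tokens
    concatenated with pooled coarse tokens.

    Assumption: identity Q/K/V projections and a single head, to keep the
    sketch short.
    """
    coarse = avg_pool_tokens(tokens, window)
    keys = tokens + coarse  # concatenate along the token axis
    dim = len(tokens[0])
    scale = math.sqrt(dim)
    outputs = []
    for q in tokens:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        # Values are the same vectors as the keys (identity projection).
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, keys)) for d in range(dim)]
        )
    return outputs


# Four 2-D "fine" tokens; window=2 yields two coarse tokens.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
out = fine_coarse_attention(tokens, window=2)
```

In this sketch each fine token sees six keys (four fine, two coarse) instead of four, so a single attention layer mixes two receptive-field sizes. The cost grows only by the number of coarse tokens, which shrinks as the pooling window grows.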

Full Text