Abstract

Recently, the Vision Transformer (ViT) has exhibited remarkable performance in many computer vision tasks (e.g., object detection, segmentation, and tracking). However, the output feature map of ViT is single-scale and low-resolution, which may discard rich, detailed semantic information. Meanwhile, ViT embeds features through a linear projection, which prevents it from capturing local spatial context. Furthermore, self-attention, the core component of the Transformer, captures long-range dependencies at the cost of a large memory footprint during training. In this paper, a novel hierarchical model is proposed to remedy these issues. First, a convolutional vision Transformer is employed as the backbone for feature extraction and fusion. Second, a novel asymmetric structure is presented to compute the cross-relation between the template and search branches. Third, different selection operations are devised for the inputs of the attention module in the two branches. Extensive experiments conducted on five mainstream benchmarks demonstrate the superiority of our tracker. The code will be made available.
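To make the asymmetric template/search design concrete, below is a minimal sketch of one plausible reading, in which the template branch attends only to its own tokens while the search branch attends to the concatenation of template and search tokens to obtain the cross-relation. This is not the paper's implementation; the module name, token shapes, and attention layout are assumptions introduced purely for illustration.

```python
import torch
import torch.nn as nn


class AsymmetricCrossAttention(nn.Module):
    """Hypothetical sketch (not the paper's code) of asymmetric attention
    between template and search tokens: the template branch attends only to
    itself, while the search branch attends to template + search tokens."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, template: torch.Tensor, search: torch.Tensor):
        # template: (B, N_t, C), search: (B, N_s, C)
        mixed = torch.cat([template, search], dim=1)                 # shared keys/values for the search branch
        template_out, _ = self.attn(template, template, template)    # template attends to itself only
        search_out, _ = self.attn(search, mixed, mixed)              # search attends to template + search
        return template_out, search_out


if __name__ == "__main__":
    B, N_t, N_s, C = 2, 64, 256, 384
    block = AsymmetricCrossAttention(C)
    z, x = block(torch.randn(B, N_t, C), torch.randn(B, N_s, C))
    print(z.shape, x.shape)  # torch.Size([2, 64, 384]) torch.Size([2, 256, 384])
```

One motivation for such an asymmetric layout is that restricting the template branch to its own tokens keeps the target representation stable and reduces attention cost, while the search branch still receives full cross-information from the template.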
