Abstract

Recently, the Vision Transformer (ViT) has exhibited remarkable performance in many computer vision tasks (e.g., object detection, segmentation, and tracking). However, the output feature map of ViT is single-scale and low-resolution, which may discard rich, detailed semantic information. Meanwhile, ViT embeds features through a linear projection, which prevents it from capturing local spatial context. Furthermore, self-attention, the core component of the Transformer, captures long-range dependencies at the cost of a large memory footprint during training. In this paper, a novel hierarchical model is proposed to remedy these issues. First, a convolutional vision Transformer is employed as the backbone for feature extraction and fusion. Second, a novel asymmetric structure is presented to compute the cross-relation between the template and search branches. Third, different selection operations are devised for the inputs of the attention module in the two branches. Extensive experiments conducted on five mainstream benchmarks demonstrate the superiority of our tracker. The code will be made available.
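To make the asymmetric template/search design concrete, below is a minimal sketch of one plausible reading, in which the template branch attends only to its own tokens while the search branch attends to the concatenation of template and search tokens to obtain the cross-relation. This is not the paper's implementation; the module name, token shapes, and attention layout are assumptions introduced purely for illustration.

```python
import torch
import torch.nn as nn


class AsymmetricCrossAttention(nn.Module):
    """Hypothetical sketch (not the paper's code) of asymmetric attention
    between template and search tokens: the template branch attends only to
    itself, while the search branch attends to template + search tokens."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, template: torch.Tensor, search: torch.Tensor):
        # template: (B, N_t, C), search: (B, N_s, C)
        mixed = torch.cat([template, search], dim=1)                 # shared keys/values for the search branch
        template_out, _ = self.attn(template, template, template)    # template attends to itself only
        search_out, _ = self.attn(search, mixed, mixed)              # search attends to template + search
        return template_out, search_out


if __name__ == "__main__":
    B, N_t, N_s, C = 2, 64, 256, 384
    block = AsymmetricCrossAttention(C)
    z, x = block(torch.randn(B, N_t, C), torch.randn(B, N_s, C))
    print(z.shape, x.shape)  # torch.Size([2, 64, 384]) torch.Size([2, 256, 384])
```

One motivation for such an asymmetric layout is that restricting the template branch to its own tokens keeps the target representation stable and reduces attention cost, while the search branch still receives full cross-information from the template.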
