Target tracking is an important research task in computer vision. Existing tracking algorithms based on Siamese networks often suffer from information redundancy between adjacent frames and an inability to capture global dependencies; when similar backgrounds appear around the target, tracking performance degrades significantly. Although trackers based on deep convolution and the Transformer have partially addressed these issues, achieving a good balance between the two remains a challenge. In this work, we propose a unified convolution and self-attention Siamese network for target tracking. By using a feature-extraction backbone that integrates convolutional and self-attention styles, the network captures globally important regions and key frames while greatly reducing redundant local computation, thereby improving tracking performance. We apply this backbone to the tracking task to strengthen feature extraction for both the target template and the search region. Experimental results show that the proposed tracker outperforms several recent classical tracking algorithms, with improvements of 10.7% on the high-diversity GOT-10K dataset and 24.7% on the large-scale, high-quality LaSOT dataset.
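To make the core idea concrete, the following is a minimal sketch (not the authors' implementation) of a backbone block that runs a convolutional branch and a self-attention branch over the same feature map and fuses them with learnable weights, in the spirit of unified conv/attention designs such as ACmix. All names (`ConvAttnBlock`, `alpha`, `beta`), channel sizes, and the fusion scheme below are illustrative assumptions.

```python
import torch
import torch.nn as nn


class ConvAttnBlock(nn.Module):
    """Applies a 3x3 convolution (local branch) and multi-head
    self-attention (global branch) to the same feature map, then
    combines the two branches with learnable scalar weights."""

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)
        # Learnable mixing weights for the conv and attention branches
        # (an assumed fusion scheme, not taken from the paper).
        self.alpha = nn.Parameter(torch.ones(1))
        self.beta = nn.Parameter(torch.ones(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        # Local branch: plain convolution over the spatial grid.
        local = self.conv(x)
        # Global branch: flatten H*W into a token sequence and self-attend.
        tokens = self.norm(x.flatten(2).transpose(1, 2))   # (B, H*W, C)
        attended, _ = self.attn(tokens, tokens, tokens)    # (B, H*W, C)
        attended = attended.transpose(1, 2).reshape(b, c, h, w)
        return self.alpha * local + self.beta * attended


if __name__ == "__main__":
    # In a Siamese tracker, the same backbone processes both inputs.
    block = ConvAttnBlock(channels=64)
    z = torch.randn(1, 64, 8, 8)     # target template features
    x = torch.randn(1, 64, 16, 16)   # search region features
    print(block(z).shape, block(x).shape)
```

Because the block is fully convolution- and attention-based with no fixed spatial size, a single instance can process both the small template crop and the larger search region, which is the property a shared Siamese backbone needs.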