Abstract
Convolutional Neural Networks (CNNs) and Transformers are two powerful representation learning techniques for visual tracking. Although CNNs effectively reduce local redundancy via small-neighborhood convolution operations, their limited receptive fields make it difficult to capture global dependencies. Self-attention in Transformers takes patches as the input representation and can effectively capture long-range dependencies; however, blind similarity comparisons between all patches lead to high redundancy. Is there, then, a technique that combines the advantages of both paradigms for visual tracking? In this work, we design a novel backbone network for feature extraction. First, we use depthwise convolution and pointwise convolution to build a Convolution Mixer, which effectively separates spatial mixing from channel-wise mixing of information. The Convolution Mixer reduces redundancy in spatial and channel features while enlarging the receptive field. Then, to exploit the global modeling ability of self-attention, we construct a module that aggregates the Convolution Mixer with self-attention. The module concentrates the dominant computational complexity (quadratic in the channel size) in its first stage, while the shift and summation operations in the second stage are lightweight. Finally, to alleviate overfitting of the backbone network during training, a dropout layer is added at the end of the module to improve the generalization ability of the model, providing stronger image features for subsequent feature fusion and prediction. The proposed tracker (named CMAT) achieves satisfying tracking performance on ten challenging datasets. In particular, CMAT achieves a 64.1% AUC on LaSOT and a 68.9% AUC on UAV123 while running at 23 frames per second (FPS).
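To make the described design concrete, the following is a minimal PyTorch sketch of a Convolution Mixer (depthwise convolution for spatial mixing, pointwise convolution for channel mixing) aggregated with self-attention and closed by a dropout layer. The class names, kernel size, number of heads, normalization layers, and dropout rate here are illustrative assumptions, not the authors' exact CMAT implementation.

```python
import torch
import torch.nn as nn


class ConvolutionMixer(nn.Module):
    """Separates spatial mixing (depthwise conv) from channel mixing (pointwise conv).

    Illustrative sketch only; the layer sizes and normalization are assumptions,
    not the exact CMAT configuration.
    """

    def __init__(self, dim: int, kernel_size: int = 7):
        super().__init__()
        # Depthwise convolution: one filter per channel -> spatial mixing only.
        self.depthwise = nn.Conv2d(dim, dim, kernel_size,
                                   padding=kernel_size // 2, groups=dim)
        self.norm1 = nn.BatchNorm2d(dim)
        # Pointwise (1x1) convolution: mixes information across channels only.
        self.pointwise = nn.Conv2d(dim, dim, kernel_size=1)
        self.norm2 = nn.BatchNorm2d(dim)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.act(self.norm1(self.depthwise(x)))  # residual spatial mixing
        return self.act(self.norm2(self.pointwise(x)))   # channel-wise mixing


class MixerAttentionBlock(nn.Module):
    """Aggregates the Convolution Mixer with multi-head self-attention and ends
    with dropout, loosely mirroring the module described in the abstract."""

    def __init__(self, dim: int, num_heads: int = 4, drop: float = 0.1):
        super().__init__()
        self.mixer = ConvolutionMixer(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.dropout = nn.Dropout(drop)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) feature map from the backbone stem.
        x = self.mixer(x)                                # local, low-redundancy mixing
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)            # (B, H*W, C) patch tokens
        attended, _ = self.attn(tokens, tokens, tokens)  # global dependency modeling
        tokens = tokens + attended                       # sum local and global paths
        tokens = self.dropout(tokens)                    # regularization vs. overfitting
        return tokens.transpose(1, 2).reshape(b, c, h, w)


if __name__ == "__main__":
    block = MixerAttentionBlock(dim=64)
    feat = torch.randn(2, 64, 32, 32)
    print(block(feat).shape)  # torch.Size([2, 64, 32, 32])
```

The sketch keeps the two design ideas from the abstract separable: local spatial/channel mixing happens entirely inside `ConvolutionMixer`, and global dependencies are added afterwards by self-attention over the flattened patch tokens.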