Abstract

Convolutional Neural Networks (CNNs) and Transformers are two powerful representation learning techniques for visual tracking. CNNs effectively reduce local redundancy via small-neighborhood convolution operations, but their limited receptive fields make it difficult to capture global dependencies. Self-attention in Transformers takes patches as the input representation and can effectively capture long-range dependencies; however, blind similarity comparisons between all patches lead to high redundancy. Is there a technique that effectively combines the advantages of both paradigms for visual tracking? In this work, we design a novel backbone network for feature extraction. First, we use depthwise convolution and pointwise convolution to build a Convolution Mixer, which separates spatial mixing of information from channel-wise mixing. The Convolution Mixer reduces redundancy in spatial and channel features while enlarging the receptive field. Then, to exploit the global modeling ability of self-attention, we construct a module that aggregates the Convolution Mixer with self-attention. The module shares the dominant computational cost (quadratic in the channel size) in its first stage, while the shift and summation operations in the second stage are lightweight. Finally, to alleviate overfitting of the backbone network during training, a dropout layer is added at the end of the module to improve the generalization ability of the model, providing stronger image features for subsequent feature fusion and prediction. The proposed tracker (named CMAT) achieves satisfying tracking performance on ten challenging datasets. In particular, CMAT achieves a 64.1% AUC on LaSOT and a 68.9% AUC on UAV123 while running at 23 frames per second (FPS).
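
To make the described design concrete, the sketch below gives one plausible PyTorch reading of the abstract: a Convolution Mixer built from a depthwise convolution (spatial mixing) and a pointwise convolution (channel mixing), aggregated with multi-head self-attention and followed by dropout. All module and parameter names here (ConvMixer, ConvAttnBlock, kernel_size, dropout rate) are illustrative assumptions rather than the paper's actual implementation, and the paper's staged complexity sharing and shift operations are only loosely approximated by the residual summation shown.

```python
import torch
import torch.nn as nn


class ConvMixer(nn.Module):
    """Depthwise + pointwise convolution: spatial mixing separated from channel-wise mixing."""

    def __init__(self, dim, kernel_size=3):
        super().__init__()
        # Depthwise convolution mixes information spatially within each channel.
        self.depthwise = nn.Conv2d(dim, dim, kernel_size,
                                   padding=kernel_size // 2, groups=dim)
        # Pointwise (1x1) convolution mixes information across channels.
        self.pointwise = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))


class ConvAttnBlock(nn.Module):
    """Hypothetical aggregation of the Convolution Mixer with self-attention,
    ending in dropout to reduce overfitting (names and composition are assumptions)."""

    def __init__(self, dim, num_heads=8, dropout=0.1):
        super().__init__()
        self.conv_mixer = ConvMixer(dim)
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):                                   # x: (B, C, H, W)
        b, c, h, w = x.shape
        local = self.conv_mixer(x)                          # low-redundancy local features
        tokens = local.flatten(2).transpose(1, 2)           # (B, H*W, C) patch tokens
        tokens = self.norm(tokens)
        global_feat, _ = self.attn(tokens, tokens, tokens)  # long-range dependencies
        out = self.dropout(tokens + global_feat)            # sum local and global paths, then dropout
        return out.transpose(1, 2).reshape(b, c, h, w)


# Usage example on a dummy feature map.
if __name__ == "__main__":
    block = ConvAttnBlock(dim=64)
    feat = torch.randn(2, 64, 16, 16)
    print(block(feat).shape)  # torch.Size([2, 64, 16, 16])
```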
